Hive における Window関数 を真面目に理解する - 分析関数編
Written by @ryysud Aug 9, 2018 00:00 · 4310 words · 9 minutes read
はじめに
まえおき
Hive における Window関数 を真面目に理解する - 集約関数編 の第2弾となります!基本的なところは第1段の記事にまとめてありますので、そちらを先に参照頂けますと幸いです。
公式ドキュメント
LanguageManual WindowingAndAnalytics - Apache Hive
検証環境
EMR を起動して Hive 2.3.3 で検証を実施していきます。
検証に利用するサンプルテーブル
挙動がわかりやすいようにシンプルなテーブルを用意しました。
> select * from students;
+--------------+----------------+-----------------+-----------------+
| students.id | students.name | students.class | students.score |
+--------------+----------------+-----------------+-----------------+
| 1 | taro | a | 45 |
| 2 | keita | b | 30 |
| 3 | saori | c | 90 |
| 4 | manami | a | 100 |
| 5 | takeshi | c | 75 |
| 6 | shinya | a | 80 |
| 7 | yui | b | 55 |
| 8 | kenta | c | 75 |
| 9 | asami | b | 65 |
| 10 | takaya | a | 10 |
+--------------+----------------+-----------------+-----------------+
分析関数編 row_number(), rank(), *_rank(), cume_dist(), ntile()
row_number() で行番号を振る
名前で並び替えられて行番号が振られます。
> select *, row_number() over(order by name) as row_number from students;
+--------------+----------------+-----------------+-----------------+-------------+
| students.id | students.name | students.class | students.score | row_number |
+--------------+----------------+-----------------+-----------------+-------------+
| 9 | asami | b | 65 | 1 |
| 2 | keita | b | 30 | 2 |
| 8 | kenta | c | 75 | 3 |
| 4 | manami | a | 100 | 4 |
| 3 | saori | c | 90 | 5 |
| 6 | shinya | a | 80 | 6 |
| 10 | takaya | a | 10 | 7 |
| 5 | takeshi | c | 75 | 8 |
| 1 | taro | a | 45 | 9 |
| 7 | yui | b | 55 | 10 |
+--------------+----------------+-----------------+-----------------+-------------+
class 毎に名前で並び替えられて行番号が振られます。
> select *, row_number() over(partition by class order by name) as class_row_number from students;
+--------------+----------------+-----------------+-----------------+-------------------+
| students.id | students.name | students.class | students.score | class_row_number |
+--------------+----------------+-----------------+-----------------+-------------------+
| 4 | manami | a | 100 | 1 |
| 6 | shinya | a | 80 | 2 |
| 10 | takaya | a | 10 | 3 |
| 1 | taro | a | 45 | 4 |
| 9 | asami | b | 65 | 1 |
| 2 | keita | b | 30 | 2 |
| 7 | yui | b | 55 | 3 |
| 8 | kenta | c | 75 | 1 |
| 3 | saori | c | 90 | 2 |
| 5 | takeshi | c | 75 | 3 |
+--------------+----------------+-----------------+-----------------+-------------------+
rank() でランキング(同率で番号を飛ばす)を求める
※ 同率で番号を飛ばす = 100,90,90,80 という score でランキングする場合 1,2,2,4 となる
全体で score の高い順にランキング(同率で番号を飛ばす)
> select *, rank() over(order by score desc) as rank from students;
+--------------+----------------+-----------------+-----------------+-------+
| students.id | students.name | students.class | students.score | rank |
+--------------+----------------+-----------------+-----------------+-------+
| 4 | manami | a | 100 | 1 |
| 3 | saori | c | 90 | 2 |
| 6 | shinya | a | 80 | 3 |
| 5 | takeshi | c | 75 | 4 |
| 8 | kenta | c | 75 | 4 |
| 9 | asami | b | 65 | 6 |
| 7 | yui | b | 55 | 7 |
| 1 | taro | a | 45 | 8 |
| 2 | keita | b | 30 | 9 |
| 10 | takaya | a | 10 | 10 |
+--------------+----------------+-----------------+-----------------+-------+
class 毎にランキング(同率で番号を飛ばす)
> select *, rank() over(partition by class order by score desc) as class_rank from students;
+--------------+----------------+-----------------+-----------------+-------------+
| students.id | students.name | students.class | students.score | class_rank |
+--------------+----------------+-----------------+-----------------+-------------+
| 4 | manami | a | 100 | 1 |
| 6 | shinya | a | 80 | 2 |
| 1 | taro | a | 45 | 3 |
| 10 | takaya | a | 10 | 4 |
| 9 | asami | b | 65 | 1 |
| 7 | yui | b | 55 | 2 |
| 2 | keita | b | 30 | 3 |
| 3 | saori | c | 90 | 1 |
| 5 | takeshi | c | 75 | 2 |
| 8 | kenta | c | 75 | 2 |
+--------------+----------------+-----------------+-----------------+-------------+
dense_rank() でランキング(同率で番号を飛ばさない)を求める
※ 同率で番号を飛ばさない = 100,90,90,80 という score でランキングする場合 1,2,2,3 となる
全体で score の高い順にランキング(同率で番号を飛ばさない)
> select *, dense_rank() over(order by score desc) as dense_rank from students;
+--------------+----------------+-----------------+-----------------+-------------+
| students.id | students.name | students.class | students.score | dense_rank |
+--------------+----------------+-----------------+-----------------+-------------+
| 4 | manami | a | 100 | 1 |
| 3 | saori | c | 90 | 2 |
| 6 | shinya | a | 80 | 3 |
| 5 | takeshi | c | 75 | 4 |
| 8 | kenta | c | 75 | 4 |
| 9 | asami | b | 65 | 5 |
| 7 | yui | b | 55 | 6 |
| 1 | taro | a | 45 | 7 |
| 2 | keita | b | 30 | 8 |
| 10 | takaya | a | 10 | 9 |
+--------------+----------------+-----------------+-----------------+-------------+
class 毎にランキング(同率で番号を飛ばさない)
> select *, dense_rank() over(partition by class order by score desc) as class_dense_rank from students;
+--------------+----------------+-----------------+-----------------+-------------------+
| students.id | students.name | students.class | students.score | class_dense_rank |
+--------------+----------------+-----------------+-----------------+-------------------+
| 4 | manami | a | 100 | 1 |
| 6 | shinya | a | 80 | 2 |
| 1 | taro | a | 45 | 3 |
| 10 | takaya | a | 10 | 4 |
| 9 | asami | b | 65 | 1 |
| 7 | yui | b | 55 | 2 |
| 2 | keita | b | 30 | 3 |
| 3 | saori | c | 90 | 1 |
| 5 | takeshi | c | 75 | 2 |
| 8 | kenta | c | 75 | 2 |
+--------------+----------------+-----------------+-----------------+-------------------+
percent_rank() でランキング(%で表示)を求める
※ %で表示 =(rank - 1)/(全行数 - 1)
全体で score の高い順にランキング(%で表示)
> select *, percent_rank() over(order by score desc) as percent_rank from students;
+--------------+----------------+-----------------+-----------------+---------------------+
| students.id | students.name | students.class | students.score | percent_rank |
+--------------+----------------+-----------------+-----------------+---------------------+
| 4 | manami | a | 100 | 0.0 |
| 3 | saori | c | 90 | 0.1111111111111111 |
| 6 | shinya | a | 80 | 0.2222222222222222 |
| 5 | takeshi | c | 75 | 0.3333333333333333 |
| 8 | kenta | c | 75 | 0.3333333333333333 |
| 9 | asami | b | 65 | 0.5555555555555556 |
| 7 | yui | b | 55 | 0.6666666666666666 |
| 1 | taro | a | 45 | 0.7777777777777778 |
| 2 | keita | b | 30 | 0.8888888888888888 |
| 10 | takaya | a | 10 | 1.0 |
+--------------+----------------+-----------------+-----------------+---------------------+
class 毎にランキング(%で表示)
> select *, percent_rank() over(partition by class order by score desc) as class_percent_rank from students;
+--------------+----------------+-----------------+-----------------+---------------------+
| students.id | students.name | students.class | students.score | class_percent_rank |
+--------------+----------------+-----------------+-----------------+---------------------+
| 4 | manami | a | 100 | 0.0 |
| 6 | shinya | a | 80 | 0.3333333333333333 |
| 1 | taro | a | 45 | 0.6666666666666666 |
| 10 | takaya | a | 10 | 1.0 |
| 9 | asami | b | 65 | 0.0 |
| 7 | yui | b | 55 | 0.5 |
| 2 | keita | b | 30 | 1.0 |
| 3 | saori | c | 90 | 0.0 |
| 5 | takeshi | c | 75 | 0.5 |
| 8 | kenta | c | 75 | 0.5 |
+--------------+----------------+-----------------+-----------------+---------------------+
cume_dist() で値の相対位置を求める
※ 値の相対位置 = percent_rank に類似 :(現在の行の位置)/(全行数)
全体で score の高い順に値の相対位置を求める
> select *, cume_dist() over(order by score desc) as cume_dist from students;
+--------------+----------------+-----------------+-----------------+------------+
| students.id | students.name | students.class | students.score | cume_dist |
+--------------+----------------+-----------------+-----------------+------------+
| 4 | manami | a | 100 | 0.1 |
| 3 | saori | c | 90 | 0.2 |
| 6 | shinya | a | 80 | 0.3 |
| 5 | takeshi | c | 75 | 0.5 |
| 8 | kenta | c | 75 | 0.5 |
| 9 | asami | b | 65 | 0.6 |
| 7 | yui | b | 55 | 0.7 |
| 1 | taro | a | 45 | 0.8 |
| 2 | keita | b | 30 | 0.9 |
| 10 | takaya | a | 10 | 1.0 |
+--------------+----------------+-----------------+-----------------+------------+
class 毎に値の相対位置を求める
> select *, cume_dist() over(partition by class order by score desc) as class_cume_dist from students;
+--------------+----------------+-----------------+-----------------+---------------------+
| students.id | students.name | students.class | students.score | class_cume_dist |
+--------------+----------------+-----------------+-----------------+---------------------+
| 4 | manami | a | 100 | 0.25 |
| 6 | shinya | a | 80 | 0.5 |
| 1 | taro | a | 45 | 0.75 |
| 10 | takaya | a | 10 | 1.0 |
| 9 | asami | b | 65 | 0.3333333333333333 |
| 7 | yui | b | 55 | 0.6666666666666666 |
| 2 | keita | b | 30 | 1.0 |
| 3 | saori | c | 90 | 0.3333333333333333 |
| 5 | takeshi | c | 75 | 1.0 |
| 8 | kenta | c | 75 | 1.0 |
+--------------+----------------+-----------------+-----------------+---------------------+
ntile() で数値を用いてレコードをグルーピング
全体で score の高い順に数値を用いてレコードを3つにグルーピング
> select *, ntile(3) over(order by score desc) as ntile from students;
+--------------+----------------+-----------------+-----------------+--------+
| students.id | students.name | students.class | students.score | ntile |
+--------------+----------------+-----------------+-----------------+--------+
| 4 | manami | a | 100 | 1 |
| 3 | saori | c | 90 | 1 |
| 6 | shinya | a | 80 | 1 |
| 5 | takeshi | c | 75 | 1 |
| 8 | kenta | c | 75 | 2 |
| 9 | asami | b | 65 | 2 |
| 7 | yui | b | 55 | 2 |
| 1 | taro | a | 45 | 3 |
| 2 | keita | b | 30 | 3 |
| 10 | takaya | a | 10 | 3 |
+--------------+----------------+-----------------+-----------------+--------+
class 毎に数値を用いてレコードを2つにグルーピング
> select *, ntile(2) over(partition by class order by score desc) as class_ntile from students;
+--------------+----------------+-----------------+-----------------+--------------+
| students.id | students.name | students.class | students.score | class_ntile |
+--------------+----------------+-----------------+-----------------+--------------+
| 4 | manami | a | 100 | 1 |
| 6 | shinya | a | 80 | 1 |
| 1 | taro | a | 45 | 2 |
| 10 | takaya | a | 10 | 2 |
| 9 | asami | b | 65 | 1 |
| 7 | yui | b | 55 | 1 |
| 2 | keita | b | 30 | 2 |
| 3 | saori | c | 90 | 1 |
| 5 | takeshi | c | 75 | 1 |
| 8 | kenta | c | 75 | 2 |
+--------------+----------------+-----------------+-----------------+--------------+
その他関数編 lead(), lag(), first_value(), last_value()
lead() で現在のレコードより後ろのレコードの値を取得する
lead([カラム], [何レコード後ろにするか], [値がない場合に何を返すか]) のように値を指定することが可能になっていて、lead(score) とすると、第2引数はデフォルト値 1 が適用され、第3引数はデフォルト値 NULL が適用されます。
-- どちらも結果は同じ
> select *, lead(score) over(order by score desc) as 1row_lead from students;
> select *, lead(score, 1, null) over(order by score desc) as 1row_lead from students;
+--------------+----------------+-----------------+-----------------+------------+
| students.id | students.name | students.class | students.score | 1row_lead |
+--------------+----------------+-----------------+-----------------+------------+
| 4 | manami | a | 100 | 90 |
| 3 | saori | c | 90 | 80 |
| 6 | shinya | a | 80 | 75 |
| 5 | takeshi | c | 75 | 75 |
| 8 | kenta | c | 75 | 65 |
| 9 | asami | b | 65 | 55 |
| 7 | yui | b | 55 | 45 |
| 1 | taro | a | 45 | 30 |
| 2 | keita | b | 30 | 10 |
| 10 | takaya | a | 10 | NULL |
+--------------+----------------+-----------------+-----------------+------------+
class 毎に処理するとこんな感じ。
> select *, lead(score) over(partition by class order by score desc) as 1row_lead_per_class from students;
+--------------+----------------+-----------------+-----------------+----------------------+
| students.id | students.name | students.class | students.score | 1row_lead_per_class |
+--------------+----------------+-----------------+-----------------+----------------------+
| 4 | manami | a | 100 | 80 |
| 6 | shinya | a | 80 | 45 |
| 1 | taro | a | 45 | 10 |
| 10 | takaya | a | 10 | NULL |
| 9 | asami | b | 65 | 55 |
| 7 | yui | b | 55 | 30 |
| 2 | keita | b | 30 | NULL |
| 3 | saori | c | 90 | 75 |
| 5 | takeshi | c | 75 | 75 |
| 8 | kenta | c | 75 | NULL |
+--------------+----------------+-----------------+-----------------+----------------------+
lag() で現在のレコードより前のレコードの値を取得する
lead() の逆。使い方も同じ。
-- どちらも結果は同じ
> select *, lag(score) over(order by score desc) as 1row_lag from students;
> select *, lag(score, 1, null) over(order by score desc) as 1row_lag from students;
+--------------+----------------+-----------------+-----------------+-----------+
| students.id | students.name | students.class | students.score | 1row_lag |
+--------------+----------------+-----------------+-----------------+-----------+
| 4 | manami | a | 100 | NULL |
| 3 | saori | c | 90 | 100 |
| 6 | shinya | a | 80 | 90 |
| 5 | takeshi | c | 75 | 80 |
| 8 | kenta | c | 75 | 75 |
| 9 | asami | b | 65 | 75 |
| 7 | yui | b | 55 | 65 |
| 1 | taro | a | 45 | 55 |
| 2 | keita | b | 30 | 45 |
| 10 | takaya | a | 10 | 30 |
+--------------+----------------+-----------------+-----------------+-----------+
class 毎に処理するとこんな感じ。
> select *, lag(score) over(partition by class order by score desc) as 1row_lag_per_class from students;
+--------------+----------------+-----------------+-----------------+---------------------+
| students.id | students.name | students.class | students.score | 1row_lag_per_class |
+--------------+----------------+-----------------+-----------------+---------------------+
| 4 | manami | a | 100 | NULL |
| 6 | shinya | a | 80 | 100 |
| 1 | taro | a | 45 | 80 |
| 10 | takaya | a | 10 | 45 |
| 9 | asami | b | 65 | NULL |
| 7 | yui | b | 55 | 65 |
| 2 | keita | b | 30 | 55 |
| 3 | saori | c | 90 | NULL |
| 5 | takeshi | c | 75 | 90 |
| 8 | kenta | c | 75 | 75 |
+--------------+----------------+-----------------+-----------------+---------------------+
first_value() で最初のレコードの値を返す
first_value([カラム], [NULL のとき値をスキップするかどうか]) のように値を指定することが可能になっていて、first_value(score) とすると、第2引数はデフォルト値 false が適用されるので、最初のレコードの値が NULL でもスキップされることはありません。
-- どちらも結果は同じ
> select *, first_value(score) over(order by score desc) as first_value from students;
> select *, first_value(score, false) over(order by score desc) as first_value from students;
+--------------+----------------+-----------------+-----------------+--------------+
| students.id | students.name | students.class | students.score | first_value |
+--------------+----------------+-----------------+-----------------+--------------+
| 4 | manami | a | 100 | 100 |
| 3 | saori | c | 90 | 100 |
| 6 | shinya | a | 80 | 100 |
| 5 | takeshi | c | 75 | 100 |
| 8 | kenta | c | 75 | 100 |
| 9 | asami | b | 65 | 100 |
| 7 | yui | b | 55 | 100 |
| 1 | taro | a | 45 | 100 |
| 2 | keita | b | 30 | 100 |
| 10 | takaya | a | 10 | 100 |
+--------------+----------------+-----------------+-----------------+--------------+
class 毎に処理するとこんな感じ。
> select *, first_value(score) over(partition by class order by score desc) as first_value_per_class from students;
+--------------+----------------+-----------------+-----------------+------------------------+
| students.id | students.name | students.class | students.score | first_value_per_class |
+--------------+----------------+-----------------+-----------------+------------------------+
| 4 | manami | a | 100 | 100 |
| 6 | shinya | a | 80 | 100 |
| 1 | taro | a | 45 | 100 |
| 10 | takaya | a | 10 | 100 |
| 9 | asami | b | 65 | 65 |
| 7 | yui | b | 55 | 65 |
| 2 | keita | b | 30 | 65 |
| 3 | saori | c | 90 | 90 |
| 5 | takeshi | c | 75 | 90 |
| 8 | kenta | c | 75 | 90 |
+--------------+----------------+-----------------+-----------------+------------------------+
last_value() で最後のレコードの値を返す
first_value() の逆。使い方も同じ。
-- どちらも結果は同じ
> select *, last_value(score) over(order by score desc) as last_value from students;
> select *, last_value(score, false) over(order by score desc) as last_value from students;
+--------------+----------------+-----------------+-----------------+-------------+
| students.id | students.name | students.class | students.score | last_value |
+--------------+----------------+-----------------+-----------------+-------------+
| 4 | manami | a | 100 | 100 |
| 3 | saori | c | 90 | 90 |
| 6 | shinya | a | 80 | 80 |
| 5 | takeshi | c | 75 | 75 |
| 8 | kenta | c | 75 | 75 |
| 9 | asami | b | 65 | 65 |
| 7 | yui | b | 55 | 55 |
| 1 | taro | a | 45 | 45 |
| 2 | keita | b | 30 | 30 |
| 10 | takaya | a | 10 | 10 |
+--------------+----------------+-----------------+-----------------+-------------+
んんん!?!?!? first_value() と同じ挙動じゃない…。 last_value() が 10 にならない…。
ググってみると issue が挙げられていた → HIVE-9000 LAST_VALUE Window function returns wrong results
そちらの内容に従ってクエリを実行すると意図するレコードを取得することが出来た。
> select *, last_value(score) over(order by score desc range between unbounded preceding and unbounded following) as last_value from students;
+--------------+----------------+-----------------+-----------------+-------------+
| students.id | students.name | students.class | students.score | last_value |
+--------------+----------------+-----------------+-----------------+-------------+
| 4 | manami | a | 100 | 10 |
| 3 | saori | c | 90 | 10 |
| 6 | shinya | a | 80 | 10 |
| 5 | takeshi | c | 75 | 10 |
| 8 | kenta | c | 75 | 10 |
| 9 | asami | b | 65 | 10 |
| 7 | yui | b | 55 | 10 |
| 1 | taro | a | 45 | 10 |
| 2 | keita | b | 30 | 10 |
| 10 | takaya | a | 10 | 10 |
+--------------+----------------+-----------------+-----------------+-------------+
↑ がなぜ over 構文に追加で値の指定が必要なのかは現時点では理解できていませんが、次なる機会で調べられたらなと思います!
おわりに
集約関数編に続いて、分析関数編もすべて網羅しましたが、rank() や row_number() 以外の関数を使ったケースがないので、今後 ETL 処理用のクエリを書く際に活用できたらと思います。これで Hive における Window関数 を真面目に理解できた(はず)なので、今後も更に Hive力(SQL力)を高めていく所存です。