Hive における Window関数 を真面目に理解する - 分析関数編

Aug 9, 2018 00:00 · 4310 words · 9 minutes read #hive

はじめに

まえおき

Hive における Window関数 を真面目に理解する - 集約関数編 の第2弾となります!基本的なところは第1段の記事にまとめてありますので、そちらを先に参照頂けますと幸いです。

公式ドキュメント

LanguageManual WindowingAndAnalytics - Apache Hive

検証環境

EMR を起動して Hive 2.3.3 で検証を実施していきます。

検証に利用するサンプルテーブル

挙動がわかりやすいようにシンプルなテーブルを用意しました。

> select * from students;
+--------------+----------------+-----------------+-----------------+
| students.id  | students.name  | students.class  | students.score  |
+--------------+----------------+-----------------+-----------------+
| 1            | taro           | a               | 45              |
| 2            | keita          | b               | 30              |
| 3            | saori          | c               | 90              |
| 4            | manami         | a               | 100             |
| 5            | takeshi        | c               | 75              |
| 6            | shinya         | a               | 80              |
| 7            | yui            | b               | 55              |
| 8            | kenta          | c               | 75              |
| 9            | asami          | b               | 65              |
| 10           | takaya         | a               | 10              |
+--------------+----------------+-----------------+-----------------+

分析関数編 row_number(), rank(), *_rank(), cume_dist(), ntile()

row_number() で行番号を振る

名前で並び替えられて行番号が振られます。

> select *, row_number() over(order by name) as row_number from students;
+--------------+----------------+-----------------+-----------------+-------------+
| students.id  | students.name  | students.class  | students.score  | row_number  |
+--------------+----------------+-----------------+-----------------+-------------+
| 9            | asami          | b               | 65              | 1           |
| 2            | keita          | b               | 30              | 2           |
| 8            | kenta          | c               | 75              | 3           |
| 4            | manami         | a               | 100             | 4           |
| 3            | saori          | c               | 90              | 5           |
| 6            | shinya         | a               | 80              | 6           |
| 10           | takaya         | a               | 10              | 7           |
| 5            | takeshi        | c               | 75              | 8           |
| 1            | taro           | a               | 45              | 9           |
| 7            | yui            | b               | 55              | 10          |
+--------------+----------------+-----------------+-----------------+-------------+

class 毎に名前で並び替えられて行番号が振られます。

> select *, row_number() over(partition by class order by name) as class_row_number from students;
+--------------+----------------+-----------------+-----------------+-------------------+
| students.id  | students.name  | students.class  | students.score  | class_row_number  |
+--------------+----------------+-----------------+-----------------+-------------------+
| 4            | manami         | a               | 100             | 1                 |
| 6            | shinya         | a               | 80              | 2                 |
| 10           | takaya         | a               | 10              | 3                 |
| 1            | taro           | a               | 45              | 4                 |
| 9            | asami          | b               | 65              | 1                 |
| 2            | keita          | b               | 30              | 2                 |
| 7            | yui            | b               | 55              | 3                 |
| 8            | kenta          | c               | 75              | 1                 |
| 3            | saori          | c               | 90              | 2                 |
| 5            | takeshi        | c               | 75              | 3                 |
+--------------+----------------+-----------------+-----------------+-------------------+

rank() でランキング(同率で番号を飛ばす)を求める

※ 同率で番号を飛ばす = 100,90,90,80 という score でランキングする場合 1,2,2,4 となる

全体で score の高い順にランキング(同率で番号を飛ばす)

> select *, rank() over(order by score desc) as rank from students;
+--------------+----------------+-----------------+-----------------+-------+
| students.id  | students.name  | students.class  | students.score  | rank  |
+--------------+----------------+-----------------+-----------------+-------+
| 4            | manami         | a               | 100             | 1     |
| 3            | saori          | c               | 90              | 2     |
| 6            | shinya         | a               | 80              | 3     |
| 5            | takeshi        | c               | 75              | 4     |
| 8            | kenta          | c               | 75              | 4     |
| 9            | asami          | b               | 65              | 6     |
| 7            | yui            | b               | 55              | 7     |
| 1            | taro           | a               | 45              | 8     |
| 2            | keita          | b               | 30              | 9     |
| 10           | takaya         | a               | 10              | 10    |
+--------------+----------------+-----------------+-----------------+-------+

class 毎にランキング(同率で番号を飛ばす)

> select *, rank() over(partition by class order by score desc) as class_rank from students;
+--------------+----------------+-----------------+-----------------+-------------+
| students.id  | students.name  | students.class  | students.score  | class_rank  |
+--------------+----------------+-----------------+-----------------+-------------+
| 4            | manami         | a               | 100             | 1           |
| 6            | shinya         | a               | 80              | 2           |
| 1            | taro           | a               | 45              | 3           |
| 10           | takaya         | a               | 10              | 4           |
| 9            | asami          | b               | 65              | 1           |
| 7            | yui            | b               | 55              | 2           |
| 2            | keita          | b               | 30              | 3           |
| 3            | saori          | c               | 90              | 1           |
| 5            | takeshi        | c               | 75              | 2           |
| 8            | kenta          | c               | 75              | 2           |
+--------------+----------------+-----------------+-----------------+-------------+

dense_rank() でランキング(同率で番号を飛ばさない)を求める

※ 同率で番号を飛ばさない = 100,90,90,80 という score でランキングする場合 1,2,2,3 となる

全体で score の高い順にランキング(同率で番号を飛ばさない)

> select *, dense_rank() over(order by score desc) as dense_rank from students;
+--------------+----------------+-----------------+-----------------+-------------+
| students.id  | students.name  | students.class  | students.score  | dense_rank  |
+--------------+----------------+-----------------+-----------------+-------------+
| 4            | manami         | a               | 100             | 1           |
| 3            | saori          | c               | 90              | 2           |
| 6            | shinya         | a               | 80              | 3           |
| 5            | takeshi        | c               | 75              | 4           |
| 8            | kenta          | c               | 75              | 4           |
| 9            | asami          | b               | 65              | 5           |
| 7            | yui            | b               | 55              | 6           |
| 1            | taro           | a               | 45              | 7           |
| 2            | keita          | b               | 30              | 8           |
| 10           | takaya         | a               | 10              | 9           |
+--------------+----------------+-----------------+-----------------+-------------+

class 毎にランキング(同率で番号を飛ばさない)

> select *, dense_rank() over(partition by class order by score desc) as class_dense_rank from students;
+--------------+----------------+-----------------+-----------------+-------------------+
| students.id  | students.name  | students.class  | students.score  | class_dense_rank  |
+--------------+----------------+-----------------+-----------------+-------------------+
| 4            | manami         | a               | 100             | 1                 |
| 6            | shinya         | a               | 80              | 2                 |
| 1            | taro           | a               | 45              | 3                 |
| 10           | takaya         | a               | 10              | 4                 |
| 9            | asami          | b               | 65              | 1                 |
| 7            | yui            | b               | 55              | 2                 |
| 2            | keita          | b               | 30              | 3                 |
| 3            | saori          | c               | 90              | 1                 |
| 5            | takeshi        | c               | 75              | 2                 |
| 8            | kenta          | c               | 75              | 2                 |
+--------------+----------------+-----------------+-----------------+-------------------+

percent_rank() でランキング(%で表示)を求める

※ %で表示 =(rank - 1)/(全行数 - 1)

全体で score の高い順にランキング(%で表示)

> select *, percent_rank() over(order by score desc) as percent_rank from students;
+--------------+----------------+-----------------+-----------------+---------------------+
| students.id  | students.name  | students.class  | students.score  |    percent_rank     |
+--------------+----------------+-----------------+-----------------+---------------------+
| 4            | manami         | a               | 100             | 0.0                 |
| 3            | saori          | c               | 90              | 0.1111111111111111  |
| 6            | shinya         | a               | 80              | 0.2222222222222222  |
| 5            | takeshi        | c               | 75              | 0.3333333333333333  |
| 8            | kenta          | c               | 75              | 0.3333333333333333  |
| 9            | asami          | b               | 65              | 0.5555555555555556  |
| 7            | yui            | b               | 55              | 0.6666666666666666  |
| 1            | taro           | a               | 45              | 0.7777777777777778  |
| 2            | keita          | b               | 30              | 0.8888888888888888  |
| 10           | takaya         | a               | 10              | 1.0                 |
+--------------+----------------+-----------------+-----------------+---------------------+

class 毎にランキング(%で表示)

> select *, percent_rank() over(partition by class order by score desc) as class_percent_rank from students;
+--------------+----------------+-----------------+-----------------+---------------------+
| students.id  | students.name  | students.class  | students.score  | class_percent_rank  |
+--------------+----------------+-----------------+-----------------+---------------------+
| 4            | manami         | a               | 100             | 0.0                 |
| 6            | shinya         | a               | 80              | 0.3333333333333333  |
| 1            | taro           | a               | 45              | 0.6666666666666666  |
| 10           | takaya         | a               | 10              | 1.0                 |
| 9            | asami          | b               | 65              | 0.0                 |
| 7            | yui            | b               | 55              | 0.5                 |
| 2            | keita          | b               | 30              | 1.0                 |
| 3            | saori          | c               | 90              | 0.0                 |
| 5            | takeshi        | c               | 75              | 0.5                 |
| 8            | kenta          | c               | 75              | 0.5                 |
+--------------+----------------+-----------------+-----------------+---------------------+

cume_dist() で値の相対位置を求める

※ 値の相対位置 = percent_rank に類似 :(現在の行の位置)/(全行数)

全体で score の高い順に値の相対位置を求める

> select *, cume_dist() over(order by score desc) as cume_dist from students;
+--------------+----------------+-----------------+-----------------+------------+
| students.id  | students.name  | students.class  | students.score  | cume_dist  |
+--------------+----------------+-----------------+-----------------+------------+
| 4            | manami         | a               | 100             | 0.1        |
| 3            | saori          | c               | 90              | 0.2        |
| 6            | shinya         | a               | 80              | 0.3        |
| 5            | takeshi        | c               | 75              | 0.5        |
| 8            | kenta          | c               | 75              | 0.5        |
| 9            | asami          | b               | 65              | 0.6        |
| 7            | yui            | b               | 55              | 0.7        |
| 1            | taro           | a               | 45              | 0.8        |
| 2            | keita          | b               | 30              | 0.9        |
| 10           | takaya         | a               | 10              | 1.0        |
+--------------+----------------+-----------------+-----------------+------------+

class 毎に値の相対位置を求める

> select *, cume_dist() over(partition by class order by score desc) as class_cume_dist from students;
+--------------+----------------+-----------------+-----------------+---------------------+
| students.id  | students.name  | students.class  | students.score  |   class_cume_dist   |
+--------------+----------------+-----------------+-----------------+---------------------+
| 4            | manami         | a               | 100             | 0.25                |
| 6            | shinya         | a               | 80              | 0.5                 |
| 1            | taro           | a               | 45              | 0.75                |
| 10           | takaya         | a               | 10              | 1.0                 |
| 9            | asami          | b               | 65              | 0.3333333333333333  |
| 7            | yui            | b               | 55              | 0.6666666666666666  |
| 2            | keita          | b               | 30              | 1.0                 |
| 3            | saori          | c               | 90              | 0.3333333333333333  |
| 5            | takeshi        | c               | 75              | 1.0                 |
| 8            | kenta          | c               | 75              | 1.0                 |
+--------------+----------------+-----------------+-----------------+---------------------+

ntile() で数値を用いてレコードをグルーピング

全体で score の高い順に数値を用いてレコードを3つにグルーピング

> select *, ntile(3) over(order by score desc) as ntile from students;
+--------------+----------------+-----------------+-----------------+--------+
| students.id  | students.name  | students.class  | students.score  | ntile  |
+--------------+----------------+-----------------+-----------------+--------+
| 4            | manami         | a               | 100             | 1      |
| 3            | saori          | c               | 90              | 1      |
| 6            | shinya         | a               | 80              | 1      |
| 5            | takeshi        | c               | 75              | 1      |
| 8            | kenta          | c               | 75              | 2      |
| 9            | asami          | b               | 65              | 2      |
| 7            | yui            | b               | 55              | 2      |
| 1            | taro           | a               | 45              | 3      |
| 2            | keita          | b               | 30              | 3      |
| 10           | takaya         | a               | 10              | 3      |
+--------------+----------------+-----------------+-----------------+--------+

class 毎に数値を用いてレコードを2つにグルーピング

> select *, ntile(2) over(partition by class order by score desc) as class_ntile from students;
+--------------+----------------+-----------------+-----------------+--------------+
| students.id  | students.name  | students.class  | students.score  | class_ntile  |
+--------------+----------------+-----------------+-----------------+--------------+
| 4            | manami         | a               | 100             | 1            |
| 6            | shinya         | a               | 80              | 1            |
| 1            | taro           | a               | 45              | 2            |
| 10           | takaya         | a               | 10              | 2            |
| 9            | asami          | b               | 65              | 1            |
| 7            | yui            | b               | 55              | 1            |
| 2            | keita          | b               | 30              | 2            |
| 3            | saori          | c               | 90              | 1            |
| 5            | takeshi        | c               | 75              | 1            |
| 8            | kenta          | c               | 75              | 2            |
+--------------+----------------+-----------------+-----------------+--------------+

その他関数編 lead(), lag(), first_value(), last_value()

lead() で現在のレコードより後ろのレコードの値を取得する

lead([カラム], [何レコード後ろにするか], [値がない場合に何を返すか]) のように値を指定することが可能になっていて、lead(score) とすると、第2引数はデフォルト値 1 が適用され、第3引数はデフォルト値 NULL が適用されます。

-- どちらも結果は同じ
> select *, lead(score) over(order by score desc) as 1row_lead from students;
> select *, lead(score, 1, null) over(order by score desc) as 1row_lead from students;
+--------------+----------------+-----------------+-----------------+------------+
| students.id  | students.name  | students.class  | students.score  | 1row_lead  |
+--------------+----------------+-----------------+-----------------+------------+
| 4            | manami         | a               | 100             | 90         |
| 3            | saori          | c               | 90              | 80         |
| 6            | shinya         | a               | 80              | 75         |
| 5            | takeshi        | c               | 75              | 75         |
| 8            | kenta          | c               | 75              | 65         |
| 9            | asami          | b               | 65              | 55         |
| 7            | yui            | b               | 55              | 45         |
| 1            | taro           | a               | 45              | 30         |
| 2            | keita          | b               | 30              | 10         |
| 10           | takaya         | a               | 10              | NULL       |
+--------------+----------------+-----------------+-----------------+------------+

class 毎に処理するとこんな感じ。

> select *, lead(score) over(partition by class order by score desc) as 1row_lead_per_class from students;
+--------------+----------------+-----------------+-----------------+----------------------+
| students.id  | students.name  | students.class  | students.score  | 1row_lead_per_class  |
+--------------+----------------+-----------------+-----------------+----------------------+
| 4            | manami         | a               | 100             | 80                   |
| 6            | shinya         | a               | 80              | 45                   |
| 1            | taro           | a               | 45              | 10                   |
| 10           | takaya         | a               | 10              | NULL                 |
| 9            | asami          | b               | 65              | 55                   |
| 7            | yui            | b               | 55              | 30                   |
| 2            | keita          | b               | 30              | NULL                 |
| 3            | saori          | c               | 90              | 75                   |
| 5            | takeshi        | c               | 75              | 75                   |
| 8            | kenta          | c               | 75              | NULL                 |
+--------------+----------------+-----------------+-----------------+----------------------+

lag() で現在のレコードより前のレコードの値を取得する

lead() の逆。使い方も同じ。

-- どちらも結果は同じ
> select *, lag(score) over(order by score desc) as 1row_lag from students;
> select *, lag(score, 1, null) over(order by score desc) as 1row_lag from students;
+--------------+----------------+-----------------+-----------------+-----------+
| students.id  | students.name  | students.class  | students.score  | 1row_lag  |
+--------------+----------------+-----------------+-----------------+-----------+
| 4            | manami         | a               | 100             | NULL      |
| 3            | saori          | c               | 90              | 100       |
| 6            | shinya         | a               | 80              | 90        |
| 5            | takeshi        | c               | 75              | 80        |
| 8            | kenta          | c               | 75              | 75        |
| 9            | asami          | b               | 65              | 75        |
| 7            | yui            | b               | 55              | 65        |
| 1            | taro           | a               | 45              | 55        |
| 2            | keita          | b               | 30              | 45        |
| 10           | takaya         | a               | 10              | 30        |
+--------------+----------------+-----------------+-----------------+-----------+

class 毎に処理するとこんな感じ。

> select *, lag(score) over(partition by class order by score desc) as 1row_lag_per_class from students;
+--------------+----------------+-----------------+-----------------+---------------------+
| students.id  | students.name  | students.class  | students.score  | 1row_lag_per_class  |
+--------------+----------------+-----------------+-----------------+---------------------+
| 4            | manami         | a               | 100             | NULL                |
| 6            | shinya         | a               | 80              | 100                 |
| 1            | taro           | a               | 45              | 80                  |
| 10           | takaya         | a               | 10              | 45                  |
| 9            | asami          | b               | 65              | NULL                |
| 7            | yui            | b               | 55              | 65                  |
| 2            | keita          | b               | 30              | 55                  |
| 3            | saori          | c               | 90              | NULL                |
| 5            | takeshi        | c               | 75              | 90                  |
| 8            | kenta          | c               | 75              | 75                  |
+--------------+----------------+-----------------+-----------------+---------------------+

first_value() で最初のレコードの値を返す

first_value([カラム], [NULL のとき値をスキップするかどうか]) のように値を指定することが可能になっていて、first_value(score) とすると、第2引数はデフォルト値 false が適用されるので、最初のレコードの値が NULL でもスキップされることはありません。

-- どちらも結果は同じ
> select *, first_value(score) over(order by score desc) as first_value from students;
> select *, first_value(score, false) over(order by score desc) as first_value from students;
+--------------+----------------+-----------------+-----------------+--------------+
| students.id  | students.name  | students.class  | students.score  | first_value  |
+--------------+----------------+-----------------+-----------------+--------------+
| 4            | manami         | a               | 100             | 100          |
| 3            | saori          | c               | 90              | 100          |
| 6            | shinya         | a               | 80              | 100          |
| 5            | takeshi        | c               | 75              | 100          |
| 8            | kenta          | c               | 75              | 100          |
| 9            | asami          | b               | 65              | 100          |
| 7            | yui            | b               | 55              | 100          |
| 1            | taro           | a               | 45              | 100          |
| 2            | keita          | b               | 30              | 100          |
| 10           | takaya         | a               | 10              | 100          |
+--------------+----------------+-----------------+-----------------+--------------+

class 毎に処理するとこんな感じ。

> select *, first_value(score) over(partition by class order by score desc) as first_value_per_class from students;
+--------------+----------------+-----------------+-----------------+------------------------+
| students.id  | students.name  | students.class  | students.score  | first_value_per_class  |
+--------------+----------------+-----------------+-----------------+------------------------+
| 4            | manami         | a               | 100             | 100                    |
| 6            | shinya         | a               | 80              | 100                    |
| 1            | taro           | a               | 45              | 100                    |
| 10           | takaya         | a               | 10              | 100                    |
| 9            | asami          | b               | 65              | 65                     |
| 7            | yui            | b               | 55              | 65                     |
| 2            | keita          | b               | 30              | 65                     |
| 3            | saori          | c               | 90              | 90                     |
| 5            | takeshi        | c               | 75              | 90                     |
| 8            | kenta          | c               | 75              | 90                     |
+--------------+----------------+-----------------+-----------------+------------------------+

last_value() で最後のレコードの値を返す

first_value() の逆。使い方も同じ。

-- どちらも結果は同じ
> select *, last_value(score) over(order by score desc) as last_value from students;
> select *, last_value(score, false) over(order by score desc) as last_value from students;
+--------------+----------------+-----------------+-----------------+-------------+
| students.id  | students.name  | students.class  | students.score  | last_value  |
+--------------+----------------+-----------------+-----------------+-------------+
| 4            | manami         | a               | 100             | 100         |
| 3            | saori          | c               | 90              | 90          |
| 6            | shinya         | a               | 80              | 80          |
| 5            | takeshi        | c               | 75              | 75          |
| 8            | kenta          | c               | 75              | 75          |
| 9            | asami          | b               | 65              | 65          |
| 7            | yui            | b               | 55              | 55          |
| 1            | taro           | a               | 45              | 45          |
| 2            | keita          | b               | 30              | 30          |
| 10           | takaya         | a               | 10              | 10          |
+--------------+----------------+-----------------+-----------------+-------------+

んんん!?!?!? first_value() と同じ挙動じゃない…。 last_value() が 10 にならない…。

ググってみると issue が挙げられていた → HIVE-9000 LAST_VALUE Window function returns wrong results

そちらの内容に従ってクエリを実行すると意図するレコードを取得することが出来た。

> select *, last_value(score) over(order by score desc range between unbounded preceding and unbounded following) as last_value from students;
+--------------+----------------+-----------------+-----------------+-------------+
| students.id  | students.name  | students.class  | students.score  | last_value  |
+--------------+----------------+-----------------+-----------------+-------------+
| 4            | manami         | a               | 100             | 10          |
| 3            | saori          | c               | 90              | 10          |
| 6            | shinya         | a               | 80              | 10          |
| 5            | takeshi        | c               | 75              | 10          |
| 8            | kenta          | c               | 75              | 10          |
| 9            | asami          | b               | 65              | 10          |
| 7            | yui            | b               | 55              | 10          |
| 1            | taro           | a               | 45              | 10          |
| 2            | keita          | b               | 30              | 10          |
| 10           | takaya         | a               | 10              | 10          |
+--------------+----------------+-----------------+-----------------+-------------+

↑ がなぜ over 構文に追加で値の指定が必要なのかは現時点では理解できていませんが、次なる機会で調べられたらなと思います!

おわりに

集約関数編に続いて、分析関数編もすべて網羅しましたが、rank() や row_number() 以外の関数を使ったケースがないので、今後 ETL 処理用のクエリを書く際に活用できたらと思います。これで Hive における Window関数 を真面目に理解できた(はず)なので、今後も更に Hive力(SQL力)を高めていく所存です。