Fork me on GitHub

SQL中的CUBE、ROLLUP、GROUPING用法

在SQL语法中,经常用到GROUP BY来做多维度的聚合。但是遇到多个维度并列聚合的方式,通常是将每个维度用GROUP BY统计后,再使用UNION语法将结果集汇总,但是这样的SQL执行计划会在数据INPUT端从存储引擎获取多次,导致重复获取数据,浪费机器资源。CUBE、ROLLUP、GROUPING的语法可以更高效的做到多维度的聚合

SoftWare Version
Presto 0.149
Hive 1.1.0-cdh5.11.1
Java 1.8.0_121

此文会以Presto On Hive 的方式演示实例,Mysql不同版本支持不同,Oracle、Sql Server的语法方式略有差异

数据准备

建测试表

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
CREATE TABLE `wxm_test_fun`(
`id` int COMMENT 'null',
`name` string COMMENT 'null',
`area_id` int COMMENT 'null',
`group_id` int COMMENT 'null')
COMMENT 'wxm_test_fun'
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'='\t',
'serialization.format'='\t')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

准备数据

1
2
3
4
5
6
7
8
9
10
1	wxm1	000	111
1 wxm2 001 112
2 wxm1 000 111
2 wxm2 001 112
3 wxm3 000 111
3 wxm3 001 112
4 wxm4 000 111
1 wxm4 000 112
4 wxm4 002 111
5 wxm1 003 113

说明聚合维度

四个维度:id(0),name(1),area_id(2),group_id(3)


CUBE

cube会对所有聚合可能进行计算:CUBE(A,B,C),会计算group by A union group by B group by C union group by (AB) union group by (AC) union group by (BC) union group by (BC) union all (ABC)

分组的次数=2ⁿ-1;n为待分组的字段个数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
presto:temp> select id,name,area_id,count(*) count from wxm_test_fun group by cube(id,name,area_id) order by id,name,area_id;

id | name | area_id | count
------+------+---------+-------
1 | wxm1 | 0 | 1
1 | wxm1 | NULL | 1
1 | wxm2 | 1 | 1
1 | wxm2 | NULL | 1
1 | wxm4 | 0 | 1
1 | wxm4 | NULL | 1
1 | NULL | 0 | 2
1 | NULL | 1 | 1
1 | NULL | NULL | 3
2 | wxm1 | 0 | 1
2 | wxm1 | NULL | 1
2 | wxm2 | 1 | 1
2 | wxm2 | NULL | 1
2 | NULL | 0 | 1
2 | NULL | 1 | 1
2 | NULL | NULL | 2
3 | wxm3 | 0 | 1
3 | wxm3 | 1 | 1
3 | wxm3 | NULL | 2
3 | NULL | 0 | 1
3 | NULL | 1 | 1
3 | NULL | NULL | 2
4 | wxm4 | 0 | 1
4 | wxm4 | 2 | 1
4 | wxm4 | NULL | 2
4 | NULL | 0 | 1
4 | NULL | 2 | 1
4 | NULL | NULL | 2
5 | wxm1 | 3 | 1
5 | wxm1 | NULL | 1
5 | NULL | 3 | 1
5 | NULL | NULL | 1
NULL | wxm1 | 0 | 2
NULL | wxm1 | 3 | 1
NULL | wxm1 | NULL | 3
NULL | wxm2 | 1 | 2
NULL | wxm2 | NULL | 2
NULL | wxm3 | 0 | 1
NULL | wxm3 | 1 | 1
NULL | wxm3 | NULL | 2
NULL | wxm4 | 0 | 2
NULL | wxm4 | 2 | 1
NULL | wxm4 | NULL | 3
NULL | NULL | 0 | 5
NULL | NULL | 1 | 3
NULL | NULL | 2 | 1
NULL | NULL | 3 | 1
NULL | NULL | NULL | 10
(48 rows)

Query 20190312_082310_01544_wwvpi, FINISHED, 2 nodes
Splits: 4 total, 4 done (100.00%)
0:00 [10 rows, 158B] [23 rows/s, 374B/s]

2的3次方减1 = 8个维度的整合

  • 等价于8个union
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
presto:temp> select * from (
-> select id,null name,null area_id,count(*) count from wxm_test_fun group by id
-> union
-> select null,name,null,count(*) count from wxm_test_fun group by name
-> union
-> select null,null,area_id,count(*) count from wxm_test_fun group by area_id
-> union
-> select id,name,area_id,count(*) count from wxm_test_fun group by id,name,area_id
-> union
-> select id,name,null,count(*) count from wxm_test_fun group by id,name
-> union
-> select null,name,area_id,count(*) count from wxm_test_fun group by name,area_id
-> union
-> select id,null,area_id,count(*) count from wxm_test_fun group by id,area_id
-> union
-> select null,null,null,count(*) count from wxm_test_fun) order by id,name,area_id;
id | name | area_id | count
------+------+---------+-------
1 | wxm1 | 0 | 1
1 | wxm1 | NULL | 1
1 | wxm2 | 1 | 1
1 | wxm2 | NULL | 1
1 | wxm4 | 0 | 1
1 | wxm4 | NULL | 1
1 | NULL | 0 | 2
1 | NULL | 1 | 1
1 | NULL | NULL | 3
2 | wxm1 | 0 | 1
2 | wxm1 | NULL | 1
2 | wxm2 | 1 | 1
2 | wxm2 | NULL | 1
2 | NULL | 0 | 1
2 | NULL | 1 | 1
2 | NULL | NULL | 2
3 | wxm3 | 0 | 1
3 | wxm3 | 1 | 1
3 | wxm3 | NULL | 2
3 | NULL | 0 | 1
3 | NULL | 1 | 1
3 | NULL | NULL | 2
4 | wxm4 | 0 | 1
4 | wxm4 | 2 | 1
4 | wxm4 | NULL | 2
4 | NULL | 0 | 1
4 | NULL | 2 | 1
4 | NULL | NULL | 2
5 | wxm1 | 3 | 1
5 | wxm1 | NULL | 1
5 | NULL | 3 | 1
5 | NULL | NULL | 1
NULL | wxm1 | 0 | 2
NULL | wxm1 | 3 | 1
NULL | wxm1 | NULL | 3
NULL | wxm2 | 1 | 2
NULL | wxm2 | NULL | 2
NULL | wxm3 | 0 | 1
NULL | wxm3 | 1 | 1
NULL | wxm3 | NULL | 2
NULL | wxm4 | 0 | 2
NULL | wxm4 | 2 | 1
NULL | wxm4 | NULL | 3
NULL | NULL | 0 | 5
NULL | NULL | 1 | 3
NULL | NULL | 2 | 1
NULL | NULL | 3 | 1
NULL | NULL | NULL | 10
(48 rows)

Query 20190312_082943_01568_wwvpi, FINISHED, 2 nodes
Splits: 42 total, 42 done (100.00%)
0:01 [80 rows, 1.23KB] [101 rows/s, 1.57KB/s]

ROLLUP

与CUBE不同,ROLLUP仅仅只展开第一层的维度聚合:分组的次数=待分组的字段数+1

cube会对所有聚合可能进行计算:CUBE(A,B,C),会计算group by y (ABC) union group by (AB) group by C union group by NULL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
presto:temp>  select name,area_id,group_id,count(*) count from wxm_test_fun group by rollup(name,area_id,group_id) order by name,area_id,group_id,count(*);

name | area_id | group_id | count
------+---------+----------+-------
wxm1 | 0 | 111 | 2
wxm1 | 0 | NULL | 2
wxm1 | 3 | 113 | 1
wxm1 | 3 | NULL | 1
wxm1 | NULL | NULL | 3
wxm2 | 1 | 112 | 2
wxm2 | 1 | NULL | 2
wxm2 | NULL | NULL | 2
wxm3 | 0 | 111 | 1
wxm3 | 0 | NULL | 1
wxm3 | 1 | 112 | 1
wxm3 | 1 | NULL | 1
wxm3 | NULL | NULL | 2
wxm4 | 0 | 111 | 1
wxm4 | 0 | 112 | 1
wxm4 | 0 | NULL | 2
wxm4 | 2 | 111 | 1
wxm4 | 2 | NULL | 1
wxm4 | NULL | NULL | 3
NULL | NULL | NULL | 10
(20 rows)

Query 20190312_085457_01652_wwvpi, FINISHED, 2 nodes
Splits: 4 total, 4 done (100.00%)
0:00 [10 rows, 158B] [29 rows/s, 458B/s]
  • 等价于3 + 1 = 4个union聚合
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
presto:temp> select * from (
-> select name,null area_id,null group_id,count(*) count from wxm_test_fun group by name
-> union
-> select name,area_id,null group_id,count(*) count from wxm_test_fun group by name,area_id
-> union
-> select name,area_id,group_id,count(*) count from wxm_test_fun group by name,area_id,group_id
-> union
-> select null name,null area_id,null group_id,count(*) count from wxm_test_fun) order by name,area_id,group_id;

name | area_id | group_id | count
------+---------+----------+-------
wxm1 | 0 | 111 | 2
wxm1 | 0 | NULL | 2
wxm1 | 3 | 113 | 1
wxm1 | 3 | NULL | 1
wxm1 | NULL | NULL | 3
wxm2 | 1 | 112 | 2
wxm2 | 1 | NULL | 2
wxm2 | NULL | NULL | 2
wxm3 | 0 | 111 | 1
wxm3 | 0 | NULL | 1
wxm3 | 1 | 112 | 1
wxm3 | 1 | NULL | 1
wxm3 | NULL | NULL | 2
wxm4 | 0 | 111 | 1
wxm4 | 0 | 112 | 1
wxm4 | 0 | NULL | 2
wxm4 | 2 | 111 | 1
wxm4 | 2 | NULL | 1
wxm4 | NULL | NULL | 3
NULL | NULL | NULL | 10
(20 rows)

Query 20190312_090034_01674_wwvpi, FINISHED, 2 nodes
Splits: 22 total, 22 done (100.00%)
0:01 [40 rows, 632B] [50 rows/s, 795B/s]

GROUPING SETS

对任意指定分组进行聚合

GROUPING SETS((AB),(A),(B),()),仅仅group by (AB) union group by A union group by B union group by null

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
presto:temp> select id,name,area_id,count(*) count from wxm_test_fun group by grouping sets((id,name),name,area_id,()) order by id,name,area_id;

id | name | area_id | count
------+------+---------+-------
1 | wxm1 | NULL | 1
1 | wxm2 | NULL | 1
1 | wxm4 | NULL | 1
2 | wxm1 | NULL | 1
2 | wxm2 | NULL | 1
3 | wxm3 | NULL | 2
4 | wxm4 | NULL | 2
5 | wxm1 | NULL | 1
NULL | wxm1 | NULL | 3
NULL | wxm2 | NULL | 2
NULL | wxm3 | NULL | 2
NULL | wxm4 | NULL | 3
NULL | NULL | 0 | 5
NULL | NULL | 1 | 3
NULL | NULL | 2 | 1
NULL | NULL | 3 | 1
NULL | NULL | NULL | 10
(17 rows)

Query 20190312_091338_01714_wwvpi, FINISHED, 2 nodes
Splits: 4 total, 4 done (100.00%)
0:00 [10 rows, 158B] [30 rows/s, 485B/s]
  • 等价于grouping sets中的字段union group by
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
presto:temp> select * from (
-> select id,name,null area_id,count(*) count from wxm_test_fun group by id,name
-> union
-> select null id,name,null area_id,count(*) count from wxm_test_fun group by name
-> union
-> select null id,null name,area_id,count(*) count from wxm_test_fun group by area_id
-> union
-> select null id,null name,null area_id,count(*) count from wxm_test_fun) order by id,name,area_id;

id | name | area_id | count
------+------+---------+-------
1 | wxm1 | NULL | 1
1 | wxm2 | NULL | 1
1 | wxm4 | NULL | 1
2 | wxm1 | NULL | 1
2 | wxm2 | NULL | 1
3 | wxm3 | NULL | 2
4 | wxm4 | NULL | 2
5 | wxm1 | NULL | 1
NULL | wxm1 | NULL | 3
NULL | wxm2 | NULL | 2
NULL | wxm3 | NULL | 2
NULL | wxm4 | NULL | 3
NULL | NULL | 0 | 5
NULL | NULL | 1 | 3
NULL | NULL | 2 | 1
NULL | NULL | 3 | 1
NULL | NULL | NULL | 10
(17 rows)

Query 20190312_091542_01721_wwvpi, FINISHED, 2 nodes
Splits: 22 total, 22 done (100.00%)
0:00 [40 rows, 632B] [85 rows/s, 1.33KB/s]

维度信息获取

通常情况下,为了group by 的灵活性,通常会选用grouping sets,这就会出现一个维度标志问题。并不知道这个聚合数据哪个维度,通常可以用case when语句解决

  • 以此句为例
1
select id,name,area_id,count(*) count from wxm_test_fun group by grouping sets((id,name),name,area_id,()) order by id,name,area_id;

(id,name)维度为维度01,name为维度02,area_id为维度03,()为全局维度

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
presto:temp> select 
-> case when id is not null and name is not null and area_id is null then '维度01'
-> when id is null and name is not null and area_id is null then '维度02'
-> when id is null and name is null and area_id is not null then '维度03'
-> when id is null and name is null and area_id is null then '全局维度'
-> end
-> agg_code,id,name,area_id,count(*) count from wxm_test_fun group by grouping sets((id,name),name,area_id,()) order by id,name,area_id;

agg_code | id | name | area_id | count
----------+------+------+---------+-------
维度01 | 1 | wxm1 | NULL | 1
维度01 | 1 | wxm2 | NULL | 1
维度01 | 1 | wxm4 | NULL | 1
维度01 | 2 | wxm1 | NULL | 1
维度01 | 2 | wxm2 | NULL | 1
维度01 | 3 | wxm3 | NULL | 2
维度01 | 4 | wxm4 | NULL | 2
维度01 | 5 | wxm1 | NULL | 1
维度02 | NULL | wxm1 | NULL | 3
维度02 | NULL | wxm2 | NULL | 2
维度02 | NULL | wxm3 | NULL | 2
维度02 | NULL | wxm4 | NULL | 3
维度03 | NULL | NULL | 0 | 5
维度03 | NULL | NULL | 1 | 3
维度03 | NULL | NULL | 2 | 1
维度03 | NULL | NULL | 3 | 1
全局维度 | NULL | NULL | NULL | 10
(17 rows)

Query 20190312_092301_01746_wwvpi, FINISHED, 2 nodes
Splits: 4 total, 4 done (100.00%)
0:01 [10 rows, 158B] [15 rows/s, 242B/s]

转载请注明出处:https://github.com/imperio-wxm


Thank you for your support.