
Hive: An Elegant Way to Collect Table (Partition) Size and RowCount

In Hadoop platform operations and monitoring, Hive table statistics are a key basis for judging cluster data quality; table size, row count, and partition count are typically used to track how the cluster's data volume grows or shrinks. This article focuses on Hive table size and row count, and on how to collect them for very large tables without excessive resource consumption.

Two conventional ways to collect these statistics:

  • Method 1:

Size:

// getContentSummary is an instance method; requires a Hadoop Configuration (conf) for the target cluster
long size = FileSystem.get(conf).getContentSummary(new Path(tablePath)).getLength();

RowCount:

select count(*) from tableName;
  • Method 2:

Size:

hadoop fs -du -s tablePath

RowCount (not applicable to ORC or Parquet):

hadoop fs -text tablePath | wc -l

Both conventional approaches have limitations when collecting the RowCount (for example, ORC files cannot be read with hadoop fs -text), and neither is efficient.

Collecting statistics with ANALYZE

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], …)] COMPUTE STATISTICS [noscan];

Since table statistics are normally computed once and recorded somewhere, ANALYZE can display them quickly; the question explored here is which ways of loading data into Hive leave ANALYZE with accurate results.

The tests below cover four common Hive table file formats: TextFile, SequenceFile, OrcFile, and ParquetFile.

Ways of loading data into Hive:

1. Load file to hive table

2. Add partition and mv file to partition path

3. Insert into

Test data file

// data file, count = 20
1 wxmimperio1
2 wxmimperio2
1 wxmimperio1
2 wxmimperio2
1 wxmimperio1
2 wxmimperio2
1 wxmimperio1
2 wxmimperio2
1 wxmimperio1
2 wxmimperio2
1 wxmimperio1
2 wxmimperio2
1 wxmimperio1
2 wxmimperio2
1 wxmimperio1
2 wxmimperio2
1 wxmimperio1
2 wxmimperio2
1 wxmimperio1
2 wxmimperio2

TextFile

// create table 
CREATE TABLE `analyze_text` (`id` string, `name` string) COMMENT 'test' PARTITIONED BY (`part_date` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES (
'field.delim' = '\t',
'serialization.format' = '\t'
) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
    1. Load file to table
hive> LOAD DATA LOCAL INPATH '/home/hadoop/wxm/test.txt' OVERWRITE INTO TABLE analyze_text partition(part_date='2019-02-21');

Loading data to table dw.analyze_text partition (part_date=2019-02-21)
Partition dw.analyze_text{part_date=2019-02-21} stats: [numFiles=1, numRows=0, totalSize=298, rawDataSize=0]
OK
Time taken: 1.035 seconds
hive> ANALYZE TABLE analyze_text PARTITION(part_date='2019-02-21') COMPUTE STATISTICS noscan;

Partition dw.analyze_text{part_date=2019-02-21} stats: [numFiles=1, numRows=0, totalSize=298, rawDataSize=0]
OK
Time taken: 0.463 seconds

The Load operation is effectively an mv that moves the file under the table's directory, so ANALYZE (noscan) can only report numFiles (the number of files) and totalSize (the total partition size).

    2. Add partition and mv file to partition path
// add partition
alter table analyze_text add partition(part_date='2019-02-20');

// cp files
hadoop fs -cp hdfs://sdg/user/hive/warehouse/dw.db/analyze_text/part_date=2019-02-21/* hdfs://sdg/user/hive/warehouse/dw.db/analyze_text/part_date=2019-02-20/
hive> ANALYZE TABLE analyze_text PARTITION(part_date='2019-02-20') COMPUTE STATISTICS noscan;

Partition dw.analyze_text{part_date=2019-02-20} stats: [numFiles=1, totalSize=298]
OK
Time taken: 0.202 seconds

As with the Load case, ANALYZE only reports numFiles (the number of files) and totalSize (the total partition size).

    3. Insert into
hive> insert into table analyze_text partition(part_date='2019-02-19') select `id`,`name` from analyze_text_copy where part_date='2019-02-21';

Query ID = hadoop_20190221143232_fbc19f00-d0af-4278-a644-924c92994a75
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1535945194143_1278, Tracking URL = http://wh-8-211:8088/proxy/application_1535945194143_1278/
Kill Command = /app/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/bin/hadoop job -kill job_1535945194143_1278
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-02-21 14:32:54,087 Stage-1 map = 0%, reduce = 0%
2019-02-21 14:33:01,381 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.3 sec
MapReduce Total cumulative CPU time: 3 seconds 300 msec
Ended Job = job_1535945194143_1278
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://sdg/user/hive/warehouse/dw.db/analyze_text/part_date=2019-02-19/.hive-staging_hive_2019-02-21_14-32-46_291_8677282162206970479-1/-ext-10000
Loading data to table dw.analyze_text partition (part_date=2019-02-19)
Partition dw.analyze_text{part_date=2019-02-19} stats: [numFiles=1, numRows=20, totalSize=280, rawDataSize=260]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.3 sec HDFS Read: 3699 HDFS Write: 373 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 300 msec
OK
Time taken: 20.567 seconds
hive> ANALYZE TABLE analyze_text PARTITION(part_date='2019-02-19') COMPUTE STATISTICS noscan;

Partition dw.analyze_text{part_date=2019-02-19} stats: [numFiles=1, numRows=20, totalSize=280, rawDataSize=260]
OK
Time taken: 0.197 seconds

Because insert into runs through MapReduce, the table statistics are recorded during the computation itself, so numRows and rawDataSize are both populated.

SequenceFile

// create table 
CREATE TABLE `analyze_sequence_test` (`id` string, `name` string) PARTITIONED BY (`part_date` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES (
'field.delim' = '\t',
'serialization.format' = '\t'
) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat';
    1. Load file to table
hive> LOAD DATA LOCAL INPATH '/home/hadoop/wxm/analyze_sequence_test_file' OVERWRITE INTO TABLE analyze_sequence_test partition(part_date='2019-02-21');

Loading data to table dw.analyze_sequence_test partition (part_date=2019-02-21)
Partition dw.analyze_sequence_test{part_date=2019-02-21} stats: [numFiles=1, numRows=0, totalSize=607, rawDataSize=0]
OK
Time taken: 1.525 seconds
hive> ANALYZE TABLE analyze_sequence_test PARTITION(part_date='2019-02-21') COMPUTE STATISTICS noscan;

Partition dw.analyze_sequence_test{part_date=2019-02-21} stats: [numFiles=1, numRows=0, totalSize=607, rawDataSize=0]
OK
Time taken: 0.547 seconds

stats: [numFiles=1, numRows=0, totalSize=607, rawDataSize=0]

    2. Add partition and mv file to partition path
// add partition
alter table analyze_sequence_test add partition(part_date='2019-02-20');

// cp files
hadoop fs -cp hdfs://sdg/user/hive/warehouse/dw.db/analyze_sequence_test/part_date=2019-02-19/* hdfs://sdg/user/hive/warehouse/dw.db/analyze_sequence_test/part_date=2019-02-20/
ANALYZE TABLE analyze_sequence_test PARTITION(part_date='2019-02-20') COMPUTE STATISTICS noscan;

Partition dw.analyze_sequence_test{part_date=2019-02-20} stats: [numFiles=1, totalSize=607]
OK
Time taken: 0.875 seconds

Because a SequenceFile's header stores no row-level information, the row count cannot be obtained.

    3. Insert into
hive>  insert into table analyze_sequence_test partition(part_date='2019-02-19') select `id`,`name` from analyze_text_copy where part_date='2019-02-21';

Query ID = hadoop_20190221144141_f9ffbdc3-9e68-43d4-af7c-b8f883e84d3a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1535945194143_1279, Tracking URL = http://wh-8-211:8088/proxy/application_1535945194143_1279/
Kill Command = /app/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/bin/hadoop job -kill job_1535945194143_1279
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-02-21 14:41:12,146 Stage-1 map = 0%, reduce = 0%
2019-02-21 14:41:18,387 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.94 sec
MapReduce Total cumulative CPU time: 2 seconds 940 msec
Ended Job = job_1535945194143_1279
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://sdg/user/hive/warehouse/dw.db/analyze_sequence_test/part_date=2019-02-19/.hive-staging_hive_2019-02-21_14-41-04_329_3570329183914133571-1/-ext-10000
Loading data to table dw.analyze_sequence_test partition (part_date=2019-02-19)
Partition dw.analyze_sequence_test{part_date=2019-02-19} stats: [numFiles=1, numRows=20, totalSize=607, rawDataSize=260]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.94 sec HDFS Read: 3954 HDFS Write: 709 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 940 msec
OK
Time taken: 20.673 seconds
hive> ANALYZE TABLE analyze_sequence_test PARTITION(part_date='2019-02-19') COMPUTE STATISTICS noscan;

Partition dw.analyze_sequence_test{part_date=2019-02-19} stats: [numFiles=1, numRows=20, totalSize=607, rawDataSize=260]
OK
Time taken: 0.191 seconds

stats: [numFiles=1, numRows=20, totalSize=607, rawDataSize=260]

OrcFile

// create table
CREATE TABLE `analyze_orc_test` (`id` string, `name` string) PARTITIONED BY (`part_date` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
    1. Load file to table
hive> LOAD DATA LOCAL INPATH '/home/hadoop/wxm/analyze_orc_test_file' OVERWRITE INTO TABLE analyze_orc_test partition(part_date='2019-02-21');

Loading data to table dw.analyze_orc_test partition (part_date=2019-02-21)
Partition dw.analyze_orc_test{part_date=2019-02-21} stats: [numFiles=1, numRows=0, totalSize=365, rawDataSize=0]
OK
Time taken: 0.991 seconds
hive> ANALYZE TABLE analyze_orc_test PARTITION(part_date='2019-02-21') COMPUTE STATISTICS noscan;

Partition dw.analyze_orc_test{part_date=2019-02-21} stats: [numFiles=1, numRows=20, totalSize=365, rawDataSize=3600]
OK
Time taken: 0.661 seconds

stats: [numFiles=1, numRows=20, totalSize=365, rawDataSize=3600]

    2. Add partition and mv file to partition path
// add partition
alter table analyze_orc_test add partition(part_date='2019-02-20');

// cp files
hadoop fs -cp hdfs://sdg/user/hive/warehouse/dw.db/analyze_orc_test/part_date=2019-02-19/* hdfs://sdg/user/hive/warehouse/dw.db/analyze_orc_test/part_date=2019-02-20/
hive> ANALYZE TABLE analyze_orc_test PARTITION(part_date='2019-02-20') COMPUTE STATISTICS noscan;

Partition dw.analyze_orc_test{part_date=2019-02-20} stats: [numFiles=1, numRows=20, totalSize=365, rawDataSize=3600]
OK
Time taken: 0.469 seconds

stats: [numFiles=1, numRows=20, totalSize=365, rawDataSize=3600]

    3. Insert into
hive> insert into table analyze_orc_test partition(part_date='2019-02-19') select `id`,`name` from analyze_text_copy where part_date='2019-02-21';

Query ID = hadoop_20190221145353_70e75ca0-210f-48d1-bbd0-b4b0ed01c0cf
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1535945194143_1281, Tracking URL = http://wh-8-211:8088/proxy/application_1535945194143_1281/
Kill Command = /app/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/bin/hadoop job -kill job_1535945194143_1281
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-02-21 14:53:53,124 Stage-1 map = 0%, reduce = 0%
2019-02-21 14:54:00,420 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.6 sec
MapReduce Total cumulative CPU time: 3 seconds 600 msec
Ended Job = job_1535945194143_1281
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://sdg/user/hive/warehouse/dw.db/analyze_orc_test/part_date=2019-02-19/.hive-staging_hive_2019-02-21_14-53-45_113_8780230061527813380-1/-ext-10000
Loading data to table dw.analyze_orc_test partition (part_date=2019-02-19)
Partition dw.analyze_orc_test{part_date=2019-02-19} stats: [numFiles=1, numRows=20, totalSize=365, rawDataSize=3600]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.6 sec HDFS Read: 3947 HDFS Write: 463 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 600 msec
OK
Time taken: 18.592 seconds
hive> ANALYZE TABLE analyze_orc_test PARTITION(part_date='2019-02-19') COMPUTE STATISTICS noscan;

Partition dw.analyze_orc_test{part_date=2019-02-19} stats: [numFiles=1, numRows=20, totalSize=365, rawDataSize=3600]
OK
Time taken: 0.274 seconds

stats: [numFiles=1, numRows=20, totalSize=365, rawDataSize=3600]

Parquet

// create table
CREATE TABLE `analyze_parquet_test` (`id` string, `name` string) PARTITIONED BY (`part_date` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
    1. Load file to table
hive>  LOAD DATA LOCAL INPATH '/home/hadoop/wxm/analyze_parquet_test_file' OVERWRITE INTO TABLE analyze_parquet_test partition(part_date='2019-02-21');

Loading data to table dw.analyze_parquet_test partition (part_date=2019-02-21)
Partition dw.analyze_parquet_test{part_date=2019-02-21} stats: [numFiles=1, numRows=0, totalSize=390, rawDataSize=0]
OK
Time taken: 0.972 seconds
hive> ANALYZE TABLE analyze_parquet_test PARTITION(part_date='2019-02-21') COMPUTE STATISTICS noscan;

Partition dw.analyze_parquet_test{part_date=2019-02-21} stats: [numFiles=1, numRows=0, totalSize=390, rawDataSize=0]
OK
Time taken: 0.576 seconds

stats: [numFiles=1, numRows=0, totalSize=390, rawDataSize=0]

    2. Add partition and mv file to partition path
// add partition
alter table analyze_parquet_test add partition(part_date='2019-02-20');

// cp files
hadoop fs -cp hdfs://sdg/user/hive/warehouse/dw.db/analyze_parquet_test/part_date=2019-02-19/* hdfs://sdg/user/hive/warehouse/dw.db/analyze_parquet_test/part_date=2019-02-20/
hive> ANALYZE TABLE analyze_parquet_test PARTITION(part_date='2019-02-20') COMPUTE STATISTICS noscan;

Partition dw.analyze_parquet_test{part_date=2019-02-20} stats: [numFiles=1, totalSize=390]
OK
Time taken: 0.799 seconds

stats: [numFiles=1, totalSize=390]

    3. Insert into
hive> insert into table analyze_parquet_test partition(part_date='2019-02-19') select `id`,`name` from analyze_text_copy where part_date='2019-02-21';

Query ID = hadoop_20190221150303_86874b16-b4c6-4e6d-a2e6-c1a02aaae3b8
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1535945194143_1282, Tracking URL = http://wh-8-211:8088/proxy/application_1535945194143_1282/
Kill Command = /app/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/bin/hadoop job -kill job_1535945194143_1282
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-02-21 15:03:27,069 Stage-1 map = 0%, reduce = 0%
2019-02-21 15:03:35,527 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.98 sec
MapReduce Total cumulative CPU time: 4 seconds 980 msec
Ended Job = job_1535945194143_1282
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://sdg/user/hive/warehouse/dw.db/analyze_parquet_test/part_date=2019-02-19/.hive-staging_hive_2019-02-21_15-03-17_975_7489031492557604858-1/-ext-10000
Loading data to table dw.analyze_parquet_test partition (part_date=2019-02-19)
Partition dw.analyze_parquet_test{part_date=2019-02-19} stats: [numFiles=1, numRows=20, totalSize=390, rawDataSize=40]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 4.98 sec HDFS Read: 4039 HDFS Write: 490 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 980 msec
OK
Time taken: 22.794 seconds
hive> ANALYZE TABLE analyze_parquet_test PARTITION(part_date='2019-02-19') COMPUTE STATISTICS noscan;

Partition dw.analyze_parquet_test{part_date=2019-02-19} stats: [numFiles=1, numRows=20, totalSize=390, rawDataSize=40]
OK
Time taken: 0.226 seconds

stats: [numFiles=1, numRows=20, totalSize=390, rawDataSize=40]

Result

The tests above show that for OrcFile, all three loading paths let ANALYZE return the full statistics directly; for the other formats, by default only the size is available, and the RowCount only appears when insert into has triggered a MapReduce job.

ANALYZE (with noscan) takes almost no time, so the statistics must be stored somewhere rather than computed at runtime (the Hive MetaStore was queried and no stored row statistics were found there for this case). The likely explanation is that ORC's self-describing header stores row-level statistics, which ANALYZE reads straight from the file metadata; the file dump below confirms this.

Hive Orc File Dump

Hive Orc Doc

// Orc Dump

hive --orcfiledump hdfs://sdg/user/hive/warehouse/dw.db/analyze_orc_test/part_date=2019-02-19/000000_0

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Structure for hdfs://sdg/user/hive/warehouse/dw.db/analyze_orc_test/part_date=2019-02-19/000000_0
File Version: 0.12 with HIVE_8732
19/02/21 15:54:02 INFO orc.ReaderImpl: Reading ORC rows from hdfs://sdg/user/hive/warehouse/dw.db/analyze_orc_test/part_date=2019-02-19/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 20
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:string>

Stripe Statistics:
Stripe 1:
Column 0: count: 20 hasNull: false
Column 1: count: 20 hasNull: false min: 1 max: 2 sum: 20
Column 2: count: 20 hasNull: false min: wxmimperio1 max: wxmimperio2 sum: 220

File Statistics:
Column 0: count: 20 hasNull: false
Column 1: count: 20 hasNull: false min: 1 max: 2 sum: 20
Column 2: count: 20 hasNull: false min: wxmimperio1 max: wxmimperio2 sum: 220

Stripes:
Stripe: offset: 3 data: 52 rows: 20 tail: 57 index: 77
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 26
Stream: column 2 section ROW_INDEX start: 40 length 40
Stream: column 1 section DATA start: 80 length 8
Stream: column 1 section LENGTH start: 88 length 6
Stream: column 1 section DICTIONARY_DATA start: 94 length 5
Stream: column 2 section DATA start: 99 length 8
Stream: column 2 section LENGTH start: 107 length 6
Stream: column 2 section DICTIONARY_DATA start: 113 length 19
Encoding column 0: DIRECT
Encoding column 1: DICTIONARY_V2[2]
Encoding column 2: DICTIONARY_V2[2]

File length: 365 bytes
Padding length: 0 bytes
Padding ratio: 0%

The dump output contains Rows: 20, which confirms that the ANALYZE command reads the file's metadata.
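
The same check can be done programmatically: the row count sits in the ORC footer and can be read with the ORC reader API. Below is a minimal Java sketch, assuming the orc-core (org.apache.orc) jars, or the equivalent classes bundled with Hive, are on the classpath; the class name is illustrative and the path is the partition file dumped above.

// Minimal sketch: read the row count kept in an ORC file's footer.
// Assumes orc-core (org.apache.orc) is available; the path is the file dumped above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public class OrcRowCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("hdfs://sdg/user/hive/warehouse/dw.db/analyze_orc_test/part_date=2019-02-19/000000_0");
        Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(conf));
        // The footer already carries the statistics that ANALYZE ... noscan reports.
        System.out.println("rows = " + reader.getNumberOfRows());
        System.out.println("compression = " + reader.getCompressionKind());
    }
}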

Parquet File Dump

Parquet Tools

hadoop jar parquet-tools-1.5.0.jar dump hdfs://sdg/user/hive/warehouse/dw.db/analyze_parquet_test/part_date=2019-02-19/000000_0

row group 0
--------------------------------------------------------------------------------
id: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:61/61/1.00 VC:20 ENC:BIT_PACK [more]...
name: BINARY UNCOMPRESSED DO:0 FPO:65 SZ:101/101/1.00 VC:20 ENC:BIT_P [more]...

id TV=20 RL=0 DL=1 DS: 2 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:11 [more]...

name TV=20 RL=0 DL=1 DS: 2 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:11 [more]...

BINARY id
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 20 ***
value 1: R:0 D:1 V:1
value 2: R:0 D:1 V:2
value 3: R:0 D:1 V:1
value 4: R:0 D:1 V:2
value 5: R:0 D:1 V:1
value 6: R:0 D:1 V:2
value 7: R:0 D:1 V:1
value 8: R:0 D:1 V:2
value 9: R:0 D:1 V:1
value 10: R:0 D:1 V:2
value 11: R:0 D:1 V:1
value 12: R:0 D:1 V:2
value 13: R:0 D:1 V:1
value 14: R:0 D:1 V:2
value 15: R:0 D:1 V:1
value 16: R:0 D:1 V:2
value 17: R:0 D:1 V:1
value 18: R:0 D:1 V:2
value 19: R:0 D:1 V:1
value 20: R:0 D:1 V:2

BINARY name
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 20 ***
value 1: R:0 D:1 V:wxmimperio1
value 2: R:0 D:1 V:wxmimperio2
value 3: R:0 D:1 V:wxmimperio1
value 4: R:0 D:1 V:wxmimperio2
value 5: R:0 D:1 V:wxmimperio1
value 6: R:0 D:1 V:wxmimperio2
value 7: R:0 D:1 V:wxmimperio1
value 8: R:0 D:1 V:wxmimperio2
value 9: R:0 D:1 V:wxmimperio1
value 10: R:0 D:1 V:wxmimperio2
value 11: R:0 D:1 V:wxmimperio1
value 12: R:0 D:1 V:wxmimperio2
value 13: R:0 D:1 V:wxmimperio1
value 14: R:0 D:1 V:wxmimperio2
value 15: R:0 D:1 V:wxmimperio1
value 16: R:0 D:1 V:wxmimperio2
value 17: R:0 D:1 V:wxmimperio1
value 18: R:0 D:1 V:wxmimperio2
value 19: R:0 D:1 V:wxmimperio1
value 20: R:0 D:1 V:wxmimperio2

Although the dump shows TV=20, the Parquet metadata is organized per column rather than carrying the complete row-level statistics an ORC file does, so Hive cannot pick up the row count from it here.

Statistics via MapReduce

To obtain complete table statistics, run the ANALYZE command without the noscan option. This triggers a MapReduce job that scans the table files, computes the statistics, and stores the result in the Hive MetaStore, so subsequent ANALYZE calls return the result immediately (a sketch for reading the stored values back follows the transcripts below).

hive> ANALYZE TABLE analyze_parquet_test PARTITION(part_date='2019-02-21') COMPUTE STATISTICS;

Query ID = hadoop_20190221173030_d9cc4e9c-73f6-403e-bfbc-c3cf381b373d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1535945194143_1284, Tracking URL = http://wh-8-211:8088/proxy/application_1535945194143_1284/
Kill Command = /app/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/bin/hadoop job -kill job_1535945194143_1284
Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
2019-02-21 17:30:54,865 Stage-0 map = 0%, reduce = 0%
2019-02-21 17:31:02,248 Stage-0 map = 100%, reduce = 0%, Cumulative CPU 4.21 sec
MapReduce Total cumulative CPU time: 4 seconds 210 msec
Ended Job = job_1535945194143_1284
Partition dw.analyze_parquet_test{part_date=2019-02-21} stats: [numFiles=1, numRows=20, totalSize=390, rawDataSize=60]
MapReduce Jobs Launched:
Stage-Stage-0: Map: 1 Cumulative CPU: 4.21 sec HDFS Read: 2931 HDFS Write: 100 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 210 msec
OK
Time taken: 21.214 seconds
hive> ANALYZE TABLE analyze_parquet_test PARTITION(part_date='2019-02-21') COMPUTE STATISTICS noscan;

Partition dw.analyze_parquet_test{part_date=2019-02-21} stats: [numFiles=1, numRows=20, totalSize=390, rawDataSize=60]
OK
Time taken: 0.219 seconds
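
Because the full COMPUTE STATISTICS run persists its result as partition parameters in the MetaStore, the stored values can also be read back programmatically. Below is a minimal Java sketch using HiveMetaStoreClient, assuming hive-site.xml and the metastore client jars are on the classpath; the numRows and totalSize parameter keys are how Hive keeps these basic stats, and the class name is illustrative.

// Minimal sketch: read back the stats ANALYZE stored as partition parameters in the MetaStore.
// Assumes hive-site.xml and the Hive metastore client jars are on the classpath.
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

public class ReadPartitionStats {
    public static void main(String[] args) throws Exception {
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        // Partition name uses the key=value form, e.g. part_date=2019-02-21.
        Partition p = client.getPartition("dw", "analyze_parquet_test", "part_date=2019-02-21");
        System.out.println("numRows   = " + p.getParameters().get("numRows"));
        System.out.println("totalSize = " + p.getParameters().get("totalSize"));
        client.close();
    }
}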

Summary

As shown above, the ANALYZE command is a quick way to obtain a table's basic statistics, but it has drawbacks: on a partitioned table, specifying a partition returns that partition's statistics, while omitting it lists the statistics of every partition; there is no aggregated whole-table summary, so it usually has to be combined with the two conventional methods mentioned at the start.

OrcFile tables can be analyzed directly with ANALYZE. For TextFile, SequenceFile, and ParquetFile, either make sure the data was loaded through MapReduce, or run a full ANALYZE (without noscan) first for data loaded any other way (for example, as a daily offline job against tables not loaded through MR); the statistics can also be produced automatically as a callback step of the ETL.

  • Size
// getContentSummary is an instance method; requires a Hadoop Configuration (conf) for the target cluster
long size = FileSystem.get(conf).getContentSummary(new Path(tablePath)).getLength();

This call returns the size of a table path, a partition path, or an individual file; the size is the compressed (on-disk) value.

For a whole-table figure, truncate tablePath at the table name; for a single partition, point tablePath at the partition directory (a fuller sketch follows after this list).

  • RowCount
ANALYZE TABLE tableName [PARTITION(part)] COMPUTE STATISTICS noscan;

Run against the whole table, this lists the statistics of every partition; parsing those lines and summing numRows gives the table-level RowCount (see the parsing sketch below). For a non-partitioned table, or when only the current partition matters, only a single record needs to be parsed.
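
For the Size, here is a slightly fuller sketch of the FileSystem-based approach, assuming a reachable HDFS; the warehouse paths are the ones used in this article and the class name is illustrative.

// Minimal sketch: compressed size of a whole table vs. a single partition via getContentSummary.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TableSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long tableSize = fs.getContentSummary(
                new Path("hdfs://sdg/user/hive/warehouse/dw.db/analyze_text")).getLength();
        long partitionSize = fs.getContentSummary(
                new Path("hdfs://sdg/user/hive/warehouse/dw.db/analyze_text/part_date=2019-02-21")).getLength();
        System.out.println("table = " + tableSize + ", partition = " + partitionSize);
    }
}

For the RowCount, one way to aggregate is to capture the per-partition "stats: [...]" lines that ANALYZE prints (as in the transcripts above) and sum their numRows with a simple parser. A sketch, assuming the lines have already been collected from the CLI output; how they are captured is left out here.

// Minimal sketch: sum numRows across the stats lines printed by ANALYZE.
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SumNumRows {
    private static final Pattern NUM_ROWS = Pattern.compile("numRows=(\\d+)");

    public static void main(String[] args) {
        List<String> statsLines = Arrays.asList(
                "Partition dw.analyze_text{part_date=2019-02-19} stats: [numFiles=1, numRows=20, totalSize=280, rawDataSize=260]",
                "Partition dw.analyze_text{part_date=2019-02-21} stats: [numFiles=1, numRows=0, totalSize=298, rawDataSize=0]");
        long total = 0;
        for (String line : statsLines) {
            Matcher m = NUM_ROWS.matcher(line);
            if (m.find()) {
                total += Long.parseLong(m.group(1));
            }
        }
        System.out.println("table rowCount = " + total);
    }
}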


Please credit the source when reposting: https://github.com/imperio-wxm

