Hive Basics (18): Hive Syntax (5) DDL (2) Partitioned Tables and Bucketed Tables

1 Partitioned Tables

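A partition of a Hive table corresponds to a separate directory on HDFS, and that directory holds all the data files of the partition. Partitioning in Hive is therefore simply splitting by directory: a large dataset is divided into smaller ones according to business needs. When the WHERE clause of a query names the partitions it needs, Hive reads only those directories instead of scanning the whole table, which makes the query much more efficient.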
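The directory layout described above can be sketched in a few lines of Python. This is only an illustrative model, not Hive code: the helper names `partition_dir` and `prune`, and the warehouse path, are assumptions made up for the example.

```python
from posixpath import join


def partition_dir(table_dir, spec):
    """Build a partition directory: one key=value level per partition column."""
    return join(table_dir, *[f"{key}={value}" for key, value in spec.items()])


def prune(partition_specs, predicate):
    """Keep only the partitions that satisfy the WHERE-style predicate."""
    return [spec for spec in partition_specs if predicate(spec)]


# Hypothetical table location; the real path depends on the warehouse settings.
TABLE_DIR = "/user/hive/warehouse/dept_partition"

specs = [{"day": d} for d in ("20200401", "20200402", "20200403")]

# Roughly what "where day='20200401'" means for the storage layer:
# only one of the three directories has to be read.
selected = prune(specs, lambda spec: spec["day"] == "20200401")
paths = [partition_dir(TABLE_DIR, spec) for spec in selected]
print(paths)
```

With a two-level spec such as `{"day": "20200401", "hour": "12"}` the same helper produces a nested `day=20200401/hour=12` directory, which is the layout used for second-level partitions later in this post.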

1.1 Basic Operations on Partitioned Tables

1) Why partitioned tables (logs need to be managed by date; department data is used here to simulate them)

dept_20200401.log
dept_20200402.log
dept_20200403.log

2) Syntax for creating a partitioned table

hive (default)> create table dept_partition(
deptno int, dname string, loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';

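Note: a partition column cannot be a column that already exists in the table. You can think of the partition column as a pseudo-column of the table.

3) Load data into the partitioned table

(1) Prepare the data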

dept_20200401.log

10 ACCOUNTING 1700
20 RESEARCH 1800

dept_20200402.log

30 SALES 1900
40 OPERATIONS 1700

dept_20200403.log

50 TEST 2000
60 DEV 1900

(2) Load the data

hive (default)> load data local inpath 
'/opt/module/hive/datas/dept_20200401.log' into table dept_partition 
partition(day='20200401');
hive (default)> load data local inpath 
'/opt/module/hive/datas/dept_20200402.log' into table dept_partition 
partition(day='20200402');
hive (default)> load data local inpath 
'/opt/module/hive/datas/dept_20200403.log' into table dept_partition 
partition(day='20200403');

Note: when loading data into a partitioned table, you must specify the partition.

4) Query data in the partitioned table

Query a single partition

hive (default)> select * from dept_partition where day='20200401';

Query several partitions together

hive (default)> select * from dept_partition where day='20200401'
 union
 select * from dept_partition where day='20200402'
 union
 select * from dept_partition where day='20200403';
hive (default)> select * from dept_partition where day='20200401' or
 day='20200402' or day='20200403';

5) Add partitions

Add a single partition

hive (default)> alter table dept_partition add partition(day='20200404');

Add several partitions at once

hive (default)> alter table dept_partition add partition(day='20200405') partition(day='20200406');

6) Drop partitions

Drop a single partition

hive (default)> alter table dept_partition drop partition (day='20200406');

Drop several partitions at once

hive (default)> alter table dept_partition drop partition (day='20200404'), partition(day='20200405');

7) List the partitions of a partitioned table

hive> show partitions dept_partition;

8) View the structure of a partitioned table

hive> desc formatted dept_partition;
# Partition Information 
# col_name data_type comment 
day string

1.2 Second-Level Partitions

Question: if a single day's log data is also very large, how can the data be split further?

1) Create a table with two partition levels

hive (default)> create table dept_partition2(
 deptno int, dname string, loc string
 )
 partitioned by (day string, hour string)
 row format delimited fields terminated by '\t';

2) Load data the normal way

(1) Load data into the two-level partitioned table

hive (default)> load data local inpath 
'/opt/module/hive/datas/dept_20200401.log' into table
dept_partition2 partition(day='20200401', hour='12');

(2) Query the partition data

hive (default)> select * from dept_partition2 where day='20200401' and 
hour='12';

3) Three ways to associate data with the table when it is uploaded directly into a partition directory

(1) Method 1: upload the data, then repair

Upload the data

hive (default)> dfs -mkdir -p
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log 
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=13;

Query the data (the newly uploaded data is not returned, because the metastore has no record of the new partition yet)

hive (default)> select * from dept_partition2 where day='20200401' and 
hour='13';

Run the repair command

hive> msck repair table dept_partition2;

Query the data again

hive (default)> select * from dept_partition2 where day='20200401' and hour='13';
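Why the repair step is needed can be pictured with a small Python model. It assumes, as a simplification, that the metastore is just a set of known partition names and that the repair adds whatever directories exist on HDFS but are missing from that set. The function name `missing_partitions` is made up for this sketch and is not a Hive API.

```python
def missing_partitions(dirs_on_hdfs, metastore_partitions):
    """Return partition directories that exist on disk but are unknown to the metastore."""
    return sorted(set(dirs_on_hdfs) - set(metastore_partitions))


# Partitions the metastore knows about (registered by the earlier load).
metastore = {"day=20200401/hour=12"}

# Directories actually present under the table directory after dfs -put.
on_hdfs = {"day=20200401/hour=12", "day=20200401/hour=13"}

# Simplified picture of the repair: register every directory the metastore lacks.
to_add = missing_partitions(on_hdfs, metastore)
metastore.update(to_add)
print(to_add)
```

Until the missing entry is registered, a query on hour='13' has no partition to read, even though the file is already sitting in the directory.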

(2) Method 2: upload the data, then add the partition

Upload the data

hive (default)> dfs -mkdir -p
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;
hive (default)> dfs -put /opt/module/hive/datas/dept_20200401.log 
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=14;

Add the partition

hive (default)> alter table dept_partition2 add partition(day='20200401',hour='14');

Query the data

hive (default)> select * from dept_partition2 where day='20200401' and hour='14';

(3) Method 3: create the directory, then load data into the partition

Create the directory

hive (default)> dfs -mkdir -p
/user/hive/warehouse/mydb.db/dept_partition2/day=20200401/hour=15;

Load the data

hive (default)> load data local inpath 
'/opt/module/hive/datas/dept_20200401.log' into table
dept_partition2 partition(day='20200401',hour='15');

Query the data

hive (default)> select * from dept_partition2 where day='20200401' and hour='15';

1.3 Dynamic Partition Tuning

In a relational database, when you insert into a partitioned table, the database automatically places each row in the appropriate partition based on the value of the partition column. Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition), but it has to be configured before use.

1) Parameters for enabling dynamic partitioning

(1) Enable dynamic partitioning (default: true, enabled)

hive.exec.dynamic.partition=true

(2) Set the mode to nonstrict (the dynamic partition mode defaults to strict, which requires at least one partition column to be static; nonstrict allows every partition column to be dynamic)

hive.exec.dynamic.partition.mode=nonstrict

(3) The maximum total number of dynamic partitions that can be created across all nodes running the MR job. Default: 1000

hive.exec.max.dynamic.partitions=1000

(4) The maximum number of dynamic partitions that can be created on each node running the MR job. Set this according to the actual data. For example, if the source data covers a full year, so that the day column has 365 distinct values, this parameter has to be raised to at least 365; with the default of 100 the job fails with an error.

hive.exec.max.dynamic.partitions.pernode=100

(5) The maximum number of HDFS files that the whole MR job may create. Default: 100000

hive.exec.max.created.files=100000

(6) Whether to throw an exception when an empty partition is generated. This usually does not need to be set. Default: false

hive.error.on.empty.partition=false

2) Hands-on example

Requirement: insert the rows of the dept table into the matching partitions of the target table dept_partition_dy, partitioned by region (the loc column).

(1) Create the target partitioned table

hive (default)> create table dept_partition_dy(id int, name string) 
partitioned by (loc int) row format delimited fields terminated by '\t';

(2) Enable nonstrict mode and insert the data

set hive.exec.dynamic.partition.mode = nonstrict;
hive (default)> insert into table dept_partition_dy partition(loc) select deptno, dname, loc from dept;

(3) Check the partitions of the target table

hive (default)> show partitions dept_partition_dy;

Question: how does the target table match the partition column?
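One way to think about the question: in a dynamic-partition insert, Hive matches partition columns by position rather than by name, so the trailing column(s) of the select list supply the partition values. The Python sketch below models only that routing rule. The function name is hypothetical, and the sample rows are assumed to mirror the department data shown earlier, since the contents of the dept table are not listed in this post.

```python
from collections import defaultdict


def dynamic_partition_insert(rows, num_partition_cols=1):
    """Group rows by their trailing column(s), which act as the partition values."""
    partitions = defaultdict(list)
    for row in rows:
        data = row[:-num_partition_cols]           # columns stored in the data file
        part_values = row[-num_partition_cols:]    # columns that pick the partition
        partitions[part_values].append(data)
    return dict(partitions)


# Assumed rows of "select deptno, dname, loc from dept".
dept = [
    (10, "ACCOUNTING", 1700),
    (20, "RESEARCH", 1800),
    (30, "SALES", 1900),
    (40, "OPERATIONS", 1700),
]

result = dynamic_partition_insert(dept)
for part_values, data in sorted(result.items()):
    print(f"loc={part_values[0]}", data)
```

Because the match is positional, reordering the select list (for example putting loc first) would silently use a different column as the partition value.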

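2 Bucketed Tables

Partitioning offers a convenient way to isolate data and optimize queries. However, not every dataset can be partitioned sensibly. For a table or a partition, Hive can organize the data further into buckets, a finer-grained division of the data.

Bucketing is another technique for breaking a dataset into more manageable parts.

Partitioning applies to the storage path of the data; bucketing applies to the data files.

1) Create a bucketed table

(1) Prepare the data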

1001 ss1
1002 ss2
1003 ss3
1004 ss4
1005 ss5
1006 ss6
1007 ss7
1008 ss8
1009 ss9
1010 ss10
1011 ss11
1012 ss12
1013 ss13
1014 ss14
1015 ss15
1016 ss16

(2) Create the bucketed table

create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited fields terminated by '\t';

(3) View the table structure

hive (default)> desc formatted stu_buck;
Num Buckets:            4

(4) Import data into the bucketed table with load

hive (default)> load data inpath '/student.txt' into table stu_buck;

(5) Check whether the bucketed table has been split into 4 buckets

(6) Query the bucketed data

hive(default)> select * from stu_buck;

(7) Bucketing rule:

The results show that Hive hashes the value of the bucketing column and takes the remainder after dividing by the number of buckets to decide which bucket a record is stored in.
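The rule can be tried out on the sample ids with a short Python sketch. It assumes the legacy behaviour in which an int column hashes to its own value; newer Hive releases may use a different hash function, in which case the actual placement differs while the "hash, then modulo" idea stays the same. `bucket_for` is a name invented for this example.

```python
def bucket_for(value, num_buckets):
    """Pick a bucket index: hash the value, then take the remainder."""
    hash_code = value  # assumption: an int hashes to itself
    return (hash_code & 0x7FFFFFFF) % num_buckets  # mask keeps the result non-negative


NUM_BUCKETS = 4
buckets = {index: [] for index in range(NUM_BUCKETS)}

# The sixteen student ids from the sample data, 1001 to 1016.
for student_id in range(1001, 1017):
    buckets[bucket_for(student_id, NUM_BUCKETS)].append(student_id)

for index, ids in buckets.items():
    print(index, ids)
```

Under that assumption each of the 4 buckets receives four ids, for example 1004, 1008, 1012 and 1016 in bucket 0.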

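2) Things to note when working with bucketed tables:

(1) Set the number of reducers to -1 so that the job decides how many reducers to use, or set it to a value greater than or equal to the number of buckets in the table

(2) Load data into the bucketed table from HDFS, to avoid problems with local files not being found

(3) Do not use local mode

3) Import data into the bucketed table with insert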

hive(default)>insert into table stu_buck select * from student_insert;

3 Sampling Queries

For very large datasets, users sometimes need a representative result rather than the full result. Hive supports this by sampling a table.

Syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname])

Sample the data in table stu_buck.

hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);

Note: x must be less than or equal to y; otherwise the query fails with:

FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck
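A rough Python model of what `bucket x out of y on id` selects is given below. It rests on two assumptions: an int hashes to itself, and a row is kept when its hash modulo y equals x - 1. The function `tablesample` and the generated rows are illustrative only and are not part of Hive.

```python
def tablesample(rows, x, y, key=lambda row: row[0]):
    """Keep the rows that fall into sample bucket x (1-based) out of y."""
    if x > y:
        # Mirrors the restriction noted above: x may not exceed y.
        raise ValueError("Numerator should not be bigger than denominator")
    return [row for row in rows if (key(row) & 0x7FFFFFFF) % y == x - 1]


# Rows shaped like the sample data: (1001, "ss1") ... (1016, "ss16").
students = [(i, f"ss{i - 1000}") for i in range(1001, 1017)]

sample = tablesample(students, 1, 4)
print(sample)
```

Under these assumptions, `bucket 1 out of 4` returns a quarter of the rows, and changing y changes the sampled fraction (1/y of the rows for evenly distributed keys).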

Posted 2021-08-14 16:15 by 秋華
