Hive

基础

概念

基于Hadoop的数据仓库工具，将结构化的数据文件映射为一张表，提供类SQL查询功能
本质：将HQL转化成MapReduce程序
- Hive处理的数据存储在HDFS
- Hive分析数据底层的实现是MapReduce/Sprak…
- 执行程序运行在Yarn上
优点
- 简单、容易上手
- 支持用户自定义函数
缺点
- HQL表达能力有限：迭代式算法无法表达
- 效率比较低
官网地址：http://hive.apache.org/
文档查看地址：https://cwiki.apache.org/confluence/display/Hive/GettingStarted
下载地址：http://archive.apache.org/dist/hive/
github地址：https://github.com/apache/hive

安装

安装MySql

Hive默认使用的元数据库为derby，开启Hive之后就会占用元数据库，且不与其他客户端共享数据，如果想多窗口操作就会报错，操作比较局限

修改mysql库下的user表中的root用户允许任意ip连接

alter user 'root'@'localhost' identified with mysql_native_password by 'root';
update mysql.user set host='%' where user='root';
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY PASSWORD 'root';
flush privileges;

vim /etc/mysql/mysql.conf.d/mysqld.cnf
bind_address=0.0.0.0

安装Hive

添加环境变量：/etc/profile

1
2
3

#HIVE_HOME
export HIVE_HOME=/hive
export PATH=$PATH:$HIVE_HOME/bin

解决日志Jar包冲突

1	`mv $HIVE_HOME/lib/log4j-slf4j-impl-2.10.0.jar $HIVE_HOME/lib/log4j-slf4j-impl-2.10.0.bak`

拷贝MySQL驱动

1	`cp /opt/software/mysql-connector-java-5.1.48.jar $HIVE_HOME/lib`

配置Metastore到MySql

1 2	`# 在$HIVE_HOME/conf目录下新建hive-site.xml文件 vim $HIVE_HOME/conf/hive-site.xml`

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- jdbc连接的URL -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://bigdata1:3306/metastore?useSSL=false</value>
  </property>
  <!-- jdbc连接的Driver-->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <!-- jdbc连接的username-->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <!-- jdbc连接的password -->
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
  </property>
  <!-- Hive默认在HDFS的工作目录 -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <!-- Hive元数据存储的验证 -->
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <!-- 元数据存储授权  -->
  <property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
  </property>
  <!-- 打印表头  -->
  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
  </property>
  <!-- 显示所在数据库  -->
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
  </property>
</configuration>

配置运行日志：hive-log4j2.properties

1 2	`# Hive的log默认存放在/tmp/atguigu/hive.log目录下（当前用户名下） property.hive.log.dir=/hive/logs`

新建Hive元数据库
1
create database metastore;

初始化Hive元数据库

1	`schematool -initSchema -dbType mysql -verbose`

启动hadoop集群
启动Hive
1
bin/hive

使用Hive

show databases;
show tables;
create table test (id int);
insert into test values(1);
select * from test;

使用元数据服务的方式访问Hive

若配置过以后以后使用Hive的时候一定要启动元数据服务

在hive-site.xml文件中添加如下配置信息

<!-- 指定存储元数据要连接的地址 -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://bigdata1:9083</value>
</property>

启动metastore

1 2	`# 启动后窗口不能再操作，需打开一个新的shell窗口做别的操作 hive --service metastore`

启动hive

使用JDBC方式访问Hive

在hive-site.xml文件中添加如下配置信息

<!-- 指定hiveserver2连接的host -->
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>bigdata1</value>
</property>

<!-- 指定hiveserver2连接的端口号 -->
<property>
  <name>hive.server2.thrift.port</name>
  <value>10000</value>
</property>

启动hiveserver2

1 2	`# 启动后窗口不能再操作，需打开一个新的shell窗口做别的操作 bin/hive --service hiveserver2`

启动beeline客户端

1	`bin/beeline -u jdbc:hive2://bigdata1:10000 -n 用户名`

编写启动脚本

vim $HIVE_HOME/bin/hiveservices.sh

#!/bin/bash
HIVE_LOG_DIR=$HIVE_HOME/logs
if [ ! -d $HIVE_LOG_DIR ]
then
	mkdir -p $HIVE_LOG_DIR
fi
#检查进程是否运行正常，参数1为进程名，参数2为进程端口
function check_process()
{
    pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')
    ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)
    echo $pid
    [[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1
}

function hive_start()
{
    metapid=$(check_process HiveMetastore 9083)
    cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"
    cmd=$cmd" sleep 4; hdfs dfsadmin -safemode wait >/dev/null 2>&1"
    [ -z "$metapid" ] && eval $cmd || echo "Metastroe服务已启动"
    server2pid=$(check_process HiveServer2 10000)
    cmd="nohup hive --service hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"
    [ -z "$server2pid" ] && eval $cmd || echo "HiveServer2服务已启动"
}

function hive_stop()
{
    metapid=$(check_process HiveMetastore 9083)
    [ "$metapid" ] && kill $metapid || echo "Metastore服务未启动"
    server2pid=$(check_process HiveServer2 10000)
    [ "$server2pid" ] && kill $server2pid || echo "HiveServer2服务未启动"
}

case $1 in
"start")
    hive_start
    ;;
"stop")
    hive_stop
    ;;
"restart")
    hive_stop
    sleep 2
    hive_start
    ;;
"status")
    check_process HiveMetastore 9083 >/dev/null && echo "Metastore服务运行正常" || echo "Metastore服务运行异常"
    check_process HiveServer2 10000 >/dev/null && echo "HiveServer2服务运行正常" || echo "HiveServer2服务运行异常"
    ;;
*)
    echo Invalid Args!
    echo 'Usage: '$(basename $0)' start|stop|restart|status'
    ;;
esac
# 添加执行权限
chmod u+x $HIVE_HOME/bin/hiveservices.sh

参数配置方式
- 配置文件方式：用户自定义配置会覆盖默认配置
  - Hive也会读入Hadoop的配置，因为Hive是作为Hadoop的客户端启动的
- 命令行参数方式：仅对本次hive启动有效
  1
  hive -hiveconf mapred.reduce.tasks=10;
- 参数声明方式：在HQL中使用SET关键字设定参数，仅对本次hive启动有效
  1
  2
  3
  4
  # 查看 set mapred.reduce.tasks # 设置 set mapred.reduce.tasks=100;

命令

-e：不进入hive的交互窗口执行sql语句
-f：执行脚本中sql语句
在hive cli中查看hdfs文件系统：dfs -ls /;
查看所有历史命令（家目录下）：cat .hivehistory

数据类型

基本数据类型

Hive	Java	长度	例子
TINYINT	byte	1byte有符号整数	20
SMALINT	short	2byte有符号整数	20
INT	int	4byte有符号整数	20
BIGINT	long	8byte有符号整数	20
BOOLEAN	boolean	布尔类型，true或者false	TRUE FALSE
FLOAT	float	单精度浮点数	3.14159
DOUBLE	double	双精度浮点数	3.14159
STRING	string	字符系列。可以指定字符集。可以使用单引号或者双引号。	‘now is the time’ “for all good men”
TIMESTAMP		时间类型
BINARY		字节数组

集合数据类型

数据类型	描述	语法示例
STRUCT	和c语言中的struct类似，都可以通过“点”符号访问元素内容。例如，如果某个列的数据类型是STRUCT{first STRING, last STRING},那么第1个元素可以通过字段.first来引用。	`struct()` 例如`struct<street:string, city:string>`
MAP	MAP是一组键-值对元组集合，使用数组表示法可以访问数据。例如，如果某个列的数据类型是MAP，其中键->值对是’first’->’John’和’last’->’Doe’，那么可以通过字段名[‘last’]获取最后一个元素	`map()` 例如`map<string, int>`
ARRAY	数组是一组具有相同类型和名称的变量的集合。这些变量称为数组的元素，每个数组元素都有一个编号，编号从零开始。例如，数组值为[‘John’, ‘Doe’]，那么第2个元素可以通过数组名[1]进行引用。	`Array()` 例如`array<string>`

案例

{
  "name": "songsong",
  "friends": ["bingbing" , "lili"] ,       //列表Array, 
  "children": {                      //键值Map,
    "xiao song": 19 ,
    "xiaoxiao song": 18
  }
  "address": {                      //结构Struct,
    "street": "hui long guan" ,
    "city": "beijing" 
  }
}
// songsong,bingbing_lili,xiao song:18_xiaoxiao song:19,hui long guan_beijing
// yangyang,caicai_susu,xiao yang:18_xiaoxiao yang:19,chao yang_beijing

-- 行格式字段分隔符为,
-- 集合元素分隔符为_
-- map分隔符:
-- 行分隔符\n
create table test(
  name string,
  friends array<string>,
  children map<string, int>,
  address struct<street:string, city:string>
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';

-- 导入文本数据到测试表
load data local inpath '/hive/data/persion.txt' into table test;

-- 查询
select friends[1],children['xiao song'],address.city from test

类型转换

隐式类型转换规则

整数类型可转换为范围更广的类型
整数类型、FLOAT和STRING类型可转换成DOUBLE
TINYINT、SMALLINT、INT可转换为FLOAT
BOOLEAN不可以转换为其它的类型

显示类型转换

使用CAST()转换，若转换失败则为NULL
1
select '1'+2, cast('1'as int) + 2;

DDL数据定义

数据库

建库语法

CREATE DATABASE [IF NOT EXISTS] database_name
-- 描述
[COMMENT database_comment]
-- HDFS上的路径
[LOCATION hdfs_path]
-- 其他参数
[WITH DBPROPERTIES (property_name=property_value, ...)];

创建库

create database if not exists mydb   
comment "my first db"
with dbproperties("createtime"="2022-06-08");

create database if not exists mydb1 
location "/mydb1";

查看库

show databases; 
desc database mydb;
-- 查看带参数的数据库
desc database extended mydb;

修改库

1	`alter database mydb set dbproperties("createtime"="2020-04-24","author"="wyh");`

删除库

drop database mydb1;
-- 如果删除的数据库不存在，最好采用if exists判断数据库是否存在
drop database if exists db_hive2;
-- 如果数据库不为空，可以采用cascade命令强制删除
drop database mydb cascade;

数据表

建表语法

-- EXTERANL：外部表
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
-- 列名 列类型 列描述信息
[(col_name data_type [COMMENT col_comment], ...)]
-- 表描述信息
[COMMENT table_comment]
-- 创建分区表指定分区字段 分区列名 列类型
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
-- 创建分桶表指定分桶字段 分桶列名
[CLUSTERED BY (col_name, col_name, ...)
-- 指定分桶数
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
-- 指定一条数据字段与字段的分割符
[ROW FORMAT delimited fields terminated by ... ]
-- 指定集合元素与元素的分割符
[collection items terminated by  ... ]
-- 指定=-=map的kv的分割符
[map keys terminated by ... ]
-- 指定文件存储格式，默认为textfile
[STORED AS file_format]
-- 指定表在hdfs中对应的路径
[LOCATION hdfs_path]
-- 指定表的属性
[TBLPROPERTIES (property_name=property_value, ...)]
-- 基于某个查询建表
[AS select_statement]

内部表

默认，删除表时，Hive会删除表中数据

create database mydb;

-- 1001,zhangsan
-- 1002,lisi
-- 1003,wangwu

create table if not exists test1(
  id int comment "this's id ",
  name string  comment "this 's  name"
)
comment "this's table"
row format delimited fields terminated by ','
STORED as textfile 
TBLPROPERTIES("createtime"="2021-04-24");

create table if not exists test2(
  id int ,
  name string 
)
row format delimited fields terminated by ','
location "/test2.table";

create table if not exists test3(
  id int ,
  name string 
)
row format delimited fields terminated by ','
location "/mydata";

create table if not exists test4(
  id int ,
  name string 
)
row format delimited fields terminated by ',';

外部表

删除表时，Hive不会删除表中数据

create external table if not exists test5(
  id int ,
  name string 
)
row format delimited fields terminated by ',';

-- 内部表外部表相互转换（大小写严格）
alter table test4 set tblproperties('EXTERNAL'='TRUE');
alter table test5 set tblproperties('EXTERNAL'='FALSE');

查看表

show tables
desc test1;
-- 显示详细信息
desc formatted test1;

修改表

-- 1001	zhangsan	10000.1
-- 1002	lisi	10000.2
-- 1003	wangwu	10000.3

create table emp(
  id int , 
  name string, 
  salary double  
) 
row format delimited fields terminated by '\t';  

alter table emp rename to emptest ; 
alter table emptest change column salary sal double ;
alter table emptest add columns (addr string, deptno int );
-- 替换为整体替换字段
alter table emptest replace columns (empid int, empname string);

删除表

1	`drop table dept;`

DML数据操作

导入数据

Load导入

load data [local] inpath '数据的path' [overwrite] into table student [partition (partcol1=val1,…)];
-- load data：表示加载数据
-- local：表示从本地加载数据到hive表；否则从HDFS加载数据到hive表
-- inpath：表示加载数据的路径
-- overwrite：表示覆盖表中已有数据，否则表示追加
-- into table：表示加载到哪张表
-- student：表示具体的表
-- partition：表示上传到指定分区

create table student(id string, name string) row format delimited fields terminated by '\t';

load data local inpath '/opt/module/hive-3.1.2/datas/student.txt' into table student; 	
load data local inpath '/opt/module/hive-3.1.2/datas/student1.txt' into table student; 
load data local inpath '/opt/module/hive-3.1.2/datas/student2.txt' overwrite into table student;
-- load HDFS相当于移动HDFS中的文件
load data inpath '/hivedatas/student.txt' into table student;

Insert导入

insert into：以追加数据的方式插入到表或分区，原有数据不会删除
insert overwrite：会覆盖表中已存在的数据
insert不支持插入部分字段

insert into student values(1017,'ss17'),(1018,'ss18'),(1019,'ss19');  

create table student2(id string, name string) row format delimited fields terminated by '\t';

-- 导入查询的结果
insert into student2 select id, name from student;

As Select导入

1 2	`-- 创建表并加载查询的结果 create table student3 as select id, name from student;`

Location导入

1
2
3

create table student4(id string, name string) 
row format delimited fields terminated by '\t'
location '/student4';

Import导入

先用export导出后，再将数据导入

1	`import table emptest2 from '/emptest';`

导出数据

概述

如果表中的列的值为null，导出到文件中以后通过\N来表示

Insert导出

-- 导出到本地
insert overwrite local directory '/opt/module/hive-3.1.2/datas/insert-result' select * from  emptest;  

-- 格式化出到本地
insert overwrite local directory '/opt/module/hive-3.1.2/datas/insert-result' 
row format delimited fields terminated by ':'
select * from emptest;

-- 导出到HDFS
insert overwrite directory '/insert-result' 
row format delimited fields terminated by ':'
select * from emptest;

Hadoop导出

1	`dfs -get`

Hive Shell导出

1	`hive -e 'select * from default.student;' > /opt/module/hive/datas/export/student4.txt;`

Export导出

export和import主要用于两个Hadoop平台集群之间Hive表迁移（还会导出元数据）

1 2	`-- 只能导出到HDFS上 export table default.student to '/user/hive/warehouse/export/student';`

Sqoop导出

清除表中数据

1 2	`-- 只能删除管理表，不能删除外部表中数据 truncate table student;`

查询

查询语法

-- 查哪些
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
-- 从哪查
FROM table_reference
-- 过滤条件
[WHERE where_condition]
-- 分组
[GROUP BY col_list]
-- 分组后过滤条件
[HAVING having_contiditon]
-- 全局排序
[ORDER BY col_list]
-- 分区排序 | 分区 区内排序
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
-- 限制返回的条数
[LIMIT number]

运算符

算数运算符

运算符	描述
A+B	A和B 相加
A-B	A减去B
A*B	A和B 相乘
A/B	A除以B
A%B	A对B取余
A&B	A和B按位取与
A\|B	A和B按位取或
A^B	A和B按位取异或
~A	A按位取反

比较运算符

操作符	支持的数据类型	描述
A=B	基本数据类型	如果A等于B则返回TRUE，反之返回FALSE
A<=>B	基本数据类型	如果A和B都为NULL，则返回TRUE，如果一边为NULL，返回False
A<>B, A!=B	基本数据类型	A或者B为NULL则返回NULL；如果A不等于B，则返回TRUE，反之返回FALSE
A<B	基本数据类型	A或者B为NULL，则返回NULL；如果A小于B，则返回TRUE，反之返回FALSE
A<=B	基本数据类型	A或者B为NULL，则返回NULL；如果A小于等于B，则返回TRUE，反之返回FALSE
A>B	基本数据类型	A或者B为NULL，则返回NULL；如果A大于B，则返回TRUE，反之返回FALSE
A>=B	基本数据类型	A或者B为NULL，则返回NULL；如果A大于等于B，则返回TRUE，反之返回FALSE
A [NOT] BETWEEN B AND C	基本数据类型	如果A，B或者C任一为NULL，则结果为NULL。如果A的值大于等于B而且小于或等于C，则结果为TRUE，反之为FALSE。如果使用NOT关键字则可达到相反的效果。
A IS NULL	所有数据类型	如果A等于NULL，则返回TRUE，反之返回FALSE
A IS NOT NULL	所有数据类型	如果A不等于NULL，则返回TRUE，反之返回FALSE
IN(数值1, 数值2)	所有数据类型	使用 IN运算显示列表中的值
A [NOT] LIKE B	STRING 类型	B是一个SQL下的简单正则表达式，也叫通配符模式，如果A与其匹配的话，则返回TRUE；反之返回FALSE。B的表达式说明如下：‘x%’表示A必须以字母‘x’开头，‘%x’表示A必须以字母’x’结尾，而‘%x%’表示A包含有字母’x’,可以位于开头，结尾或者字符串中间。如果使用NOT关键字则可达到相反的效果。
A RLIKE B, A REGEXP B	STRING 类型	B是基于java的正则表达式，如果A与其匹配，则返回TRUE；反之返回FALSE。匹配使用的是JDK中的正则表达式接口实现的，因为正则也依据其中的规则。例如，正则表达式必须和整个字符串A相匹配，而不是只需与其字符串匹配。

RLike

是Hive中这个功能的一个扩展，其可以通过Java的正则表达式这个更强大的语言来指定匹配条件

1 2	`-- 查找名字中带有A的员工信息 select * from emp where ename RLIKE '[A]';`

逻辑运算符

操作符	含义
AND	逻辑并
OR	逻辑或
NOT	逻辑否

分组

分组之后，select后面只能跟组标识（分组字段）和聚合函数（分组函数）

计算emp表每个部门的平均工资

1	`select deptno ,avg(sal) avg_sal from emp group by deptno`

计算emp每个部门中每个岗位的最高薪水

1	`select deptno ,job ,max(sal) max_sal from emp group by deptno,job;`

计算emp中每个部门中最高薪水的那个人.

-- 错误
select deptno ,max(sal) max_sal ,ename  from emp group by deptno 

-- 正确
select
  e.deptno , e.sal , e.ename 
from
  emp e 
join
  (select deptno ,max(sal) max_sal from emp group by deptno ) a 
on e.deptno = a.deptno and e.sal = a.max_sal;

计算emp中除了CLERK岗位之外的剩余员工的每个部门的平均工资大于1000的部门和平均工资

select
  deptno , avg(sal) avg_sal 
from
  emp
where
  job != 'CLERK'
group by
  deptno 
having
  avg_sal >2000;

Join

内连接

内连接的结果集取交集

A inner join B on ...

-- emp 和 dept共有的数据（内连接）
select
  e.ename, d.deptno
from
  emp e inner join dept d
on e.deptno = d.deptno;

外连接

主表（驱动表）和从表（匹配表）
外连接的结果集主表的所有数据 + 从表中与主表匹配的数据

左外连接

1
2
3

-- A 主 B 从
A left outer join B  on ....
B right outer Join A  on....

右外连接

1
2
3

-- A 从 B 主
A right outer join B  on ....
B left outer join  A  on ....

-- emp所有的数据 和 dept中与emp匹配的数据
select
  e.ename, d.deptno
from
  emp e left outer join dept d 
on e.deptno = d.deptno; 

select
  e.ename, d.deptno
from
  dept d right outer join emp e 
on e.deptno = d.deptno; 

-- dept中所有的数据 和 emp中与dept匹配的数据
select
  e.ename, d.deptno
from
  dept d left outer join emp e 
on d.deptno = e.deptno;

select
  e.ename, d.deptno
from
  emp e right outer join dept d 
on d.deptno = e.deptno;

-- emp表独有的数据
select
  e.ename, d.deptno
from
  emp e left outer join dept d 
on e.deptno = d.deptno  
where 
  d.deptno is null;

-- dept表独有的数据
select
e.ename, d.deptno
from
dept d left outer join emp e 
on d.deptno = e.deptno 
where
e.deptno is null ;

全外连接

union all：将结果集拼接到一起，不去重
union：将结果集拼接到一起，去重
full outer join

-- emp 和 dept 所有的数据(全外连接，满外连接)
select
  e.ename, d.deptno
from
 emp e left outer join dept d 
on e.deptno = d.deptno  
union all
select
 e.ename, d.deptno
from
dept d left outer join emp e 
on d.deptno = e.deptno;

select
  e.ename, d.deptno
from
 emp e left outer join dept d 
on e.deptno = d.deptno  
union
select
 e.ename, d.deptno
from
dept d left outer join emp e 
on d.deptno = e.deptno;

select
  e.ename, d.deptno
from
 emp e full outer join dept d 
on e.deptno = d.deptno;

--  emp 和 dept 独有的数据
select
  e.ename, d.deptno
from
 emp e full outer join dept d 
on e.deptno = d.deptno 
where 
   e.deptno is null 
or 
   d.deptno is null;

-- 查询 员工名  部门名  位置名 
select
  e.ename, d.dname, l.loc_name
from
  emp e inner join dept d
on e.deptno = d.deptno  
inner join location l 
on d.loc = l.loc;

排序

Order By

全局排序，只有一个Reducer

Sort By

每个Reduce内部排序
大规模数据集order by效率非常低。很多情况不需要全局排序，可使用sort by
Sort by为每个reducer产生一个排序文件，进行内部排序，对全局结果集不是排序
一般与分区一起用，未指定分区则会随机分区

Distribute By

分区
distribute by可以指定运行的Reducer（通常是为了进行后续的聚集操作）
类似MR中partition（自定义分区）进行分区，结合sort by使用
一定要分配多reduce进行处理，否则无法看到distribute by的效果
根据分区字段的hash码与reduce的个数进行模除运算得出
DISTRIBUTE BY语句要写在SORT BY语句之前

1	`select * from emp distribute by deptno sort by empno desc;`

Cluster By

distribute by和sort by字段相同时，可以使用cluster by方式
只能是升序排序，不能指定排序规则

1
2
3

-- 等价
select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;

分区/分桶表

分区表

概念

分区基本操作

创建分区表

create table dept_partition (
  deptno int ,
  dname string,
  loc string 
) 
partitioned by (day string) 
row format delimited fields terminated by '\t';

-- 往分区表加载数据
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200401.log' into table dept_partition partition(day='20200401'); 
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200402.log' into table dept_partition partition(day='20200402');
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200403.log'into table dept_partition partition(day='20200403');

查分区

1	`show partitions dept_partition;`

加分区

1
2
3

alter table dept_partition add partition(day='20200404');
-- 多分区中间为空格
alter table dept_partition add partition(day='20200405') partition(day='20200406');

删分区

1
2
3

alter table dept_partition drop partition(day='20200404');
-- 多分区中间为逗号
alter table dept_partition drop partition(day='20200405'), partition(day='20200406');

二级分区

创建二级分区表

create table dept_partition2(
  deptno int, 
  dname string, 
  loc string 
)
partitioned by (day string ,hour string)
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200401.log' into table dept_partition2 partition(day='20200401',hour='12'); 
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200402.log' into table dept_partition2 partition(day='20200401',hour='13');
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200403.log' into table dept_partition2 partition(day='20200401',hour='14');
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200401.log' into table dept_partition2 partition(day='20200402',hour='07'); 
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200402.log' into table dept_partition2 partition(day='20200402',hour='08');
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200403.log' into table dept_partition2 partition(day='20200402',hour='09';

分区表与数据产生关联的方式

上传数据后修复

-- 创建分区目录
hadoop fs -mkdir /user/hive/warehouse/mydb.db/dept_partition/day=20200404
-- 上传数据
hadoop fs -put /opt/module/hive-3.1.2/datas/dept_20200403.log  /user/hive/warehouse/mydb.db/dept_partition/day=20200404
-- 分区修复
msck repair table dept_partition;

上传数据后添加分区

-- 创建分区目录
hadoop fs -mkdir /user/hive/warehouse/mydb.db/dept_partition/day=20200405
-- 上传数据
hadoop fs -put /opt/module/hive-3.1.2/datas/dept_20200403.log  /user/hive/warehouse/mydb.db/dept_partition/day=20200405
-- 添加分区
alter table dept_partition add partition(day='20200405');

创建文件夹后load数据到分区

-- 创建分区目录
hadoop fs -mkdir /user/hive/warehouse/mydb.db/dept_partition/day=20200406
-- Load数据到分区
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200403.log' into table dept_partition partition(day='20200406')

动态分区

参数设置

# 开启动态分区功能（默认true，开启）
hive.exec.dynamic.partition=true
# 设置为非严格模式（动态分区的模式，默认strict，表示必须指定至少一个分区为静态分区，nonstrict模式表示允许所有的分区字段都可以使用动态分区。）
hive.exec.dynamic.partition.mode=nonstrict
# 在所有执行MR的节点上，最大一共可以创建多少个动态分区。默认1000
hive.exec.max.dynamic.partitions=1000
# 在每个执行MR的节点上，最大可以创建多少个动态分区。该参数需要根据实际的数据来设定。比如：源数据中包含了一年的数据，即day字段有365个值，那么该参数就需要设置成大于365，如果使用默认值100，则会报错。
hive.exec.max.dynamic.partitions.pernode=100
# 整个MR Job中，最大可以创建多少个HDFS文件。默认100000
hive.exec.max.created.files=100000
# 当有空分区生成时，是否抛出异常。一般不需要设置。默认false
hive.error.on.empty.partition=false

具体操作

-- 创建分区表
create table dept_partition_dy(
  deptno int, 
  dname string, 
  loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';

-- 从本地load可能会出错，MapTask节点可能没当前的数据
-- 老版本load操作不跑MR会报错，需要使用到中间表
load data inpath '/dept_partition_dy.txt' into table dept_partition_dy;

create table dept_dy(
   deptno int, 
   dname string, 
   loc string,
   day string 
)
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/hive-3.1.2/datas/dept_partition_dy.txt' into table dept_dy;

分桶表

分区针对的是数据的存储路径，分桶针对的是数据文件
分桶字段与桶数进行取模操作

创建分桶表

create table student_bucket(
  id int, 
  name string 
)
-- 按照id分桶
clustered by (id) into 4 buckets 
row format delimited fields terminated by '\t';   

-- 从本地load可能会出错，MapTask节点可能没当前的数据
-- 老版本load操作不跑MR会报错，需要使用到中间表
load data inpath '/student.txt' into table student_bucket;

抽样查询

对于非常大的数据集，有时用户需要使用的是一个具有代表性的查询结果而不是全部结果。Hive可以通过对表进行抽样来满足这个需求

-- x：第几个桶的数据
-- y：分几个桶
-- x的值必须小于等于y的值，否则报错
TABLESAMPLE(BUCKET x OUT OF y)

select * from stu_buck tablesample(bucket 1 out of 4 on id);

函数

常用函数

系统内置函数

-- 查看系统自带的函数
show functions;
-- 显示自带的函数的用法
desc function upper;
-- 详细显示自带的函数的用法
desc function extended upper;

日期函数

Hive能识别的日期格式yyyy-MM-dd HH:mm:ss

-- unix_timestamp:返回当前或指定时间的时间戳	
select unix_timestamp();
select unix_timestamp("2020-10-28",'yyyy-MM-dd');
-- from_unixtime：将时间戳转为日期格式
select from_unixtime(1603843200);
-- current_date：当前日期
select current_date;
-- current_timestamp：当前的日期加时间
select current_timestamp;
-- to_date：抽取日期部分
select to_date('2020-10-28 12:12:12');
-- year：获取年
select year('2020-10-28 12:12:12');
-- month：获取月
select month('2020-10-28 12:12:12');
-- day：获取日
select day('2020-10-28 12:12:12');
-- hour：获取时
select hour('2020-10-28 12:13:14');
-- minute：获取分
select minute('2020-10-28 12:13:14');
-- second：获取秒
select second('2020-10-28 12:13:14');
-- weekofyear：当前时间是一年中的第几周
select weekofyear('2020-10-28 12:12:12');
-- dayofmonth：当前时间是一个月中的第几天
select dayofmonth('2020-10-28 12:12:12');
-- months_between：两个日期间的月份
select months_between('2020-04-01','2020-10-28');
-- add_months：日期加减月
select add_months('2020-10-28',-3);
-- datediff：两个日期相差的天数
select datediff('2020-11-04','2020-10-28');
-- date_add：日期加天数
select date_add('2020-10-28',4);
-- date_sub：日期减天数
select date_sub('2020-10-28',-4);
-- last_day：日期的当月的最后一天
select last_day('2020-02-30');
-- date_format(): 格式化日期
select date_format('2020-10-28 12:12:12','yyyy/MM/dd HH:mm:ss');

字符串函数

-- upper：转大写
select upper('low');
-- lower：转小写
select lower('low');
-- length：长度
select length("atguigu");
-- trim：前后去空格(不能去中间的空格)
select trim(" atguigu ");
-- lpad：向左补齐，到指定长度
select lpad('atguigu',9,'g');
-- rpad：向右补齐，到指定长度
select rpad('atguigu',9,'g');
-- regexp_replace：使用正则表达式匹配目标字符串，匹配成功后替换！
SELECT regexp_replace('2020/10/25', '/', '-');

取整函数

-- round：四舍五入
select round(3.14);
select round(3.54);
-- ceil：向上取整
select ceil(3.14);
select ceil(3.54);
--floor：向下取整
select floor(3.14);
select floor(3.54);

集合操作

-- size：集合中元素的个数
select size(friends) from test3;
-- map_keys：返回map中的key
select map_keys(children) from test3;
-- map_values：返回map中的value
select map_values(children) from test3;
-- array_contains：判断array中是否包含某个元素
select array_contains(friends,'bingbing') from test3;
-- sort_array：将array中的元素排序
select sort_array(friends) from test3;

多维分析

-- grouping sets

create table testgrouping (
  id int, 
  name string, 
  sex string, 
  deptno int 
)
row format delimited fields terminated by ',';

-- 1001,zhangsan,man,10
-- 1002,xiaohua,female,10
-- 1003,lisi,man,20
-- 1004,xiaohong,female,20
-- 需求：统计每个部门各多少人，男女各多少人，每个部门中男女各多少人 

select deptno, sex ,count(id) from testgrouping group by deptno,sex  grouping sets((deptno,sex), sex, deptno);

空字段赋值

如果value为NULL，则NVL函数返回default_value的值，否则返回value的值，如果两个参数都为NULL ，则返回NULL

NVL( value，default_value)

select comm, nvl(comm, -1) from emp;
select comm, nvl(comm,mgr) from emp;

CASE函数

-- 数据 
-- +---------------+------------------+--------------+
-- | emp_sex.name  | emp_sex.dept_id  | emp_sex.sex  |
-- +---------------+------------------+--------------+
-- | 悟空           | A                | 男           |
-- | 大海           | A                | 男           |
-- | 宋宋           | B                | 男           |
-- | 凤姐           | A                | 女           |
-- | 婷姐           | B                | 女           |
-- | 婷婷           | B                | 女           |
-- +---------------+------------------+--------------+

-- 需求
-- dept_Id   男      女
-- A         2       1
-- B         1       2

-- SQL
select 
  dept_Id, 
  sum(case sex when '男' then 1 else 0 end ) man, 
  sum(case sex when '女' then 1 else 0 end ) female
from
  emp_sex 
group by dept_Id;

select 
  dept_Id, 
  sum(if(sex='男',1,0)) man, 
  sum(if(sex='女',1,0)) female
from
  emp_sex 
group by dept_Id;

行转列

-- 相关函数
-- 字符串拼接
concat()
-- 字符串拼接，只支持sting或array<string>
concat_ws()
-- 去重汇总
collect_set()
-- 汇总
collect_list()

-- 数据
-- +----------+----------------+-------------+
-- | name     | constellation  | blood_type  |
-- +----------+----------------+-------------+
-- | 孙悟空		| 白羊座          | A           |
-- | 大海			 | 射手座          | A           |
-- | 宋宋      | 白羊座        	| B           |
-- | 猪八戒    | 白羊座          | A           |
-- | 凤姐      | 射手座          | A           |
-- | 苍老师    | 白羊座          | B           |
-- +----------+----------------+-------------+

-- 需求
-- 射手座,A            大海|凤姐
-- 白羊座,A            孙悟空|猪八戒
-- 白羊座,B            宋宋|苍老师

-- SQL
-- 将constellation和blood_type拼接到一起
select
  name, 
  concat_ws(',',constellation,blood_type) c_b
from 
  person_info => t1;

-- 按照 c_b分组，在组内将name进行汇总
select
  t1.c_b, concat_ws('|',collect_set(t1.name)) names
from 
 (select
  name, 
  concat_ws(',',constellation,blood_type) c_b
from 
  person_info)t1
group by t1.c_b

列转行

-- 相关函数
-- 将数组或者map拆分成多行
explode()
-- 侧写表(虚拟表)
LATERAL VIEW

-- 数据
-- +-------------------+----------------------+
-- | movie_info.movie  | movie_info.category  |
-- +-------------------+----------------------+
-- | 《疑犯追踪》        | 悬疑,动作,科幻,剧情     |
-- | 《Lie to me》      | 悬疑,警匪,动作,心理,剧情|
-- | 《战狼2》          | 战争,动作,灾难         |
-- +-------------------+----------------------+

-- 结果
--《疑犯追踪》      悬疑
--《疑犯追踪》      动作
--《疑犯追踪》      科幻
--《疑犯追踪》      剧情
--《Lie to me》   悬疑
--《Lie to me》   警匪
--《Lie to me》   动作
--《Lie to me》   心理
--《Lie to me》   剧情
--《战狼2》        战争
--《战狼2》        动作
--《战狼2》        灾难

-- SQL
select
   movie,
   category_name
from 
   movie_info
-- movie_info_tmp：侧写表的名字，一般用不上
-- category_name：拆分的列在侧写表中的名字
LATERAL view explode(split(category,',')) movie_info_tmp as category_name;

窗口函数

-- 会为每条数据都开启一个窗口. 默认的窗口大小就是当前数据集的大小
over()
-- 会按照指定的字段进行分区， 将分区字段的值相同的数据划分到相同的区
-- 每个区中的每条数据都会开启一个窗口.每条数据的窗口大小默认为当前分区数据集的大小
over(partition by ....)
-- 会在窗口中按照指定的字段对数据进行排序
-- 会为每条数据都开启一个窗口,默认的窗口大小为从数据集开始到当前行
over(order by ....)
-- 会按照指定的字段进行分区， 将分区字段的值相同的数据划分到相同的区
-- 在每个区中会按照指定的字段进行排序
-- 会为每条数据都开启一个窗口,默认的窗口大小为当前分区中从数据集开始到当前行
over(partition by .... order by ....)
-- 指定每条数据的窗口大小
over(partition by ... order by ... rows between ... and ....)

-- 当前行
CURRENT ROW
-- 往前n行数据（范围）
n PRECEDING
-- 往后n行数据
n FOLLOWING
-- 起点
UNBOUNDED
-- 表示从前面的起点
	UNBOUNDED PRECEDING
-- 表示到后面的终点
    UNBOUNDED FOLLOWING
-- 往前第n行数据（一行数据）
LAG(col,n,default_val)
-- 往后第n行数据
LEAD(col,n, default_val)
-- 把有序窗口的行分发到指定数据的组中，各个组有编号，编号从1开始，对于每一行，NTILE返回此行所属的组的编号
-- n必须为int类型
NTILE(n)

-- 关键字
-- 全局排序 或者 窗口函数中排序
order by
-- 分区
distribute by
-- 区内排序
sort by
-- 分区排序
cluster by
-- 窗口函数中分区
partition by
-- 建表指定分区字段
partitioned by
-- 建表指定分桶字段
clustered by
  
-- 数据
-- +----------------+---------------------+----------------+
-- | business.name  | business.orderdate  | business.cost  |
-- +----------------+---------------------+----------------+
-- | jack           | 2017-01-01          | 10             |
-- | tony           | 2017-01-02          | 15             |
-- | jack           | 2017-02-03          | 23             |
-- | tony           | 2017-01-04          | 29             |
-- | jack           | 2017-01-05          | 46             |
-- | jack           | 2017-04-06          | 42             |
-- | tony           | 2017-01-07          | 50             |
-- | jack           | 2017-01-08          | 55             |
-- | mart           | 2017-04-08          | 62             |
-- | mart           | 2017-04-09          | 68             |
-- | neil           | 2017-05-10          | 12             |
-- | mart           | 2017-04-11          | 75             |
-- | neil           | 2017-06-12          | 80             |
-- | mart           | 2017-04-13          | 94             |
-- +----------------+---------------------+----------------+  

-- 需求
==========
==========
-- 需求一：查询在2017年4月份购买过的顾客及总人数
select
  name, orderdate, cost 
from 
  business 
where month(orderdate) = '4';
==========
-- 方式1
select
  name, orderdate, cost 
from 
  business 
where substring(orderdate,0,7) = '2017-04' => t1;
-- +-------+-------------+-------+
-- | name  |  orderdate  | cost  |
-- +-------+-------------+-------+
-- | jack  | 2017-04-06  | 42    |
-- | mart  | 2017-04-08  | 62    |
-- | mart  | 2017-04-09  | 68    |
-- | mart  | 2017-04-11  | 75    |
-- | mart  | 2017-04-13  | 94    |
-- +-------+-------------+-------+
-- 按照name分组，求count
select
  t1.name, count(t1.name) over()
from 
  (select
  		name, orderdate, cost 
		from 
  		business 
		where substring(orderdate,0,7) = '2017-04'
  ) t1
group by t1.name;
==========
-- 方式2
select
  distinct(t1.name)
from 
  (select
  		name, orderdate, cost 
		from 
  		business 
		where substring(orderdate,0,7) = '2017-04'
  ) t1;
-- +----------+
-- | t1.name  |
-- +----------+
-- | jack     |
-- | mart     |
-- +----------+
select 
   t2.name, count(t2.name) over()
from 
(select
  distinct(t1.name)
from 
  (select
  name, orderdate, cost 
from 
  business 
where substring(orderdate,0,7) = '2017-04')t1) t2
-- +----------+-----------------+
-- | t2.name  | count_window_0  |
-- +----------+-----------------+
-- | mart     | 2               |
-- | jack     | 2               |
-- +----------+-----------------+
==========
==========
-- 需求二：查询顾客的购买明细及所有顾客的月购买总额
select
  name, 
  orderdate, 
  cost , 
  sum(cost) over(partition by substring(orderdate,0,7))  month_cost
from
  business;
-- +-------+-------------+-------+-------------+
-- | name  |  orderdate  | cost  | month_cost  |
-- +-------+-------------+-------+-------------+
-- | jack  | 2017-01-01  | 10    | 205         |
-- | jack  | 2017-01-08  | 55    | 205         |
-- | tony  | 2017-01-07  | 50    | 205         |
-- | jack  | 2017-01-05  | 46    | 205         |
-- | tony  | 2017-01-04  | 29    | 205         |
-- | tony  | 2017-01-02  | 15    | 205         |
-- | jack  | 2017-02-03  | 23    | 23          |
-- | mart  | 2017-04-13  | 94    | 341         |
-- | jack  | 2017-04-06  | 42    | 341         |
-- | mart  | 2017-04-11  | 75    | 341         |
-- | mart  | 2017-04-09  | 68    | 341         |
-- | mart  | 2017-04-08  | 62    | 341         |
-- | neil  | 2017-05-10  | 12    | 12          |
-- | neil  | 2017-06-12  | 80    | 80          |
-- +-------+-------------+-------+-------------+
select
  name, 
  orderdate,
  cost, 
  sum(cost) over(partition by name, substring(orderdate,0,7)) name_month_cost
from 
  business;
-- +-------+-------------+-------+------------------+
-- | name  |  orderdate  | cost  | name_month_cost  |
-- +-------+-------------+-------+------------------+
-- | jack  | 2017-01-05  | 46    | 111              |
-- | jack  | 2017-01-08  | 55    | 111              |
-- | jack  | 2017-01-01  | 10    | 111              |
-- | jack  | 2017-02-03  | 23    | 23               |
-- | jack  | 2017-04-06  | 42    | 42               |
-- | mart  | 2017-04-13  | 94    | 299              |
-- | mart  | 2017-04-11  | 75    | 299              |
-- | mart  | 2017-04-09  | 68    | 299              |
-- | mart  | 2017-04-08  | 62    | 299              |
-- | neil  | 2017-05-10  | 12    | 12               |
-- | neil  | 2017-06-12  | 80    | 80               |
-- | tony  | 2017-01-04  | 29    | 94               |
-- | tony  | 2017-01-02  | 15    | 94               |
-- | tony  | 2017-01-07  | 50    | 94               |
-- +-------+-------------+-------+------------------+
==========
==========
-- 需求三：将每个顾客的cost按照日期进行累加
select 
  name, 
  orderdate,
  cost,
  sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW) lj
from 
  business;

select 
  name, 
  orderdate,
  cost,
  sum(cost) over(partition by name order by orderdate ) lj
from 
  business; 
-- +-------+-------------+-------+------+
-- | name  |  orderdate  | cost  |  lj  |
-- +-------+-------------+-------+------+
-- | jack  | 2017-01-01  | 10    | 10   |
-- | jack  | 2017-01-05  | 46    | 56   |
-- | jack  | 2017-01-08  | 55    | 111  |
-- | jack  | 2017-02-03  | 23    | 134  |
-- | jack  | 2017-04-06  | 42    | 176  |
-- | mart  | 2017-04-08  | 62    | 62   |
-- | mart  | 2017-04-09  | 68    | 130  |
-- | mart  | 2017-04-11  | 75    | 205  |
-- | mart  | 2017-04-13  | 94    | 299  |
-- | neil  | 2017-05-10  | 12    | 12   |
-- | neil  | 2017-06-12  | 80    | 92   |
-- | tony  | 2017-01-02  | 15    | 15   |
-- | tony  | 2017-01-04  | 29    | 44   |
-- | tony  | 2017-01-07  | 50    | 94   |
-- +-------+-------------+-------+------+


-- 需求三：将所有顾客的cost按照日期进行累加
select 
  name, 
  orderdate,
  cost,
  sum(cost) over(order by orderdate ) lj
from 
  business 
-- +-------+-------------+-------+------+
-- | name  |  orderdate  | cost  |  lj  |
-- +-------+-------------+-------+------+
-- | jack  | 2017-01-01  | 10    | 10   |
-- | tony  | 2017-01-02  | 15    | 25   |
-- | tony  | 2017-01-04  | 29    | 54   |
-- | jack  | 2017-01-05  | 46    | 100  |
-- | tony  | 2017-01-07  | 50    | 150  |
-- | jack  | 2017-01-08  | 55    | 205  |
-- | jack  | 2017-02-03  | 23    | 228  |
-- | jack  | 2017-04-06  | 42    | 270  |
-- | mart  | 2017-04-08  | 62    | 332  |
-- | mart  | 2017-04-09  | 68    | 400  |
-- | mart  | 2017-04-11  | 75    | 475  |
-- | mart  | 2017-04-13  | 94    | 569  |
-- | neil  | 2017-05-10  | 12    | 581  |
-- | neil  | 2017-06-12  | 80    | 661  |
-- +-------+-------------+-------+------+

--  求所有顾客的购买明细及按照日期进行排序后
--  求所有顾客的cost  第一行 到 当前行 累加
--    所有顾客的cost 上一行 到 当前行 的累加和 
--    所有顾客的cost 上一行 到 下一行 的累加和
--    所有顾客的cost 当前行 到 下一行 的累加和
--    所有顾客的cost 当前行 到 最后一行的累加和

select
  name,
  orderdate,
  cost,
  sum(cost) over(order by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW) f_c,
  sum(cost) over(order by orderdate rows between 1 PRECEDING and CURRENT ROW ) p_c,
  sum(cost) over(order by orderdate rows between 1 PRECEDING and 1 FOLLOWING ) p_n,
  sum(cost) over(order by orderdate rows between CURRENT ROW and 1 FOLLOWING ) c_n,
  sum(cost) over(order by orderdate rows between CURRENT ROW and UNBOUNDED FOLLOWING ) c_l
from
  business;
-- +-------+-------------+-----+------+------+------+------+------+
-- | name  |  orderdate  | cost| f_c  | p_c  | p_n  | c_n  | c_l  |
-- +-------+-------------+-----+------+------+------+------+------+
-- | jack  | 2017-01-01  | 10  | 10   | 10   | 25   | 25   | 661  |
-- | tony  | 2017-01-02  | 15  | 25   | 25   | 54   | 44   | 651  |
-- | tony  | 2017-01-04  | 29  | 54   | 44   | 90   | 75   | 636  |
-- | jack  | 2017-01-05  | 46  | 100  | 75   | 125  | 96   | 607  |
-- | tony  | 2017-01-07  | 50  | 150  | 96   | 151  | 105  | 561  |
-- | jack  | 2017-01-08  | 55  | 205  | 105  | 128  | 78   | 511  |
-- | jack  | 2017-02-03  | 23  | 228  | 78   | 120  | 65   | 456  |
-- | jack  | 2017-04-06  | 42  | 270  | 65   | 127  | 104  | 433  |
-- | mart  | 2017-04-08  | 62  | 332  | 104  | 172  | 130  | 391  |
-- | mart  | 2017-04-09  | 68  | 400  | 130  | 205  | 143  | 329  |
-- | mart  | 2017-04-11  | 75  | 475  | 143  | 237  | 169  | 261  |
-- | mart  | 2017-04-13  | 94  | 569  | 169  | 181  | 106  | 186  |
-- | neil  | 2017-05-10  | 12  | 581  | 106  | 186  | 92   | 92   |
-- | neil  | 2017-06-12  | 80  | 661  | 92   | 92   | 80   | 80   |
-- +-------+-------------+-----+------+------+------+------+------+

-- 需求四：查询每个顾客上次的购买时间 及 下一次的购买时间 
select
   name,
   cost, 
   orderdate c_orderdate,
   lag(orderdate ,1 ,'1970-01-01') over(partition by name  order by orderdate) p_orderdate,
   lead(orderdate ,1 ,'9999-01-01') over(partition by name  order by orderdate) p_orderdate
from 
  business

-- +-------+-------+--------------+--------------+--------------+
-- | name  | cost  | c_orderdate  | p_orderdate  | p_orderdate  |
-- +-------+-------+--------------+--------------+--------------+
-- | jack  | 10    | 2017-01-01   | 1970-01-01   | 2017-01-05   |
-- | jack  | 46    | 2017-01-05   | 2017-01-01   | 2017-01-08   |
-- | jack  | 55    | 2017-01-08   | 2017-01-05   | 2017-02-03   |
-- | jack  | 23    | 2017-02-03   | 2017-01-08   | 2017-04-06   |
-- | jack  | 42    | 2017-04-06   | 2017-02-03   | 9999-01-01   |
-- | mart  | 62    | 2017-04-08   | 1970-01-01   | 2017-04-09   |
-- | mart  | 68    | 2017-04-09   | 2017-04-08   | 2017-04-11   |
-- | mart  | 75    | 2017-04-11   | 2017-04-09   | 2017-04-13   |
-- | mart  | 94    | 2017-04-13   | 2017-04-11   | 9999-01-01   |
-- | neil  | 12    | 2017-05-10   | 1970-01-01   | 2017-06-12   |
-- | neil  | 80    | 2017-06-12   | 2017-05-10   | 9999-01-01   |
-- | tony  | 15    | 2017-01-02   | 1970-01-01   | 2017-01-04   |
-- | tony  | 29    | 2017-01-04   | 2017-01-02   | 2017-01-07   |
-- | tony  | 50    | 2017-01-07   | 2017-01-04   | 9999-01-01   |
-- +-------+-------+--------------+--------------+--------------+

-- 需求五：查询前20%时间的订单信息
select 
  t1.name, 
  t1.orderdate,
  t1.cost ,
  t1.gid
from 
(select
  name, 
  orderdate,
  cost, 
  ntile(5) over(order by orderdate ) gid
from 
  business) t1
where t1.gid = 1 ; 

-- +----------+---------------+----------+---------+
-- | t1.name  | t1.orderdate  | t1.cost  | t1.gid  |
-- +----------+---------------+----------+---------+
-- | jack     | 2017-01-01    | 10       | 1       |
-- | tony     | 2017-01-02    | 15       | 1       |
-- | tony     | 2017-01-04    | 29       | 1       |
-- +----------+---------------+----------+---------+

Rank

-- 相关函数（100，100，99，99，98）
-- 排序相同时会重复，总数不会变（1，1，3，3，5）
rank()
-- 排序相同时会重复，总数会减少（1，1，2，2，3）
dense_rank()
-- 会根据顺序计算（1，2，3，4，5）
row_number()

-- 数据
-- +-------------+----------------+--------------+
-- | score.name  | score.subject  | score.score  |
-- +-------------+----------------+--------------+
-- | 孙悟空       | 语文            | 87           |
-- | 孙悟空       | 数学            | 95           |
-- | 孙悟空       | 英语            | 68           |
-- | 大海         | 语文           | 94           |
-- | 大海         | 数学           | 56           |
-- | 大海         | 英语           | 84           |
-- | 宋宋         | 语文           | 64           |
-- | 宋宋         | 数学           | 86           |
-- | 宋宋         | 英语           | 84           |
-- | 婷婷         | 语文           | 65           |
-- | 婷婷         | 数学           | 85           |
-- | 婷婷         | 英语           | 78           |
-- +-------------+----------------+--------------+

-- 需求: 按照学科进行排名
select
  name, 
  subject,
  score,
  rank() over(partition by subject order by score desc ) rk,
  dense_rank() over(partition by subject order by score desc ) drk ,
  row_number() over(partition by subject order by score desc ) rn
from
  score;
-- +-------+----------+--------+-----+------+-----+
-- | name  | subject  | score  | rk  | drk  | rn  |
-- +-------+----------+--------+-----+------+-----+
-- | 孙悟空 | 数学      | 95     | 1   | 1    | 1   |
-- | 宋宋   | 数学      | 86     | 2   | 2    | 2   |
-- | 婷婷   | 数学      | 85     | 3   | 3    | 3   |
-- | 大海   | 数学      | 56     | 4   | 4    | 4   |
-- | 宋宋   | 英语      | 84     | 1   | 1    | 1   |
-- | 大海   | 英语      | 84     | 1   | 1    | 2   |
-- | 婷婷   | 英语      | 78     | 3   | 2    | 3   |
-- | 孙悟空 | 英语      | 68     | 4   | 3    | 4   |
-- | 大海   | 语文      | 94     | 1   | 1    | 1   |
-- | 孙悟空 | 语文      | 87     | 2   | 2    | 2   |
-- | 婷婷   | 语文      | 65     | 3   | 3    | 3   |
-- | 宋宋   | 语文      | 64     | 4   | 4    | 4   |
-- +-------+----------+--------+-----+------+-----+

自定义函数

步骤

官方地址：https://cwiki.apache.org/confluence/display/Hive/HivePlugins

导入依赖

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>3.1.2</version>
</dependency>

继承Hive提供的类

1 2	`org.apache.hadoop.hive.ql.udf.generic.GenericUDF org.apache.hadoop.hive.ql.udf.generic.GenericUDTF`

实现类中的抽象方法
运行函数
- 方式1：永久使用
  - 放入lib目录
  - 重启Hive
- 方式2：临时使用，hive窗口关闭后会失效
  1. 在hive的命令行窗口创建函数
  2. 添加jar
    1
    add jar linux_jar_path

创建function

1
2
3

create [temporary] function [dbname.]function_name AS class_name;

create temporary function mylen as 'cc.mousse.CalStrLenUDF';

在hive的命令行窗口删除函数

1	`drop [temporary] function [if exists] [dbname.]function_name;`

UDF函数

user-defined function，一进一出

/**
 * 计算字符串长度
 */
public class CalStrLenUDF extends GenericUDF {

  /**
     * 初始化方法
     * @param objectInspectors 传入到函数中参数对应类型的鉴别器对象（Hive自己封装的类型）
     * @return 指定函数返回值类型对应的鉴别器对象
     * @throws UDFArgumentException 参数异常
     */
  @Override
  public ObjectInspector initialize(ObjectInspector[] objectInspectors) throws UDFArgumentException {
    // 校验函数参数个数
    if (objectInspectors == null || objectInspectors.length != 1) {
      throw new UDFArgumentLengthException("参数个数错误");
    }
    // 校验函数参数类型是否为基本参数类型
    if (objectInspectors[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
      throw new UDFArgumentTypeException(0, "参数类型错误");
    }
    // 返回函数返回值类型对应的鉴别器类型
    return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
  }

  /**
     * 函数核型处理方法
     * @param deferredObjects 传入到函数的参数
     * @return 函数返回值
     */
  @Override
  public Object evaluate(DeferredObject[] deferredObjects) throws HiveException {
    // 获取参数
    Object o = deferredObjects[0].get();
    // 返回字符串长度
    return o == null ? 0 : o.toString().length();
  }

  @Override
  public String getDisplayString(String[] strings) {
    return "";
  }

}

UDTF函数

User-Defined Table-Generating Functions，一进多出

/**
 * 分割字符串为独立的单词
 * input：'hello-1,world-2,hadoop-3,hive,4' ',' '-'
 * output：
 *      hello   1
 *      world   2
 *      hadoop  3
 *      hive    4
 */
public class SplitStrToColsUDTF extends GenericUDTF {

  // 返回用
  private List<String> outVal = new ArrayList<>();

  /**
     * 初始化方法
     * @param argOIs 结构体鉴别器对象
     */
  @Override
  public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
    List<? extends StructField> structFieldRefs = argOIs.getAllStructFieldRefs();
    // 校验函数参数个数
    if (structFieldRefs.size() != 3) {
      throw new UDFArgumentLengthException("参数个数错误");
    }
    // 校验函数参数类型是否为基本参数类型
    for (int i = 0; i < structFieldRefs.size(); i++) {
      if (structFieldRefs.get(i).getFieldObjectInspector().getCategory() != ObjectInspector.Category.PRIMITIVE) {
        throw new UDFArgumentTypeException(i, "参数类型错误");
      }
    }
    // 返回函数返回值类型对应的鉴别器类型
    // 返回函数值的列名
    List<String> structFieldNames = new ArrayList<>();
    structFieldNames.add("word");
    structFieldNames.add("num");
    // 返回函数值的类型
    List<ObjectInspector> structFieldObjectInspectors = new ArrayList<>();
    structFieldObjectInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    structFieldObjectInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    return ObjectInspectorFactory.getStandardStructObjectInspector(structFieldNames, structFieldObjectInspectors);
  }

  /**
     * 函数核心处理方法
     * @param objects 函数参数
     */
  @Override
  public void process(Object[] objects) throws HiveException {
    // 切割
    String[] words = objects[0].toString().split(objects[1].toString());
    // 将每个单词作为一行写出去
    for (String word : words) {
      outVal.clear();
      outVal.addAll(Arrays.asList(word.split(objects[2].toString())));
      forward(outVal);
    }
  }

}

UDAF函数

User-Defined Aggregation Function，聚集函数，多进一出

压缩存储

压缩

参数配置

mapred-site.xml

参数	默认值	阶段	建议
io.compression.codecs （在core-site.xml中配置）	org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, org.apache.hadoop.io.compress.Lz4Codec	输入压缩	Hadoop使用文件扩展名判断是否支持某种编解码器
mapreduce.map.output.compress	false	mapper输出	这个参数设为true启用压缩
mapreduce.map.output.compress.codec	org.apache.hadoop.io.compress.DefaultCodec	mapper输出	使用LZO、LZ4或snappy编解码器在此阶段压缩数据
mapreduce.output.fileoutputformat.compress	false	reducer输出	这个参数设为true启用压缩
mapreduce.output.fileoutputformat.compress.codec	org.apache.hadoop.io.compress. DefaultCodec	reducer输出	使用标准工具或者编解码器，如gzip和bzip2
mapreduce.output.fileoutputformat.compress.type	RECORD	reducer输出	SequenceFile输出使用的压缩类型：NONE和BLOCK

开启Map输出阶段压缩

开启hive中间传输数据压缩功能

1	`set hive.exec.compress.intermediate=true;`

开启mapreduce中map输出压缩功能

1	`set mapreduce.map.output.compress=true;`

设置mapreduce中map输出数据的压缩方式

1 2	`set mapreduce.map.output.compress.codec= org.apache.hadoop.io.compress.SnappyCodec;`

执行查询语句
1
select count(ename) name from emp;

开启Reduce输出阶段压缩

开启hive最终输出数据压缩功能
1
set hive.exec.compress.output=true;

开启mapreduce最终输出数据压缩

1	`set mapreduce.output.fileoutputformat.compress=true;`

设置mapreduce最终数据输出压缩为块压缩

1	`set mapreduce.output.fileoutputformat.compress.type=BLOCK;`

测试一下输出结果是否是压缩文件

1 2	`insert overwrite local directory '/opt/module/hive/datas/distribute-result' select * from emp distribute by deptno sort by empno desc;`

存储

创建

1
2
3

create table xxx (
)
stored as 存储格式;

列式存储和行式存储

+---+---+---+
| a | b | c |		行存储：a1 b1 c1 a2 b2 c2 a3 b3 c3
+---+---+---+		列存储：a1 a2 a3 b1 b2 b3 c1 c2 c3
|a1 |b1 |c1 |
|a2 |b2 |c2 |
|a3 |b3 |c3 |
+---+---+---+

行存储
- 查询满足条件的一整行数据的时候，列存储则需要去每个聚集的字段找到对应的每个列的值，行存储只需要找到其中一个值，其余的值都在相邻地方，所以此时行存储查询的速度更快
- TEXTFILE，SEQUENCEFILE
列存储
- 因为每个字段的数据聚集存储，在查询只需要少数几个字段的时候，能大大减少读取的数据量；每个字段的数据类型一定是相同的，列式存储可以针对性的设计更好的设计压缩算法
- ORC，PARQUET

TextFile格式

默认，数据不压缩，磁盘开销大，数据解析开销大
可结合Gzip、Bzip2使用，但使用Gzip这种方式，hive不会对数据进行切分，从而无法对数据进行并行操作

Orc格式

-------------------
		Index Data
		---------------

		Row Data

		---------------
		Stripe Footer
-------------------
		Index Data
		---------------

		Row Data

		---------------
		Stripe Footer
-------------------
		File Footer
		---------------
		Postsript
		---------------

由1个或多个stripe组成，每个stripe一般为HDFS的块大小，每一个stripe包含多条记录，这些记录按照列进行独立存储
不能直接load

默认使用ZLIB压缩

create table xxx(
)
stored as orc
tblproperties("orc.compress"="ZLIB");
-- 设置orc存储不使用压缩
tblproperties("orc.compress"="NONE");

Parquet格式

以二进制方式存，不可直接读取，文件中包括该文件的数据和元数据，自解析
1
2
stored as parquet tblproperties("parquet.compression"="SNAPPY");

调优

执行计划

不会真正执行SQL

基本语法

1	`EXPLAIN [EXTENDED \| DEPENDENCY \| AUTHORIZATION] query`

查看详细执行计划

1 2	`explain extended select * from emp; explain extended select deptno, avg(sal) avg_sal from emp group by deptno;`

Fetch抓取

Hive中对某些情况的查询可以不必使用MapReduce计算

该属性修改为more以后，在全局查找、字段查找、limit查找等都不走mapreduce

<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Expects one of [none, minimal, more].
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins.
    0. none : disable hive.fetch.task.conversion
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more  : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
  </description>
</property>

本地模式

可以通过本地模式在单台机器上处理所有的任务。对于小数据集，执行时间可以明显被缩短

可以通过设置hive.exec.mode.local.auto的值为true，来让Hive在适当的时候自动启动这个优化

-- 开启本地mr
set hive.exec.mode.local.auto=true;
-- 设置local mr的最大输入数据量，当输入数据量小于这个值时采用local  mr的方式，默认为134217728，即128M
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- 设置local mr的最大输入文件个数，当输入文件个数小于这个值时采用local mr的方式，默认为4
set hive.exec.mode.local.auto.input.files.max=10;

表的优化

小表Join大表

新版的hive已经对小表JOIN大表和大表JOIN小表进行了优化。小表放在左边和右边已经没有明显区别

参数设置

-- 设置自动选择Mapjoin，默认为true
set hive.auto.convert.join = true; 
-- 大表小表的阈值设置（默认25M以下认为是小表）
set hive.mapjoin.smalltable.filesize = 25000000;

大表Join大表

空Key过滤

有时join超时是因为某些key对应的数据太多，而相同key对应的数据都会发送到相同的reducer上，从而导致内存不够。此时我们应该仔细分析这些异常的key，很多情况下，这些key对应的数据是异常数据，我们需要在SQL语句中进行过滤。例如key对应的字段为空

1	`insert overwrite table jointable select n.* from (select * from nullidtable where id is not null ) n left join bigtable o on n.id = o.id;`

空Key转换

有时虽然某个key为空对应的数据很多，但是相应的数据不是异常数据，必须要包含在join的结果中，此时我们可以表a中key为空的字段赋一个随机的值，使得数据随机均匀地分不到不同的reducer上

-- 设置5个reduce个数
set mapreduce.job.reduces = 5;
-- JOIN两张表
insert overwrite table jointable
select n.* from nullidtable n full join bigtable o on
-- 如果为空随机生成一个值
nvl(n.id,rand()) = o.id;

Sort Merge Bucket join

-- 创建分桶表1,桶的个数不要超过可用CPU的核数
create table bigtable_buck1(
    id bigint,
    t bigint,
    uid string,
    keyword string,
    url_rank int,
    click_num int,
    click_url string)
clustered by(id) 
sorted by(id)
into 6 buckets
row format delimited fields terminated by '\t';

insert into bigtable_buck1 select * from bigtable; 
-- 创建分通表2,桶的个数不要超过可用CPU的核数
create table bigtable_buck2(
    id bigint,
    t bigint,
    uid string,
    keyword string,
    url_rank int,
    click_num int,
    click_url string)
clustered by(id)
sorted by(id) 
into 6 buckets
row format delimited fields terminated by '\t';

insert into bigtable_buck2 select * from bigtable; 
-- 设置参数
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

Group By

默认Map阶段同一Key数据分发给一个reduce，当一个key数据过大时就倾斜
不是所有的聚合操作都需要在Reduce端完成，很多聚合操作都可以先在Map端进行部分聚合，最后在Reduce端得出最终结果

开启Map端聚合参数设置

-- 是否在Map端进行聚合，默认为True
set hive.map.aggr = true
-- 在Map端进行聚合操作的条目数目
set hive.groupby.mapaggr.checkinterval = 100000
-- 有数据倾斜的时候进行负载均衡（默认是false）
set hive.groupby.skewindata = true
-- 当选项设定为 true，生成的查询计划会有两个MR Job。第一个MR Job中，Map的输出结果会随机分布到Reduce中，每个Reduce做部分聚合操作，并输出结果，这样处理的结果是相同的Group By Key有可能被分发到不同的Reduce中，从而达到负载均衡的目的；第二个MR Job再根据预处理的数据结果按照Group By Key分布到Reduce中（这个过程可以保证相同的Group By Key被分布到同一个Reduce中），最后完成最终的聚合操作

去重统计

数据量小的时候无所谓，数据量大的情况下，由于COUNT DISTINCT操作需要用一个Reduce Task来完成，这一个Reduce需要处理的数据量太大，就会导致整个Job很难完成，一般COUNT DISTINCT使用先GROUP BY再COUNT的方式替换,但是需要注意group by造成的数据倾斜问题

select count(distinct id) from bigtable;

select count(id) from (select id from bigtable group by id) a;
-- 虽然会多用一个Job来完成，但在数据量大的情况下，这个绝对是值得的

笛卡尔积

尽量避免笛卡尔积，join的时候不加on条件，或者无效的on条件，Hive只能使用1个reducer来完成笛卡尔积

行列过滤

列处理：在SELECT中，只拿需要的列，如果有分区，尽量使用分区过滤，少用SELECT *
行处理：在分区剪裁中，当使用外关联时，如果将副表的过滤条件写在Where后面，那么就会先全表关联，之后再过滤

hive (default)> select o.id from bigtable b
join bigtable  o.id = b.id
where o.id <= 10;
-- Time taken: 34.406 seconds, Fetched: 100 row(s)

hive (default)> select b.id from bigtable b
join (select id from bigtable where id <= 10 ) o on b.id = o.id;
-- Time taken: 30.058 seconds, Fetched: 100 row(s)

分区分桶

设置MapReduce数

复杂文件增加Map数
- 当input的文件都很大，任务逻辑复杂，map执行非常慢的时候，可以考虑增加Map数
- 降低切片值增加Map数，反之减少
- 设置最大切片值为100个字节
  1
  2
  3
  computeSliteSize(Math.max(minSize,Math.min(maxSize,blocksize)))=blocksize=128M set mapreduce.input.fileinputformat.split.maxsize=100;

小文件进行合并

在map执行前合并小文件，减少map数
CombineHiveInputFormat具有对小文件进行合并的功能（系统默认的格式），HiveInputFormat没有对小文件合并功能
1
set hive.input.format= org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

在MapReduce的任务结束时合并小文件的设置

-- 在map-only任务结束时合并小文件，默认true
SET hive.merge.mapfiles = true;
-- 在map-reduce任务结束时合并小文件，默认false
SET hive.merge.mapredfiles = true;
-- 合并文件的大小，默认256M
SET hive.merge.size.per.task = 268435456;
-- 当输出文件的平均大小小于该值时，启动一个独立的map-reduce任务进行文件merge
SET hive.merge.smallfiles.avgsize = 16777216;

合理设置Reduce数

处理大数据量利用合适的reduce数；使单个reduce任务处理数据量大小要合适

调整方法1

-- 每个Reduce处理的数据量默认是256MB
hive.exec.reducers.bytes.per.reducer=256000000
-- 每个任务最大的reduce数，默认为1009
hive.exec.reducers.max=1009
-- 计算reducer数的公式
N=min(参数2，总输入数据量/参数1)

调整方法2：mapred-default.xml

1 2	`-- 设置每个job的Reduce个数 set mapreduce.job.reduces = 15;`

并行执行

-- 打开任务并行执行
set hive.exec.parallel=true;
-- 同一个sql允许最大并行度，默认为8。
set hive.exec.parallel.thread.number=16;

严格模式

防止一些危险操作
分区表不使用分区过滤
- hive.strict.checks.no.partition.filter=true：对于分区表，除非where语句中含有分区字段过滤条件来限制范围，否则不允许执行
使用order by没有limit过滤
- hive.strict.checks.orderby.no.limit=true：使用了order by语句的查询，要求必须使用limit语句
笛卡尔积
- hive.strict.checks.cartesian.product=true：限制笛卡尔积的查询

实战

需求

统计硅谷影音视频网站的常规指标，各种TopN指标：
-- 统计视频观看数Top10
-- 统计视频类别热度Top10
-- 统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
-- 统计视频观看数Top50所关联视频的所属类别Rank
-- 统计每个类别中的视频热度Top10,以Music为例
-- 统计每个类别视频观看数Top10
-- 统计上传视频最多的用户Top10以及他们上传的视频观看次数在前20的视频

数据结构

视频表

字段	备注	详细描述
videoId	视频唯一id（String）	11位字符串
uploader	视频上传者（String）	上传视频的用户名String
age	视频年龄（int）	视频在平台上的整数天
category	视频类别（Array）	上传视频指定的视频分类
length	视频长度（Int）	整形数字标识的视频长度
views	观看次数（Int）	视频被浏览的次数
rate	视频评分（Double）	满分5分
Ratings	流量（Int）	视频的流量，整型数字
conments	评论数（Int）	一个视频的整数评论数
relatedId	相关视频id（Array）	相关视频的id，最多20个

用户表

字段备注 字段类型

uploader 上传者用户名 string

videos 上传视频数 int

friends 朋友数量 int

准备工作

ELT

视频可以有多个所属分类，每个所属分类用&符号分割，且分割的两边有空格字符，同时相关视频也是可以有多个元素，多个相关视频又用“\t”进行分割。为了分析数据时方便对存在多个子元素的数据进行操作，我们首先进行数据重组清洗操作。即：将所有的类别用“&”分割，同时去掉两边空格，多个相关视频id也使用“&”进行分割

封装工具类

/**
     * 清洗视屏数据
     *
     * @param line 一行的数据
     * @return 数据合法返回清洗完的数据，数据不合法返回null
     */
public static String videoData(String line) {
  StringBuilder res = new StringBuilder();
  String[] fields = line.split("\t");
  int len = fields.length;
  // 数据长度>9
  if (len < 9) return null;
  // 去掉视频类别中的空格
  fields[3] = fields[3].replaceAll(" ", "");
  // 将关联视频通过&拼接
  for (int i = 0; i < len; i++) {
    res.append(fields[i]);
    if (i != len - 1) {
      res.append(i > 8 ? "&" : "\t");
    }
  }
  return res.toString();
}

Mapper

public class ETLMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  private final Text outKey = new Text();

  @Override
  protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
    String res = ETLUtils.videoData(value.toString());
    if (res == null) return;
    outKey.set(res);
    context.write(outKey, NullWritable.get());
  }

}

Driver

public class ETLDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(ETLDriver.class);
        job.setMapperClass(ETLMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(0);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 1 : 0);
    }

}

程序打包为etl.jar并上传到/hive/datas 目录下
上传原始数据到HDFS

ETL数据

1	`hadoop jar etl.jar cc.mousse.etl.ETLDriver /video/video/*.txt /video/video/output`

数据表

创建原始数据表

-- video_ori
create table video_ori(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as textfile
location '/video/video/output';

-- video_user_ori
create table video_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited 
fields terminated by "\t" 
stored as textfile
location '/video/user/';

创建最终表

-- video_orc，带压缩
create table video_orc(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
stored as orc
tblproperties("orc.compress"="SNAPPY");

-- video_user_orc
create table video_user_orc(
    uploader string,
    videos int,
    friends int)
row format delimited 
fields terminated by "\t" 
stored as orc
tblproperties("orc.compress"="SNAPPY");

-- 插入数据
load data inpath "/video/video/output" into table video_ori;
load data inpath "/video/user" into table video_user_ori;

insert into table video_orc select * from video_ori;
insert into table video_user_orc select * from video_user_ori;

安装Tez引擎（了解，现在已经不用了）

将tez**安装包拷贝到集群，并解压tar包

1 2	`mkdir /opt/module/tez tar -zxvf /opt/software/tez-0.10.1-SNAPSHOT-minimal.tar.gz -C /opt/module/tez`

上传tez依赖到HDFS

1 2	`hadoop fs -mkdir /tez hadoop fs -put /opt/software/tez-0.10.1-SNAPSHOT.tar.gz /tez`

新建tez-site.xml

<!--
vim $HADOOP_HOME/etc/hadoop/tez-site.xml
添加如下内容：
-->

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/tez/tez-0.10.1-SNAPSHOT.tar.gz</value>
  </property>
  <property>
    <name>tez.use.cluster.hadoop-libs</name>
    <value>true</value>
  </property>
  <property>
    <name>tez.am.resource.memory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>tez.am.resource.cpu.vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>tez.container.max.java.heap.fraction</name>
    <value>0.4</value>
  </property>
  <property>
    <name>tez.task.resource.memory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>tez.task.resource.cpu.vcores</name>
    <value>1</value>
  </property>
</configuration>

修改Hadoop环境变量

vim $HADOOP_HOME/etc/hadoop/shellprofile.d/tez.sh
# 添加Tez的Jar包相关信息
hadoop_add_profile tez
function _tez_hadoop_classpath
{
  hadoop_add_classpath "$HADOOP_HOME/etc/hadoop" after
  hadoop_add_classpath "/opt/module/tez/*" after
  hadoop_add_classpath "/opt/module/tez/lib/*" after
}

修改Hive的计算引擎

<!--
vim $HIVE_HOME/conf/hive-site.xml
-->

<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
<property>
  <name>hive.tez.container.size</name>
  <value>1024</value>
</property>

解决日志Jar包冲突

1	`rm /opt/module/tez/lib/slf4j-log4j12-1.7.10.jar`

业务分析

-- 1. 统计视频观看数Top10
select 
  videoId,
  views 
from 
  gulivideo_orc 
order by views desc 
limit 10 ;
-- +--------------+-----------+
-- |   videoid    |   views   |
-- +--------------+-----------+
-- | dMH0bHeiRNg  | 42513417  |
-- | 0XxI-hvPRRA  | 20282464  |
-- | 1dmVU08zVpA  | 16087899  |
-- | RB-wUgnyGv0  | 15712924  |
-- | QjA5faZF1A8  | 15256922  |
-- | -_CSo1gOd48  | 13199833  |
-- | 49IDp76kjPw  | 11970018  |
-- | tYnn51C3X_w  | 11823701  |
-- | pv5zWaTEVkI  | 11672017  |
-- | D2kJZOfq7zk  | 11184051  |
-- +--------------+-----------+
==========
-- 2. 统计视频类别热度Top10 (通过类别下包含的视频个数衡量热度)
-- 1）炸开视频的类别
select 
  videoId,
  category_name
from gulivideo_orc
lateral view explode(category) gulivideo_orc_tmp as category_name  =>t1
-- 2) 按照列别分组，求每个组中视频个数
select
  t1.category_name,
  count(t1.videoId) hot
from
  (
select 
  videoId,
  category_name
from gulivideo_orc
lateral view explode(category) gulivideo_orc_tmp as category_name
  )t1
group by t1.category_name 
order by hot desc 
limit 10 ; 
-- +-------------------+---------+
-- | t1.category_name  |   hot   |
-- +-------------------+---------+
-- | Music             | 179049  |
-- | Entertainment     | 127674  |
-- | Comedy            | 87818   |
-- | Animation         | 73293   |
-- | Film              | 73293   |
-- | Sports            | 67329   |
-- | Games             | 59817   |
-- | Gadgets           | 59817   |
-- | People            | 48890   |
-- | Blogs             | 48890   |
-- +-------------------+---------+
==========
-- 3. 统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
-- 1) 视频观看数最高的20个视频的所属类别
select
  videoId,
  views ,
  category
from 
  gulivideo_orc
order by views desc 
limit 20     =>t1 
-- +--------------+-----------+---------------------+
-- |   videoid    |   views   |      category       |
-- +--------------+-----------+---------------------+
-- | dMH0bHeiRNg  | 42513417  | ["Comedy"]          |
-- | 0XxI-hvPRRA  | 20282464  | ["Comedy"]          |
-- | 1dmVU08zVpA  | 16087899  | ["Entertainment"]   |
-- | RB-wUgnyGv0  | 15712924  | ["Entertainment"]   |
-- | QjA5faZF1A8  | 15256922  | ["Music"]           |
-- | -_CSo1gOd48  | 13199833  | ["People","Blogs"]  |
-- | 49IDp76kjPw  | 11970018  | ["Comedy"]          |
-- | tYnn51C3X_w  | 11823701  | ["Music"]           |
-- | pv5zWaTEVkI  | 11672017  | ["Music"]           |
-- | D2kJZOfq7zk  | 11184051  | ["People","Blogs"]  |
-- | vr3x_RRJdd4  | 10786529  | ["Entertainment"]   |
-- | lsO6D1rwrKc  | 10334975  | ["Entertainment"]   |
-- | 5P6UU6m3cqk  | 10107491  | ["Comedy"]          |
-- | 8bbTtPL1jRs  | 9579911   | ["Music"]           |
-- | _BuRwH59oAo  | 9566609   | ["Comedy"]          |
-- | aRNzWyD7C9o  | 8825788   | ["UNA"]             |
-- | UMf40daefsI  | 7533070   | ["Music"]           |
-- | ixsZy2425eY  | 7456875   | ["Entertainment"]   |
-- | MNxwAU_xAMk  | 7066676   | ["Comedy"]          |
-- | RUCZJVJ_M8o  | 6952767   | ["Entertainment"]   |
-- +--------------+-----------+---------------------+
-- 2) 炸开视频的类别
select
  t1.videoId,
  category_name
from 
 (
select
  videoId,
  views ,
  category
from 
  gulivideo_orc
order by views desc 
limit 20   
 )t1
lateral view explode(t1.category) t1_tmp as category_name  =>t2
-- +--------------+----------------+
-- |  t1.videoid  | category_name  |
-- +--------------+----------------+
-- | dMH0bHeiRNg  | Comedy         |
-- | 0XxI-hvPRRA  | Comedy         |
-- | 1dmVU08zVpA  | Entertainment  |
-- | RB-wUgnyGv0  | Entertainment  |
-- | QjA5faZF1A8  | Music          |
-- | -_CSo1gOd48  | People         |
-- | -_CSo1gOd48  | Blogs          |
-- | 49IDp76kjPw  | Comedy         |
-- | tYnn51C3X_w  | Music          |
-- | pv5zWaTEVkI  | Music          |
-- | D2kJZOfq7zk  | People         |
-- | D2kJZOfq7zk  | Blogs          |
-- | vr3x_RRJdd4  | Entertainment  |
-- | lsO6D1rwrKc  | Entertainment  |
-- | 5P6UU6m3cqk  | Comedy         |
-- | 8bbTtPL1jRs  | Music          |
-- | _BuRwH59oAo  | Comedy         |
-- | aRNzWyD7C9o  | UNA            |
-- | UMf40daefsI  | Music          |
-- | ixsZy2425eY  | Entertainment  |
-- | MNxwAU_xAMk  | Comedy         |
-- | RUCZJVJ_M8o  | Entertainment  |
-- +--------------+----------------+
-- 3) 按照类别分组,求每个类别下的视频个数
select 
  t2.category_name,
  count(t2.videoId) video_num
from 
 (
select
  t1.videoId,
  category_name
from 
 (
select
  videoId,
  views ,
  category
from 
  gulivideo_orc
order by views desc 
limit 20   
 )t1
lateral view explode(t1.category) t1_tmp as category_name
)t2
group by t2.category_name ;
-- +-------------------+------------+
-- | t2.category_name  | video_num  |
-- +-------------------+------------+
-- | Blogs             | 2          |
-- | Comedy            | 6          |
-- | Entertainment     | 6          |
-- | Music             | 5          |
-- | People            | 2          |
-- | UNA               | 1          |
-- +-------------------+------------+
==========
-- 4. 统计视频观看数Top50所关联视频的所属类别排名
-- 1) 统计视频观看数Top50所关联视频
select
  videoId,
  views,
  relatedId
from 
 gulivideo_orc
order by views desc 
limit 50   =>t1
-- 2) 炸开关联视频
select 
   relatedId_video
from 
(select
  videoId,
  views,
  relatedId
from 
 gulivideo_orc
order by views desc 
limit 50)t1
lateral view explode(t1.relatedId) t1_tmp as relatedId_video  => t2
-- 3) 关联原表， 求每个关联视频的类别
select
  t2.relatedId_video,
  t3.category
from 
  (
select 
   relatedId_video
from 
(select
  videoId,
  views,
  relatedId
from 
 gulivideo_orc
order by views desc 
limit 50)t1
lateral view explode(t1.relatedId) t1_tmp as relatedId_video
  )t2
join 
  gulivideo_orc t3 
on t2.relatedId_video = t3.videoId  => t4 
-- 4) 炸开类别
select 
  t4.relatedId_video,
  category_name
from 
(
select
  t2.relatedId_video,
  t3.category
from 
  (
select 
   relatedId_video
from 
(select
  videoId,
  views,
  relatedId
from 
 gulivideo_orc
order by views desc 
limit 50)t1
lateral view explode(t1.relatedId) t1_tmp as relatedId_video
  )t2
join 
  gulivideo_orc t3 
on t2.relatedId_video = t3.videoId
)t4
lateral view explode(t4.category) t4_tmp as category_name  =>t5
-- 5) 按照类别分组，求统计
select
   t5.category_name,
   count(t5.relatedId_video) video_num
from 
 (
select 
  t4.relatedId_video,
  category_name
from 
(
select
  t2.relatedId_video,
  t3.category
from 
  (
select 
   relatedId_video
from 
(select
  videoId,
  views,
  relatedId
from 
 gulivideo_orc
order by views desc 
limit 50)t1
lateral view explode(t1.relatedId) t1_tmp as relatedId_video
  )t2
join 
  gulivideo_orc t3 
on t2.relatedId_video = t3.videoId
)t4
lateral view explode(t4.category) t4_tmp as category_name
)t5
group by t5.category_name  => t6
-- 6) 求排名
select 
  t6.category_name,
  t6.video_num,
  rank() over(order by t6.video_num desc ) rk
from 
(
select
   t5.category_name,
   count(t5.relatedId_video) video_num
from 
 (
select 
  t4.relatedId_video,
  category_name
from 
(
select
  t2.relatedId_video,
  t3.category
from 
  (
select 
   relatedId_video
from 
(select
  videoId,
  views,
  relatedId
from 
 gulivideo_orc
order by views desc 
limit 50)t1
lateral view explode(t1.relatedId) t1_tmp as relatedId_video
  )t2
join 
  gulivideo_orc t3 
on t2.relatedId_video = t3.videoId
)t4
lateral view explode(t4.category) t4_tmp as category_name
)t5
group by t5.category_name 
)t6 ;
-- +-------------------+---------------+-----+
-- | t6.category_name  | t6.video_num  | rk  |
-- +-------------------+---------------+-----+
-- | Comedy            | 237           | 1   |
-- | Entertainment     | 216           | 2   |
-- | Music             | 195           | 3   |
-- | Blogs             | 51            | 4   |
-- | People            | 51            | 4   |
-- | Film              | 47            | 6   |
-- | Animation         | 47            | 6   |
-- | News              | 24            | 8   |
-- | Politics          | 24            | 8   |
-- | Games             | 22            | 10  |
-- | Gadgets           | 22            | 10  |
-- | Sports            | 19            | 12  |
-- | Howto             | 14            | 13  |
-- | DIY               | 14            | 13  |
-- | UNA               | 13            | 15  |
-- | Places            | 12            | 16  |
-- | Travel            | 12            | 16  |
-- | Animals           | 11            | 18  |
-- | Pets              | 11            | 18  |
-- | Autos             | 4             | 20  |
-- | Vehicles          | 4             | 20  |
-- +-------------------+---------------+-----+
==========
-- 5.统计每个类别中的视频热度Top10，以Music为例
-- 1) 炸开类别
select 
  videoId,
  category_name,
  views
from 
 gulivideo_orc
lateral view explode(category) gulivideo_orc_tmp as category_name =>t1
-- 2) 求Music类别下的top10
select
  t1.videoId,
  t1.category_name,
  t1.views
from 
(select 
  videoId,
  category_name,
  views
from 
 gulivideo_orc
lateral view explode(category) gulivideo_orc_tmp as category_name)t1
where t1.category_name = 'Music'
order by t1.views desc
limit 10
-- +--------------+-------------------+-----------+
-- |  t1.videoid  | t1.category_name  | t1.views  |
-- +--------------+-------------------+-----------+
-- | QjA5faZF1A8  | Music             | 15256922  |
-- | tYnn51C3X_w  | Music             | 11823701  |
-- | pv5zWaTEVkI  | Music             | 11672017  |
-- | 8bbTtPL1jRs  | Music             | 9579911   |
-- | UMf40daefsI  | Music             | 7533070   |
-- | -xEzGIuY7kw  | Music             | 6946033   |
-- | d6C0bNDqf3Y  | Music             | 6935578   |
-- | HSoVKUVOnfQ  | Music             | 6193057   |
-- | 3URfWTEPmtE  | Music             | 5581171   |
-- | thtmaZnxk_0  | Music             | 5142238   |
-- +--------------+-------------------+-----------+
==========
-- 6.统计每个类别视频观看数Top10
-- 1). 炸开类别
select 
  videoId,
  category_name,
  views
from 
 gulivideo_orc
lateral view explode(category) gulivideo_orc_tmp as category_name =>t1 
-- 2) 开窗, 按照类别分区,观看数排序，求排名
select 
  t1.videoId,
  t1.category_name,
  t1.views.
  rank() over(partition by t1.category_name order by t1.views desc ) rk 
from 
 (
select 
  videoId,
  category_name,
  views
from 
 gulivideo_orc
lateral view explode(category) gulivideo_orc_tmp as category_name 
 )t1   =>t2
 -- 3) 求top10 
select 
  t2.videoId,
  t2.category_name,
  t2.views ,
  t2.rk
from 
 (
select 
  t1.videoId,
  t1.category_name,
  t1.views ,
  rank() over(partition by t1.category_name order by t1.views desc ) rk 
from 
 (
select 
  videoId,
  category_name,
  views
from 
 gulivideo_orc
lateral view explode(category) gulivideo_orc_tmp as category_name 
 )t1 
)t2
where t2.rk <=10 ; 
==========
-- 7.统计上传视频最多的用户Top10以及他们上传的视频观看次数在前20的视频
-- 1) 统计上传视频最多的用户Top10
select
  uploader,
  videos
from
  gulivideo_user_orc
order by videos desc 
limit 10  =>t1
-- +---------------------+---------+
-- |      uploader       | videos  |
-- +---------------------+---------+
-- | expertvillage       | 86228   |
-- | TourFactory         | 49078   |
-- | myHotelVideo        | 33506   |
-- | AlexanderRodchenko  | 24315   |
-- | VHTStudios          | 20230   |
-- | ephemeral8          | 19498   |
-- | HSN                 | 15371   |
-- | rattanakorn         | 12637   |
-- | Ruchaneewan         | 10059   |
-- | futifu              | 9668    |
-- +---------------------+---------+
-- 2) 关联视频表，求他们上传的视频
select
  t1.uploader,
  t2.videoId,
  t2.views
from 
 (
select
  uploader,
  videos
from
  gulivideo_user_orc
order by videos desc 
limit 10
 )t1
join 
 gulivideo_orc t2 
on t1.uploader = t2.uploader
order by t2.views desc 
limit 20 ;
-- +----------------+--------------+-----------+
-- |  t1.uploader   |  t2.videoid  | t2.views  |
-- +----------------+--------------+-----------+
-- | expertvillage  | -IxHBW0YpZw  | 39059     |
-- | expertvillage  | BU-fT5XI_8I  | 29975     |
-- | expertvillage  | ADOcaBYbMl0  | 26270     |
-- | expertvillage  | yAqsULIDJFE  | 25511     |
-- | expertvillage  | vcm-t0TJXNg  | 25366     |
-- | expertvillage  | 0KYGFawp14c  | 24659     |
-- | expertvillage  | j4DpuPvMLF4  | 22593     |
-- | expertvillage  | Msu4lZb2oeQ  | 18822     |
-- | expertvillage  | ZHZVj44rpjE  | 16304     |
-- | expertvillage  | foATQY3wovI  | 13576     |
-- | expertvillage  | -UnQ8rcBOQs  | 13450     |
-- | expertvillage  | crtNd46CDks  | 11639     |
-- | expertvillage  | D1leA0JKHhE  | 11553     |
-- | expertvillage  | NJu2oG1Wm98  | 11452     |
-- | expertvillage  | CapbXdyv4j4  | 10915     |
-- | expertvillage  | epr5erraEp4  | 10817     |
-- | expertvillage  | IyQoDgaLM7U  | 10597     |
-- | expertvillage  | tbZibBnusLQ  | 10402     |
-- | expertvillage  | _GnCHodc7mk  | 9422      |
-- | expertvillage  | hvEYlSlRitU  | 7123      |
-- +----------------+--------------+-----------+

常见错误

更换Tez引擎后执行任务卡住，可以尝试调节容量调度器的资源调度策略

<!--将$HADOOP_HOME/etc/hadoop/capacity-scheduler.xml文件中的-->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.1</value>
  <description>
    Maximum percent of resources in the cluster which can be used to run 
    application masters i.e. controls number of concurrent running
    applications.
  </description>
</property>
<!--改成-->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>1</value>
  <description>
    Maximum percent of resources in the cluster which can be used to run 
    application masters i.e. controls number of concurrent running
    applications.
  </description>
</property>

连接不上mysql数据库
- 导错驱动包，应该把mysql-connector-java-5.1.27-bin.jar导入/opt/module/hive/lib的不是这个包。错把mysql-connector-java-5.1.27.tar.gz导入hive/lib包下
- 修改user表中的主机名称没有都修改为%，而是修改为localhost

hive默认的输入格式处理是CombineHiveInputFormat，会对小文件进行合并

set hive.input.format;
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
-- 可以采用HiveInputFormat就会根据分区数输出相应的文件。
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

不能执行mapreduce程序
- 可能是hadoop的yarn没开启
启动mysql服务时，报MySQL server PID file could not be found!异常
- 在/var/lock/subsys/mysql路径下创建hadoop102.pid，并在文件中添加内容：4396
报service mysql status MySQL is not running, but lock file (/var/lock/subsys/mysql[失败])异常
- 在/var/lib/mysql 目录下创建： -rw-rw—-. 1 mysql mysql 5 12月 22 16:41 hadoop102.pid 文件，并修改权限为 777

JVM堆内存溢

在yarn-site.xml中加入如下代码

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>

虚拟内存限制

在yarn-site.xml中添加如下配置

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

大数据

Hive

http://docs.mousse.cc/Hive/

作者

Mocha Mousse

发布于

2025年5月26日

许可协议

Hadoop 上一篇

JDBC 下一篇

字段	备注	字段类型
uploader	上传者用户名	string
videos	上传视频数	int
friends	朋友数量	int