hive常用操作笔记
# 文档收集
一些基本操作 https://cwiki.apache.org/confluence/display/Hive/GettingStarted
Hive内置函数说明文档(含常用UDAF、UDTF等)https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
# 一次查询,数据输出到多张表中
在同一组hive sql语句中,如何将数据输出到多张表?
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
2
3
4
5
ref: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-MULTITABLEINSERT
# 删除某个分区
alter table table_name drop partition (stat_year_month>='2018-01');
# 获取最新分区
select * from table_full_name
where date = max_pt('table_full_name')
2
命令说明: 对于分区的表,此函数返回该分区表的一级分区的最大值,以及该分区下所有的数据文件并按照字母排序。
# 分组排序号
按一些字段分组,增加该条数据在自己所在组中的序号。
格式:
rank/row_number/dense_rank() over (partition by field1, field2 order by order_field desc) as rank_column
2
通过举例直观理解 rank/row_number/dense_rank 的区别:
SELECT *,
RANK() OVER(PARTITION BY StyleID ORDER BY ID) AS 'RANK',
ROW_NUMBER() OVER(PARTITION BY StyleID ORDER BY ID) AS 'ROW_NUMBER',
DENSE_RANK() OVER(PARTITION BY StyleID ORDER BY ID) AS 'DENSE_RANK'
FROM T
2
3
4
5
结果:
StyleID ID RANK ROW_NUMBER DENSE_RANK
----------- -------- --------- --------------- ----------
1 1 1 1 1
1 1 1 2 1
1 1 1 3 1
1 2 4 4 2
2
3
4
5
6
ref: https://stackoverflow.com/a/7747342
# array、map字段的拆分
问题场景:比如表有两列, uid 和另外一个 array或者map字段,现在想要把array的每个值拆出来单独作为一行(uid, value),或者把map的每一个kv单独拆成一行(uid, key, value)。
explode()可以完成拆分:
select explode(array('A','B','C')) as col;
select explode(map('A',10,'B',20,'C',30)) as (key,value);
2
但是不可以再同时select其他字段了:
-- 比如下面的写法就不对
select uid, explode(array('A','B','C')) as col;
2
所以需要借助lateral_view。最终写法:
-- array
SELECT uid, a_value
FROM original_table LATERAL VIEW explode(array_field) temp_table_name AS a_value;
-- map
SELECT uid, m_key, m_value
FROM original_table LATERAL VIEW explode(map_field) temp_table_name AS m_key, m_value;
2
3
4
5
6
7
# 计算日期差值
datediff(string enddate, string startdate)
Returns the number of days from startdate to enddate: datediff('2009-03-01', '2009-02-27') = 2.