Some Handy HiveSQL Tips

大数据与人工智能分享 (Big Data and AI Sharing)

SORT_ARRAY

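The function is declared as follows.

ARRAY sort_array(ARRAY<T>)

Purpose: sorts the elements of the given array in ascending order.

Parameter: ARRAY, a value of ARRAY type. The elements may be of any type.

Return value: ARRAY type.

Example: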

-- Create the table
CREATE TABLE sort_array (
    c1 ARRAY<STRING>,
    c2 ARRAY<INT>
);

-- Load data
INSERT OVERWRITE TABLE sort_array
SELECT array('d','c','b','a') AS c1,
       array(4,3,2,1)         AS c2;

-- Query
SELECT sort_array(c1),
       sort_array(c2)
FROM   sort_array;

-- Result
["a","b","c","d"] [1,2,3,4]

Analytic functions

Basic syntax

analytic_function_name([argument_list])
OVER (
    [PARTITION BY partition_expression, ...]
    [ORDER BY sort_expression, ... [ASC|DESC]]
)

analytic_function_name: the function name, for example RANK(), SUM(), or FIRST_VALUE()
partition_expression: the partitioning column
sort_expression: the ordering column

Sample data

CREATE TABLE `orders` (
    `order_num` STRING COMMENT 'order number',
    `order_amount` DECIMAL(12, 2) COMMENT 'order amount',
    `advance_amount` DECIMAL(12, 2) COMMENT 'advance payment',
    `order_date` STRING COMMENT 'order date',
    `cust_code` STRING COMMENT 'customer',
    `agent_code` STRING COMMENT 'agent'
);

INSERT INTO orders VALUES('200100', '1000.00', '600.00', '2020-08-01', 'C00013', 'A003');
INSERT INTO orders VALUES('200110', '3000.00', '500.00', '2020-04-15', 'C00019', 'A010');
INSERT INTO orders VALUES('200107', '4500.00', '900.00', '2020-08-30', 'C00007', 'A010');
INSERT INTO orders VALUES('200112', '2000.00', '400.00', '2020-05-30', 'C00016', 'A007');
INSERT INTO orders VALUES('200113', '4000.00', '600.00', '2020-06-10', 'C00022', 'A002');
INSERT INTO orders VALUES('200102', '2000.00', '300.00', '2020-05-25', 'C00012', 'A012');
INSERT INTO orders VALUES('200114', '3500.00', '2000.00', '2020-08-15', 'C00002', 'A008');
INSERT INTO orders VALUES('200122', '2500.00', '400.00', '2020-09-16', 'C00003', 'A004');
INSERT INTO orders VALUES('200118', '500.00', '100.00', '2020-07-20', 'C00023', 'A006');
INSERT INTO orders VALUES('200119', '4000.00', '700.00', '2020-09-16', 'C00007', 'A010');
INSERT INTO orders VALUES('200121', '1500.00', '600.00', '2020-09-23', 'C00008', 'A004');
INSERT INTO orders VALUES('200130', '2500.00', '400.00', '2020-07-30', 'C00025', 'A011');
INSERT INTO orders VALUES('200134', '4200.00', '1800.00', '2020-09-25', 'C00004', 'A005');
INSERT INTO orders VALUES('200108', '4000.00', '600.00', '2020-02-15', 'C00008', 'A004');
INSERT INTO orders VALUES('200103', '1500.00', '700.00', '2020-05-15', 'C00021', 'A005');
INSERT INTO orders VALUES('200105', '2500.00', '500.00', '2020-07-18', 'C00025', 'A011');
INSERT INTO orders VALUES('200109', '3500.00', '800.00', '2020-07-30', 'C00011', 'A010');
INSERT INTO orders VALUES('200101', '3000.00', '1000.00', '2020-07-15', 'C00001', 'A008');
INSERT INTO orders VALUES('200111', '1000.00', '300.00', '2020-07-10', 'C00020', 'A008');
INSERT INTO orders VALUES('200104', '1500.00', '500.00', '2020-03-13', 'C00006', 'A004');
INSERT INTO orders VALUES('200106', '2500.00', '700.00', '2020-04-20', 'C00005', 'A002');
INSERT INTO orders VALUES('200125', '2000.00', '600.00', '2020-10-01', 'C00018', 'A005');
INSERT INTO orders VALUES('200117', '800.00', '200.00', '2020-10-20', 'C00014', 'A001');
INSERT INTO orders VALUES('200123', '500.00', '100.00', '2020-09-16', 'C00022', 'A002');
INSERT INTO orders VALUES('200120', '500.00', '100.00', '2020-07-20', 'C00009', 'A002');
INSERT INTO orders VALUES('200116', '500.00', '100.00', '2020-07-13', 'C00010', 'A009');
INSERT INTO orders VALUES('200124', '500.00', '100.00', '2020-06-20', 'C00017', 'A007');
INSERT INTO orders VALUES('200126', '500.00', '100.00', '2020-06-24', 'C00022', 'A002');
INSERT INTO orders VALUES('200129', '2500.00', '500.00', '2020-07-20', 'C00024', 'A006');
INSERT INTO orders VALUES('200127', '2500.00', '400.00', '2020-07-20', 'C00015', 'A003');
INSERT INTO orders VALUES('200128', '3500.00', '1500.00', '2020-07-20', 'C00009', 'A002');
INSERT INTO orders VALUES('200135', '2000.00', '800.00', '2020-09-16', 'C00007', 'A010');
INSERT INTO orders VALUES('200131', '900.00', '150.00', '2020-08-26', 'C00012', 'A012');
INSERT INTO orders VALUES('200133', '1200.00', '400.00', '2020-06-29', 'C00009', 'A002');

Ordered running total

SELECT agent_code,
       order_date,
       order_amount,
       SUM(order_amount) OVER (
           PARTITION BY agent_code
           ORDER BY order_date DESC
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS total_rev
FROM   orders
WHERE  order_date >= '2020-07-01'
  AND  order_date <= '2020-09-30';

Result

agent_code order_date order_amount total_rev
A002 2020-09-16 500.00 500.00
A002 2020-07-20 3500.00 4000.00
A002 2020-07-20 500.00 4500.00
A003 2020-08-01 1000.00 1000.00
A003 2020-07-20 2500.00 3500.00
A004 2020-09-23 1500.00 1500.00
A004 2020-09-16 2500.00 4000.00
A005 2020-09-25 4200.00 4200.00
A006 2020-07-20 2500.00 2500.00
A006 2020-07-20 500.00 3000.00
A008 2020-08-15 3500.00 3500.00
A008 2020-07-15 3000.00 6500.00
A008 2020-07-10 1000.00 7500.00
A009 2020-07-13 500.00 500.00
A010 2020-09-16 2000.00 2000.00
A010 2020-09-16 4000.00 6000.00
A010 2020-08-30 4500.00 10500.00
A010 2020-07-30 3500.00 14000.00
A011 2020-07-30 2500.00 2500.00
A011 2020-07-18 2500.00 5000.00
A012 2020-08-26 900.00 900.00

AVG() and SUM()

Requirement:

Each agent's moving average revenue and cumulative total revenue in the third quarter.

SELECT agent_code,
       order_date,
       AVG(order_amount) OVER (PARTITION BY agent_code ORDER BY order_date) AS avg_rev,
       SUM(order_amount) OVER (PARTITION BY agent_code ORDER BY order_date) AS total_rev
FROM   orders
WHERE  order_date >= '2020-07-01'
  AND  order_date <= '2020-09-30';

Output

agent_code  order_date  avg_rev  total_rev
A002    2020-07-20      2000    4000
A002    2020-07-20      2000    4000
A002    2020-09-16      1500    4500
A003    2020-07-20      2500    2500
A003    2020-08-01      1750    3500
A004    2020-09-16      2500    2500
A004    2020-09-23      2000    4000
A005    2020-09-25      4200    4200
A006    2020-07-20      1500    3000
A006    2020-07-20      1500    3000
A008    2020-07-10      1000    1000
A008    2020-07-15      2000    4000
A008    2020-08-15      2500    7500
A009    2020-07-13      500     500
A010    2020-07-30      3500    3500
A010    2020-08-30      4000    8000
A010    2020-09-16      3500    14000
A010    2020-09-16      3500    14000
A011    2020-07-18      2500    2500
A011    2020-07-30      2500    5000
A012    2020-08-26      900     900

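Rows that share an order_date show the same avg_rev and total_rev in this output, while the earlier running total with an explicit ROWS frame advanced one row at a time. This matches Hive's documented default frame when ORDER BY is given without a window clause: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which includes every row tied with the current one. The following is a minimal Python sketch of the two frame behaviors, not Hive code; the function names are hypothetical and the input is assumed to be already sorted by date.

```python
from itertools import groupby


def running_sum_rows(rows):
    """Emulate SUM(...) OVER (ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)."""
    total, out = 0, []
    for day, amount in rows:  # rows are assumed to be sorted by day
        total += amount
        out.append((day, amount, total))
    return out


def running_sum_range(rows):
    """Emulate SUM(...) OVER (ORDER BY day) with the default RANGE frame: tied rows share one total."""
    total, out = 0, []
    for _, peers in groupby(rows, key=lambda r: r[0]):
        peers = list(peers)
        total += sum(amount for _, amount in peers)  # add all tied rows at once
        out.extend((day, amount, total) for day, amount in peers)
    return out


# Agent A010's third-quarter orders from the sample data, sorted by date.
a010 = [("2020-07-30", 3500), ("2020-08-30", 4500),
        ("2020-09-16", 4000), ("2020-09-16", 2000)]

print(running_sum_rows(a010))
print(running_sum_range(a010))
```

With the RANGE variant, both 2020-09-16 rows end at 14000, as in the total_rev column above; with the ROWS variant, the first of them stops at 12000.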
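FIRST_VALUE() and LAST_VALUE()

first_value: the first value in the partition, after ordering, up to the current row
last_value: the last value in the partition, after ordering, up to the current row

Requirement

How many days after a customer's first purchase was each of their orders placed?

SELECT cust_code,
       order_date,
       datediff(order_date,
                FIRST_VALUE(order_date) OVER (PARTITION BY cust_code ORDER BY order_date)
       ) AS next_order_gap
FROM   orders
ORDER BY cust_code, next_order_gap;
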
Output

cust_code  order_date  next_order_gap
C00001  2020-07-15      0
C00002  2020-08-15      0
C00003  2020-09-16      0
C00004  2020-09-25      0
C00005  2020-04-20      0
C00006  2020-03-13      0
C00007  2020-08-30      0
C00007  2020-09-16      17
C00007  2020-09-16      17
C00008  2020-02-15      0
C00008  2020-09-23      221
C00009  2020-06-29      0
C00009  2020-07-20      21
C00009  2020-07-20      21
C00010  2020-07-13      0
C00011  2020-07-30      0
C00012  2020-05-25      0
C00012  2020-08-26      93
C00013  2020-08-01      0
C00014  2020-10-20      0
C00015  2020-07-20      0
C00016  2020-05-30      0
C00017  2020-06-20      0
C00018  2020-10-01      0
C00019  2020-04-15      0
C00020  2020-07-10      0
C00021  2020-05-15      0
C00022  2020-06-10      0
C00022  2020-06-24      14
C00022  2020-09-16      98
C00023  2020-07-20      0
C00024  2020-07-20      0
C00025  2020-07-18      0
C00025  2020-07-30      12

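To make the gap column easier to check by hand, here is a small Python sketch of the same computation: find each customer's earliest order date (the role FIRST_VALUE plays in the query) and subtract it from every order date (the role of datediff). The helper name days_since_first_order is hypothetical, and only two customers from the sample data are used.

```python
from datetime import date


def days_since_first_order(orders):
    """For each (cust_code, order_date) pair, return the days elapsed since that customer's first order."""
    first = {}
    for cust, day in orders:
        d = date.fromisoformat(day)
        if cust not in first or d < first[cust]:
            first[cust] = d  # earliest order date seen so far for this customer
    return sorted(
        (cust, day, (date.fromisoformat(day) - first[cust]).days)
        for cust, day in orders
    )


sample = [
    ("C00022", "2020-09-16"), ("C00022", "2020-06-10"), ("C00022", "2020-06-24"),
    ("C00008", "2020-09-23"), ("C00008", "2020-02-15"),
]

for row in days_since_first_order(sample):
    print(row)
```

The gaps for C00008 (0, 221) and C00022 (0, 14, 98) agree with the query output above.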
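LEAD() and LAG()

lead(value_expr[, offset[, default]]): returns the value from the row n rows below the current row within the window. The first argument is the column, the second is the number of rows to look ahead (optional, default 1), and the third is the default value (used when there is no row n rows below; if omitted, NULL is returned).
lag(value_expr[, offset[, default]]): the opposite of lead; returns the value from the row n rows above the current row within the window. The first argument is the column, the second is the number of rows to look back (optional, default 1), and the third is the default value (used when there is no row n rows above; if omitted, NULL is returned).

Requirement

For each agent, with orders listed from highest to lowest amount, what is the next-higher order amount before each order?

SELECT agent_code,
       order_amount,
       LAG(order_amount, 1) OVER (PARTITION BY agent_code ORDER BY order_amount DESC) AS last_highest_amount
FROM   orders
ORDER BY agent_code, order_amount DESC;
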
Output

agent_code  order_amount  last_highest_amount
A001    800     NULL
A002    4000    NULL
A002    3500    4000
A002    2500    3500
A002    1200    2500
A002    500     1200
A002    500     500
A002    500     500
A003    2500    NULL
A003    1000    2500
A004    4000    NULL
A004    2500    4000
A004    1500    2500
A004    1500    1500
A005    4200    NULL
A005    2000    4200
A005    1500    2000
A006    2500    NULL
A006    500     2500
A007    2000    NULL
A007    500     2000
A008    3500    NULL
A008    3000    3500
A008    1000    3000
A009    500     NULL
A010    4500    NULL
A010    4000    4500
A010    3500    4000
A010    3000    3500
A010    2000    3000
A011    2500    NULL
A011    2500    2500
A012    2000    NULL
A012    900     2000

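The offset and default arguments can be illustrated with a short Python sketch over one partition. The functions lag and lead below are hypothetical stand-ins that operate on an already ordered list and assume a non-negative offset; they are not Hive's implementation.

```python
def lag(values, offset=1, default=None):
    """For each position, return the value `offset` rows earlier, or `default` if no such row exists."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    """For each position, return the value `offset` rows later, or `default` if no such row exists."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default
            for i in range(n)]


# Agent A004's order amounts from the sample data, ordered as in ORDER BY order_amount DESC.
a004 = sorted([2500, 1500, 4000, 1500], reverse=True)

print(list(zip(a004, lag(a004))))      # pairs of (order_amount, previous amount)
print(list(zip(a004, lead(a004, 1, 0))))  # lead with an explicit default of 0
```

The lag pairs for A004 correspond to the A004 rows in the output above, with None standing in for NULL.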
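RANK() and DENSE_RANK()

**rank:** ranks the rows within a partition. Rows with equal values receive the same rank, and the rank that follows a tie skips ahead, so the sequence can have gaps. It is useful, for example, for finding the top N rows under a given condition. RANK() produces (1,2,2,4).

**dense_rank:** works like rank, except that the ranks it produces are consecutive, whereas the ranks produced by rank may have gaps. Rows with equal values still receive the same rank, but the next rank continues directly from the previous one.

DENSE_RANK() produces (1,2,2,3).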
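The two sequences can be reproduced with a small Python sketch. Both helpers are hypothetical and assume the input list is already in ranking order; the amounts are illustrative values containing one tie.

```python
def rank(sorted_values):
    """RANK(): tied values share a rank; the next distinct value takes its row position."""
    ranks = []
    for i, value in enumerate(sorted_values):
        if i > 0 and value == sorted_values[i - 1]:
            ranks.append(ranks[-1])  # tie: reuse the previous rank
        else:
            ranks.append(i + 1)      # position in the list, so gaps appear after ties
    return ranks


def dense_rank(sorted_values):
    """DENSE_RANK(): tied values share a rank; the next distinct value gets previous rank + 1."""
    ranks = []
    for i, value in enumerate(sorted_values):
        if i == 0:
            ranks.append(1)
        elif value == sorted_values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks


amounts = [4000, 3500, 3500, 3000]  # already sorted in descending order
print(rank(amounts))
print(dense_rank(amounts))
```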

Requirement

What is the second-highest order amount in each month?

-- 'yyyy-MM' is the calendar year and month; 'YYYY' would select the week-based year
SELECT order_num,
       order_date,
       order_amount,
       order_month
FROM (
    SELECT order_num,
           order_date,
           order_amount,
           DATE_FORMAT(order_date, 'yyyy-MM') AS order_month,
           DENSE_RANK() OVER (
               PARTITION BY DATE_FORMAT(order_date, 'yyyy-MM')
               ORDER BY order_amount DESC
           ) AS order_rank
    FROM orders
) t
WHERE order_rank = 2
ORDER BY order_date;

Output

order_num  order_date  order_amount  order_month
200106  2020-04-20      2500    2020-04
200103  2020-05-15      1500    2020-05
200133  2020-06-29      1200    2020-06
200101  2020-07-15      3000    2020-07
200114  2020-08-15      3500    2020-08
200119  2020-09-16      4000    2020-09
200117  2020-10-20      800     2020-10

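REGEXP_EXTRACT

Command format

string regexp_extract(string <source>, string <pattern>[, bigint <occurrence>])

Description

Matches the string source against pattern and returns the substring captured by the occurrence-th group.

Parameters

source: required. STRING type, the string to be split.
pattern: required. A STRING constant or regular expression, the pattern to match.
occurrence: optional. A BIGINT constant that must be greater than or equal to 0.

Return value

Returns a STRING. The rules are as follows:
If pattern is an empty string or contains no groups, an error is returned.
If occurrence is not of BIGINT type or is less than 0, an error is returned. When omitted, it defaults to 1, meaning the first group is returned. If occurrence is 0, the substring matching the whole pattern is returned.
If source, pattern, or occurrence is NULL, NULL is returned.

Examples

select regexp_extract('foothebar', '(foo)(.*?)(bar)', 0); -- returns foothebar

select regexp_extract('foothebar', '(foo)(.*?)(bar)', 1); -- returns foo

select regexp_extract('foothebar', '(foo)(.*?)(bar)', 2); -- returns the

select regexp_extract('foothebar', '(foo)(.*?)(bar)', 3); -- returns bar

Merging multiple rows into one row

WM_CONCAT

Command format

string wm_concat(string <separator>, string <colname>)

Description

Concatenates the values of colname, using the specified separator as the delimiter.

Parameters

separator: required. A STRING constant, the delimiter.
colname: required. STRING type. If the input is of BIGINT, DOUBLE, or DATETIME type, it is implicitly converted to STRING before the operation.

Return value

Returns a STRING. The rules are as follows:
If separator is not a STRING constant, an error is returned.
If colname is not of STRING, BIGINT, DOUBLE, or DATETIME type, an error is returned.
Rows where colname is NULL do not take part in the calculation.

Example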

-- Create the table
CREATE TABLE stu (
    class STRING,
    gender STRING,
    name STRING
);

-- Load data
INSERT INTO TABLE stu SELECT '1','M','lilei';
INSERT INTO TABLE stu SELECT '1','F','hanmeimei';
INSERT INTO TABLE stu SELECT '1','M','jim';
INSERT INTO TABLE stu SELECT '1','M','hanmeimei';
INSERT INTO TABLE stu SELECT '2','F','tom';
INSERT INTO TABLE stu SELECT '2','M','peter';

-- Query
SELECT class, wm_concat(distinct ',', name) FROM stu GROUP BY class;

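The query above lists no output, so the following Python sketch shows what grouping with a distinct concatenation would be expected to produce for the stu rows. It is an emulation under assumptions, not the engine's implementation: the helper name wm_concat_distinct is hypothetical, values are joined in first-seen order (a real engine may not guarantee any order), and a group with only NULL values is assumed to yield NULL. WM_CONCAT as documented here appears to follow the MaxCompute built-in; in stock Hive a similar result is commonly written as concat_ws(',', collect_set(name)).

```python
from collections import defaultdict


def wm_concat_distinct(rows, separator=","):
    """Group (key, value) pairs by key and join the distinct non-NULL values per key."""
    groups = defaultdict(list)
    for group_key, value in rows:
        values = groups[group_key]            # make sure the group exists even if value is None
        if value is not None and value not in values:
            values.append(value)              # keep first occurrence only (DISTINCT)
    return {key: (separator.join(values) if values else None)
            for key, values in groups.items()}


# (class, name) pairs taken from the stu example above.
stu = [("1", "lilei"), ("1", "hanmeimei"), ("1", "jim"),
       ("1", "hanmeimei"), ("2", "tom"), ("2", "peter")]

print(wm_concat_distinct(stu))
```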
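KEYVALUE

Command format

keyvalue(string <str>,[string <split1>,string <split2>,] string <key>)

keyvalue(string <str>,string <key>)

Description

Splits the string str into Key-Value pairs using split1, separates the key from the value in each pair using split2, and returns the Value that corresponds to key.

Parameters

By default, pairs are separated by ";" and the key and value within a pair are separated by ":".

key: required. STRING type. After the string is split by split1 and split2, the Value corresponding to this key is returned.
str: required. STRING type, the string to be split.
split1, split2: optional. STRING type. The strings used as separators; the source string is split by these two separators. If they are not specified, split1 defaults to ";" and split2 defaults to ":". If a substring produced by split1 contains more than one split2, the result is undefined.

Return value

Returns a STRING. The rules are as follows:
If split1 or split2 is NULL, NULL is returned.
If str or key is NULL, or no key matches, NULL is returned.
If more than one Key-Value pair matches, the Value of the first matching key is returned.

select keyvalue('0:1\;1:2', 1);  -- returns 2
select keyvalue('spm=123.qwe,cpn=101,act=890',',','=','spm');  -- returns 123.qwe
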
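The splitting rules can be sketched in a few lines of Python. The function below is a hypothetical emulation written from the description above, with assumptions labeled: the argument order differs from the SQL signature (key comes second so the separators can default), both separators are assumed to be non-empty, and a pair containing several split2 separators is split at the first one, although the description leaves that case undefined.

```python
def keyvalue(s, key, split1=";", split2=":"):
    """Return the value for `key` in a string of split1-separated key/value pairs, or None."""
    if s is None or key is None or split1 is None or split2 is None:
        return None  # any NULL argument yields NULL
    for pair in s.split(split1):
        k, sep, v = pair.partition(split2)  # split at the first split2 only (assumption)
        if sep and k == key:
            return v                        # first matching key wins
    return None                             # no matching key


print(keyvalue("0:1;1:2", "1"))
print(keyvalue("spm=123.qwe,cpn=101,act=890", "spm", ",", "="))
```

The two calls mirror the SQL examples above and produce 2 and 123.qwe as strings.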
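Optimization

distribute by + sort by vs. order by

order by sorts the result globally by the given column, which sends all map-side data to a single reducer; with large data volumes the job may run for a very long time.
distribute by controls the key by which map-side data is assigned to reducers; sort by may start multiple reducers as needed and guarantees that the output of each reducer is locally sorted.

group by vs. count(distinct)

When the data volume is very large, use group by, which can start multiple jobs.
When the data set is small or the key skew is pronounced, use count(distinct); a small number of reducers can handle it.

map join

Hive completes the join of the build table and the probe table directly on the map side, eliminating the reduce phase, which is very efficient.

set hive.auto.convert.join=true;
/*+ MAPJOIN(t1,t3,t4) */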
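As a rough illustration of the distribute by + sort by behavior described above, the Python sketch below assigns rows to a fixed number of simulated reducers by hashing a key and then sorts inside each reducer only. Everything here is an assumption for illustration: the function name, the use of zlib.crc32 as the hash, and the reducer count are not how Hive partitions data internally.

```python
import zlib


def distribute_and_sort(rows, key, sort_key, num_reducers=2):
    """Route rows to reducers by hashing key(row), then sort each reducer's rows locally."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Deterministic stand-in for the partitioner: same key, same reducer.
        bucket = zlib.crc32(str(key(row)).encode("utf-8")) % num_reducers
        reducers[bucket].append(row)
    # Each reducer sorts only its own rows; there is no global order across reducers.
    return [sorted(bucket_rows, key=sort_key) for bucket_rows in reducers]


# (agent_code, order_amount) pairs taken from the sample data.
rows = [("A010", 4500), ("A002", 500), ("A010", 2000), ("A002", 3500), ("A008", 1000)]

for i, part in enumerate(distribute_and_sort(rows, key=lambda r: r[0], sort_key=lambda r: r[1])):
    print(i, part)
```

All rows for one agent land in the same reducer and each reducer's output is ordered by amount, but concatenating the reducers does not give a globally sorted result, which is the trade-off against order by.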
