Incremental Data Push with MaxCompute (Deriving the Delta by Comparing Full Snapshots)

Alibaba Cloud Yunqi (阿里云云栖号), 11-22

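ODPS 2.0 supports a number of new set operators (private-cloud deployments gain them progressively after upgrading to version 3), which simplify the set operations that come up in day-to-day work. The added SQL syntax covers UNION ALL and UNION DISTINCT (union), INTERSECT ALL and INTERSECT DISTINCT (intersection), and EXCEPT ALL and EXCEPT DISTINCT (difference). The syntax is as follows: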

select_statement UNION ALL select_statement;
select_statement UNION [DISTINCT] select_statement;
select_statement INTERSECT ALL select_statement;
select_statement INTERSECT [DISTINCT] select_statement;
select_statement EXCEPT ALL select_statement;
select_statement EXCEPT [DISTINCT] select_statement;
select_statement MINUS ALL select_statement;
select_statement MINUS [DISTINCT] select_statement;

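Purpose: compute the union of two datasets, their intersection, or the complement of the second dataset within the first. Parameter description: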

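UNION: returns the union of two datasets, that is, merges the two datasets into one.
INTERSECT: returns the intersection of two datasets, that is, the records contained in both.
EXCEPT: returns the complement of the second dataset within the first, that is, the records the first dataset contains and the second does not.
MINUS: equivalent to EXCEPT.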
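The difference between the ALL and DISTINCT variants is easy to miss, so here is a minimal Python sketch that models each operator on small in-memory lists. It assumes the usual SQL multiset semantics (ALL keeps duplicate counts, DISTINCT drops them); the function names are illustrative and are not part of MaxCompute.

```python
from collections import Counter

def union_all(a, b):
    # UNION ALL: concatenate, keeping every duplicate.
    return list(a) + list(b)

def union_distinct(a, b):
    # UNION [DISTINCT]: concatenate, then drop duplicates.
    return list(dict.fromkeys(list(a) + list(b)))

def intersect_all(a, b):
    # INTERSECT ALL: a row appearing m times in a and n times in b
    # is returned min(m, n) times.
    return list((Counter(a) & Counter(b)).elements())

def intersect_distinct(a, b):
    # INTERSECT [DISTINCT]: each common row is returned once.
    return list((Counter(a) & Counter(b)).keys())

def except_all(a, b):
    # EXCEPT ALL: a row appearing m times in a and n times in b
    # is returned max(m - n, 0) times.
    return list((Counter(a) - Counter(b)).elements())

def except_distinct(a, b):
    # EXCEPT [DISTINCT]: rows of a that do not occur in b at all, once each.
    seen_in_b = set(b)
    return [row for row in dict.fromkeys(a) if row not in seen_in_b]

first = [1, 1, 2, 3]
second = [1, 3, 3, 4]
print(sorted(except_all(first, second)))       # [1, 2]
print(sorted(except_distinct(first, second)))  # [2]
```

Results are sorted before printing because SQL set operators give no ordering guarantee.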

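A real project required deriving the daily delta by comparing two days of full snapshots (pushing the full dataset is slow; for products such as ADB and DRDS, incremental sync is recommended once the data volume exceeds 100 million rows). I compared the older JOIN approach with the new set-operator approach, trying out the new EXCEPT ALL operator.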

Test

-- Method 1: JOIN
-- other_columns stands for many columns
create table tmp_opcode1 as
select * from
(
  select uuid, other_columns, opcode2
  from
  (
    -- rows added today + rows changed today
    select t1.uuid, t1.other_columns,
           case when t2.uuid is null then 'I' else 'U' end as opcode2
      from prject1.table1 t1
      left outer join prject1.table1 t2
        on t1.uuid = t2.uuid
       and t2.dt = '20200730'
     where t1.dt = '20200731'
       and (t2.uuid is null
        or coalesce(t1.other_columns, '') <> coalesce(t2.other_columns, ''))
    union all
    -- rows deleted today
    select t2.uuid, t2.other_columns, 'D' as opcode2
      from prject1.table1 t2
      left outer join prject1.table1 t1
        on t1.uuid = t2.uuid
       and t1.dt = '20200731'
     where t2.dt = '20200730'
       and t1.uuid is null
  ) t3
) t4;

Summary:
resource cost: cpu 13.37 Core * Min, memory 30.48 GB * Min
inputs:
  prject1.table1/dt=20200730: 32530802 (946172216 bytes)
  prject1.table1/dt=20200731: 32533538 (947161664 bytes)
outputs:
  prject1.tmp_opcode1: 4506 (271632 bytes)
Job run time: 26.000
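To make the logic of Method 1 easier to follow, here is a minimal in-memory Python sketch of the same key-based comparison. It assumes uuid is unique within each day's snapshot and that other_columns can be represented as a single value; the function name diff_by_key and the sample data are hypothetical.

```python
def diff_by_key(today, yesterday):
    """Compare two snapshots given as dicts mapping uuid -> other_columns.

    Returns (uuid, other_columns, opcode) tuples, where opcode is
    'I' (inserted), 'U' (updated) or 'D' (deleted).
    """
    def norm(value):
        # Mirrors coalesce(col, '') in the SQL: NULL and '' compare as equal.
        return '' if value is None else value

    delta = []
    # Left join today -> yesterday: no match means insert, a match with
    # different column values means update.
    for uuid, cols in today.items():
        if uuid not in yesterday:
            delta.append((uuid, cols, 'I'))
        elif norm(cols) != norm(yesterday[uuid]):
            delta.append((uuid, cols, 'U'))
    # Left join yesterday -> today: no match means delete.
    for uuid, cols in yesterday.items():
        if uuid not in today:
            delta.append((uuid, cols, 'D'))
    return delta

yesterday = {'a': 'x', 'b': 'y', 'c': None, 'd': 'z'}
today = {'a': 'x', 'b': 'y2', 'c': '', 'e': 'new'}
print(diff_by_key(today, yesterday))
# [('b', 'y2', 'U'), ('e', 'new', 'I'), ('d', 'z', 'D')]
```

Row 'c' is not reported because the comparison, like the coalesce in the query, treats NULL and the empty string as the same value.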

-- Method 2: set operators
-- other_columns stands for many columns
create table tmp_opcode2 as
select * from
(
  select t3.*
  from
  (
    -- rows added today + rows changed today (both tagged 'I')
    select uuid, other_columns, 'I' as opcode2
    from
    (
      select uuid, other_columns
        from prject1.table1
       where dt = '20200731'
      except all
      select uuid, other_columns
        from prject1.table1
       where dt = '20200730'
    ) t
    union all
    -- rows deleted today
    select t2.uuid, t2.other_columns, 'D' as opcode2
      from prject1.table1 t2
      left outer join prject1.table1 t1
        on t1.uuid = t2.uuid
       and t1.dt = '20200731'
     where t2.dt = '20200730'
       and t1.uuid is null
  ) t3
) t4;

Summary:
resource cost: cpu 35.92 Core * Min, memory 74.26 GB * Min
inputs:
  prject1.table1/rfq=20200730: 32530802 (946172216 bytes)
  prject1.table1/rfq=20200731: 32533538 (947161664 bytes)
outputs:
  prject1.tmp_opcode2: 4506 (259416 bytes)
Job run time: 66.000
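The set-based query can be modeled the same way. The sketch below is an illustration under assumptions, not the MaxCompute implementation: rows are (uuid, other_columns) tuples, EXCEPT ALL is modeled with multiset subtraction, and the function name diff_by_except_all is hypothetical. It shows that whole-row subtraction finds new and changed rows together, which is why the query tags both as 'I', and that deletions still need a key-based anti-join.

```python
from collections import Counter

def diff_by_except_all(today_rows, yesterday_rows):
    """today_rows / yesterday_rows: lists of (uuid, other_columns) tuples."""
    # today EXCEPT ALL yesterday: rows present today with no identical
    # row yesterday, i.e. brand-new rows plus rows whose columns changed.
    new_or_changed = (Counter(today_rows) - Counter(yesterday_rows)).elements()
    delta = [(uuid, cols, 'I') for uuid, cols in new_or_changed]

    # Deletions: yesterday's uuids that no longer appear today (anti-join on key).
    today_keys = {uuid for uuid, _ in today_rows}
    delta += [(uuid, cols, 'D') for uuid, cols in yesterday_rows
              if uuid not in today_keys]
    return delta

yesterday_rows = [('a', 'x'), ('b', 'y'), ('d', 'z')]
today_rows = [('a', 'x'), ('b', 'y2'), ('e', 'new')]
print(diff_by_except_all(today_rows, yesterday_rows))
# [('b', 'y2', 'I'), ('e', 'new', 'I'), ('d', 'z', 'D')]
```

The same rows are identified as in the key-based comparison, but the changed row 'b' carries opcode 'I' rather than 'U'.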

Performance

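Compared with the JOIN method, the set-based method is clearly worse on both resources and time: per the job summaries it used about 2.7x the CPU (35.92 vs 13.37 Core * Min), about 2.4x the memory (74.26 vs 30.48 GB * Min), and about 2.5x the run time (66.000 vs 26.000). The JOIN method is recommended in practice.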
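For reference, the cost ratios between the two runs can be recomputed directly from the figures in the two job summaries. This is a small arithmetic sketch; the dictionary keys are illustrative names, and the numbers are copied from the logs above.

```python
# Figures copied from the two job summaries.
join_run = {"cpu_core_min": 13.37, "memory_gb_min": 30.48, "run_time": 26.0}
set_run = {"cpu_core_min": 35.92, "memory_gb_min": 74.26, "run_time": 66.0}

# Ratio of the set-based run to the JOIN run for each metric.
ratios = {metric: round(set_run[metric] / join_run[metric], 2)
          for metric in join_run}

print(ratios)
# {'cpu_core_min': 2.69, 'memory_gb_min': 2.44, 'run_time': 2.54}
```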

Results

After cross-checking in several ways, both methods identify the delta correctly, so the incremental data can be delivered downstream.

This article is original Alibaba Cloud content and may not be reproduced without permission.

Article URL: http://www.longkongtuishu.com/cabf1BmsBBVQDD1A.html
