前言:
如今我们对“数据比对sql怎么写”大约比较重视,咱们都需要分析一些“数据比对sql怎么写”的相关内容。那么小编也在网络上搜集了一些关于“数据比对sql怎么写””的相关内容,希望大家能喜欢,看官们快快来了解一下吧!数据分析第一步,一定是先从数据库提取数据做基本的描述统计分析,整体查看做到心中有数,而80%的描述统计需要用SQL尝试取数后,或放到Excel或其他Tableau等工具,待基本了解数据整体现状之后,才会上模型上算法,去进一步挖掘数据价值。
不要忽略这第一步,实际工作中,它是极其高频的使用场景,80%的数据需求依赖于SQL解决,不会SQL的数据分析师, 只能站在数据仓库的门口望洋兴叹,因为自己手里没有"钥匙",没法进去盘点库存。
今儿大白话说《如何用SQL做数据分析》
1. 目的
用最简单的例子,达到80%的效果:
通过一个实例数据库,数据足够简单, 粗略演示下最基本的数据探查过程。真实的工作环境,数据环境可能会复杂些,我们一步一步来,以后会写更多文章来补充。
这篇文章,希望你能get到下面几点
常用SQL练手数据探查思路隐藏价值提取2. 数据网上共享Mysql实例数据库: employees, 以及安装方法
如果不想花时间搭建环境,我已经在云端搭建了数据学习环境,按照下面文章中第三部分“学习环境”的说明,申请远程访问权限:
2019年,数据民工的大白话规划
数据库表结构
mysql> show tables;+---------------------+| Tables_in_employees |+---------------------+| departments || dept_emp || dept_manager || employees || salaries || titles |+---------------------+# 部门表, 记录部门编号和部门名称CREATE TABLE `departments` ( `dept_no` char(4) NOT NULL, `dept_name` varchar(40) NOT NULL, PRIMARY KEY (`dept_no`), UNIQUE KEY `dept_name` (`dept_name`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4# 部门员工表,记录各部门下属员工编号, 以及员工在该部门工作的起始和终止日期CREATE TABLE `dept_emp` ( `emp_no` int(11) NOT NULL, `dept_no` char(4) NOT NULL, `from_date` date NOT NULL, `to_date` date NOT NULL, PRIMARY KEY (`emp_no`,`dept_no`), KEY `emp_no` (`emp_no`), KEY `dept_no` (`dept_no`), CONSTRAINT `dept_emp_ibfk_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE, CONSTRAINT `dept_emp_ibfk_2` FOREIGN KEY (`dept_no`) REFERENCES `departments` (`dept_no`) ON DELETE CASCADE) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4# 部门经理表, 记录对应的部门经理的员工编号, 以及在该部门工作的起始和终止日期CREATE TABLE `dept_manager` ( `dept_no` char(4) NOT NULL, `emp_no` int(11) NOT NULL, `from_date` date NOT NULL, `to_date` date NOT NULL, PRIMARY KEY (`emp_no`,`dept_no`), KEY `emp_no` (`emp_no`), KEY `dept_no` (`dept_no`), CONSTRAINT `dept_manager_ibfk_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE, CONSTRAINT `dept_manager_ibfk_2` FOREIGN KEY (`dept_no`) REFERENCES `departments` (`dept_no`) ON DELETE CASCADE) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4# 员工信息表, 记录员工初始日期, 姓名, 性别和雇佣日期信息CREATE TABLE `employees` ( `emp_no` int(11) NOT NULL, `birth_date` date NOT NULL, `first_name` varchar(14) NOT NULL, `last_name` varchar(16) NOT NULL, `gender` enum('M','F') NOT NULL, `hire_date` date NOT NULL, PRIMARY KEY (`emp_no`)) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4# 薪水表, 记录员工某段日期的薪水CREATE TABLE `salaries` ( `emp_no` int(11) NOT NULL, `salary` int(11) NOT NULL, `from_date` date NOT NULL, `to_date` date NOT NULL, PRIMARY KEY (`emp_no`,`from_date`), KEY `emp_no` (`emp_no`), CONSTRAINT `salaries_ibfk_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4# 岗位表, 记录员工的岗位信息, 即该岗位任职起始和终止日期信息CREATE TABLE `titles` ( `emp_no` int(11) NOT NULL, `title` varchar(50) NOT NULL, `from_date` date NOT NULL, `to_date` date DEFAULT NULL, PRIMARY KEY (`emp_no`,`title`,`from_date`), KEY `emp_no` (`emp_no`), CONSTRAINT `titles_ibfk_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE) ENGINE=InnoDB DEFAULT CHARSET=utf8mb43. 好奇心
几张简单的表, 放在你面前, 你会好奇那些问题? 我先抛砖:
数据能否反应公司发展趋势?
公司各年工资支出公司各年入职和离职数量
数据能否反应公司对男女员工价值观?
公司男女比例公司年龄分布
数据能否反应那种岗位涨薪最快?
不同岗位, 每年收入增长有岗位升迁的员工, 每年收入增长4. 满足好奇心
针对上面的好奇心,你可以先尝试自己完成数据提取和分析, 然后对比下面的思路
公司发展趋势
公司各年工资支出 :线性递增
mysql> select year(from_date), sum(salary) as cost from salaries group by year(from_date);+-----------------+-------------+| year(from_date) | cost |+-----------------+-------------+| 1985 | 972864875 || 1986 | 2052895941 || 1987 | 3156881054 || 1988 | 4295598688 || 1989 | 5454260439 || 1990 | 6626146391 || 1991 | 7798804412 || 1992 | 9027872610 || 1993 | 10215059054 || 1994 | 11429450113 || 1995 | 12638817464 || 1996 | 13888587737 || 1997 | 15056011781 || 1998 | 16220495471 || 1999 | 17360258862 || 2000 | 17535667603 || 2001 | 17507737308 || 2002 | 10243347616 |+-----------------+-------------+18 rows in set (1.86 sec)
上面数据,放到Excel表里,很容易生成一个柱状图:
公司各年入职和离职数量: 入大于出,但离职率在逐年递增
# 每年入职mysql> select year(tt.from_date) as from_date, count(*) as num from (select emp_no, min(from_date) as from_date, max(to_date) as to_date from dept_emp group by emp_no) tt group by year(tt.from_date) order by from_date;+-----------+-------+| from_date | num |+-----------+-------+| 1985 | 18293 || 1986 | 19878 || 1987 | 19987 || 1988 | 20103 || 1989 | 20310 || 1990 | 20042 || 1991 | 19817 || 1992 | 20203 || 1993 | 19914 || 1994 | 19894 || 1995 | 19875 || 1996 | 20176 || 1997 | 20120 || 1998 | 19828 || 1999 | 19848 || 2000 | 1736 |+-----------+-------+16 rows in set (0.71 sec)# 每年离职mysql> select year(tt.to_date) as to_date, count(*) as num from (select emp_no, min(from_date) as from_date, max(to_date) as to_date from dept_emp group by emp_no) tt group by year(tt.to_date) order by to_date;+---------+--------+| to_date | num |+---------+--------+| 1985 | 89 || 1986 | 374 || 1987 | 668 || 1988 | 905 || 1989 | 1288 || 1990 | 1594 || 1991 | 1883 || 1992 | 2317 || 1993 | 2742 || 1994 | 3135 || 1995 | 3739 || 1996 | 4330 || 1997 | 5069 || 1998 | 5941 || 1999 | 6964 || 2000 | 7610 || 2001 | 7241 || 2002 | 4011 || 9999 | 240124 | ====> 表示当前在职+---------+--------+19 rows in set (0.57 sec)
通过上面数据,尝试计算下面两个指标:
每年在职员工数量每年离职率
是不是很容易,下面的数据,自己尝试画下图?
当然,具体绘图不是重点,重点是数据反应背后的现状
图只是让数据显得直观,更容易发现数据的特征
员工价值观
公司男女和年龄交叉分布
mysql> select gender, round((to_days('2002-08-01') - to_days(birth_date))/365,0) as ages, count(*) from (select emp_no, birth_date, gender from employees) t1 join (select emp_no from titles where to_date='9999-01-01') t2 on t1.emp_no=t2.emp_no group by gender, round((to_days('2002-08-01') - to_days(birth_date))/365,0);+--------+------+----------+| gender | ages | count(*) |+--------+------+----------+| M | 38 | 10846 || M | 39 | 11064 || M | 40 | 11135 || M | 41 | 11038 || M | 42 | 10891 || M | 43 | 11261 || M | 44 | 11071 || M | 45 | 11104 || M | 46 | 11033 || M | 47 | 11122 || M | 48 | 11046 || M | 49 | 11051 || M | 50 | 11105 || M | 51 | 347 || F | 38 | 7139 || F | 39 | 7311 || F | 40 | 7329 || F | 41 | 7381 || F | 42 | 7513 || F | 43 | 7465 || F | 44 | 7469 || F | 45 | 7149 || F | 46 | 7342 || F | 47 | 7428 || F | 48 | 7515 || F | 49 | 7325 || F | 50 | 7415 || F | 51 | 229 |+--------+------+----------+28 rows in set (0.86 sec)
岗位涨薪最快
总体每年收入增长
mysql> select round((max-min)/min/(years-1),2) as increaceByYear, count(*) as num from (select emp_no, count(*) as years, max(salary) as max, min(salary) as min from salaries group by emp_no having years>1) tt group by round((max-min)/min/(years-1),2) order by increaceByYear;+----------------+-------+| increaceByYear | num |+----------------+-------+| 0.00 | 1705 || 0.01 | 7925 || 0.02 | 35345 || 0.03 | 68993 || 0.04 | 72082 || 0.05 | 56231 || 0.06 | 31329 || 0.07 | 11807 || 0.08 | 3868 || 0.09 | 1502 || 0.10 | 605 || 0.11 | 223 |+----------------+-------+12 rows in set (1.13 sec)
有岗位升迁的员工, 每年收入增长
mysql> select round((max-min)/min/(years-1),2) as increaceByYear, count(*) as num from ( select t1.emp_no, max(salary) as max, min(salary) as min, count(*) years from (select emp_no, salary from salaries) t1 join (select emp_no, count(*) as tt from titles group by emp_no having tt>1) t2 on t1.emp_no=t2.emp_no group by t1.emp_no having years>1 ) tt group by round((max-min)/min/(years-1),2) order by increaceByYear;+----------------+-------+| increaceByYear | num |+----------------+-------+| 0.00 | 22 || 0.01 | 1295 || 0.02 | 16156 || 0.03 | 35924 || 0.04 | 37833 || 0.05 | 29389 || 0.06 | 14890 || 0.07 | 4040 || 0.08 | 645 || 0.09 | 74 || 0.10 | 2 |+----------------+-------+11 rows in set (1.91 sec)
各部门员工分布
mysql> select dept_no, count(*) from dept_emp group by dept_no;+---------+----------+| dept_no | count(*) |+---------+----------+| d001 | 20211 || d002 | 17346 || d003 | 17786 || d004 | 73485 || d005 | 85707 || d006 | 20117 || d007 | 52245 || d008 | 21126 || d009 | 23580 |+---------+----------+9 rows in set (0.32 sec)
其他:
员工忠诚度
在岗时间在岗时间与岗位\收入相关性