ICode9

精准搜索请尝试: 精确搜索
首页 > 其他分享> 文章详细

知行教育_访问咨询主题-增量采集

2021-06-21 11:05:15  阅读:166  来源: 互联网

标签:dayinfo quarterinfo itcast 采集 time monthinfo 访问 yearinfo 增量


文章目录

访问咨询主题-增量采集

注意点:
业务系统中,数据表的后缀是 年_月 ,这就表明随着时间推移,被
采集的表的后缀是动态变化的。
我们的脚本也要做到这一点
这个功能要做成自动化的脚本,脚本能够每天定时执行

增量采集,一天执行一次
当天采集昨日数据

业务数据库 - > ODS的流程

业务系统中,数据表的后缀是 年_月 ,这就表明随着时间推移,被
采集的表的后缀是动态变化的。
我们的增量采集是按天来的,采集的数据能够带上按天过滤的条件

#!/bin/bash

# 需求:能够手动传入参数指定采集的日期。  如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
    DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
    DATESTR=$1
fi

# Define var
SQOOP_HOME=/export/servers/sqoop
JDBCSTR="jdbc:mysql://192.168.10.10:3306/nev"
MYSQL_USERNAME=root
YEARSTR=`date -d "-1 day" +%Y`
MONTHSTR=`date -d "-1 day" +%m`
HIVE_DB=itcast_ods
MAP_NUMBER=3


echo "执行采集web_chat_ems_${YEARSTR}_${MONTHSTR}的任务......"

$SQOOP_HOME import \
--connect $JDBCSTR \
--username $MYSQL_USERNAME --password $MYSQL_PASSWORD \
--query "SELECT
  id,create_date_time,session_id,sid,create_time,
  seo_source,seo_keywords,ip,area,country,province,
  city,origin_channel,user AS user_match,manual_time,
  begin_time,end_time,last_customer_msg_time_stamp,
  last_agent_msg_time_stamp,reply_msg_count,msg_count,
  browser_name,os_info, CURRENT_DATE() AS start_time
FROM  web_chat_ems_${YEARSTR}_${MONTHSTR}
WHERE create_time >= '${DATESTR} 00:00:00' AND create_time <= '${DATESTR} 23:59:59' and \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database $HIVE_DB \
--hcatalog-table web_chat_ems \
-m $MAP_NUMBER \
--split-by id



wait

echo "执行采集web_chat_text_ems_${YEARSTR}_${MONTHSTR}的任务......"
$SQOOP_HOME import \
--connect $JDBCSTR \
--username $MYSQL_USERNAME --password $MYSQL_PASSWORD \
--query "SELECT
id,referrer,from_url,landing_page_url,
url_title,platform_description,
other_params,history, CURRENT_DATE() AS start_time
FROM web_chat_text_ems_${YEARSTR}_${MONTHSTR}
WHERE 1=1 AND \$CONDITIONS" \
--fields-terminated-by '\t' \
--hcatalog-database $HIVE_DB \
--hcatalog-table web_chat_text_ems \
-m $MAP_NUMBER \
--split-by id

配置ODS -> DWD的流程

# 配置从ODS层到DWD层的脚本

# 需求:能够手动传入参数指定采集的日期。  如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
    DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
    DATESTR=$1
fi

HIVE_HOME="/usr/bin/hive"
YEARSTR=`date -d "${DATESTR}" +%Y`
MONTHSTR=`date -d "${DATESTR}" +%m`
DAYSTR=`date -d "${DATESTR}" +%d`

# 计算季度
if [ ${MONTHSTR} -ge 1 ] && [ ${MONTHSTR} -le 3 ]
then
    QUARTERSTR=1
elif [ ${MONTHSTR} -ge 4 ] && [ ${MONTHSTR} -le 6 ]
then
    QUARTERSTR=2
elif [ ${MONTHSTR} -ge 7 ] && [ ${MONTHSTR} -le 9 ]
then
    QUARTERSTR=3
else
    QUARTERSTR=4
fi


echo "即将执行:${DATESTR} 这一天的ODS -> DWD的过程......"

$HIVE_HOME -e "
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
INSERT INTO TABLE itcast_dwd.visit_consult_dwd partition(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    w1.session_id,
    w1.sid,
    UNIX_TIMESTAMP(w1.create_time, 'yyyy-MM-dd HH:mm:ss') AS create_time,
    w1.seo_source,
    w1.ip,
    w1.area,
    w1.country,
    w1.province,
    w1.city,
    CAST(IF (w1.msg_count IS NULL, 0, w1.msg_count) AS INT) AS msg_count,
    w1.origin_channel,
    w2.referrer,
    w2.from_url,
    w2.landing_page_url,
    w2.url_title,
    w2.platform_description,
    w2.other_params,
    w2.history,
    SUBSTRING(w1.create_time, 12, 2) AS hourinfo,
    SUBSTRING(w1.create_time, 1, 4) AS yearinfo,
    QUARTER(w1.create_time) AS quarterinfo,
    SUBSTRING(w1.create_time, 6, 2) AS monthinfo,
    SUBSTRING(w1.create_time, 9, 2) AS dayinfo
FROM (SELECT * FROM itcast_ods.web_chat_ems WHERE start_time='${DATESTR}') AS w1 INNER JOIN itcast_ods.web_chat_text_ems AS w2
ON w1.id = w2.id;"

DWD -> DWS 流程

问题:分析针对增量数据的不同做法以及优缺点

我们DWS中是包含了年、季度、月等维度的结果数据

由于新增了一天的数据,当前年、当前季度、当前月的结果数据就失效了

需要重新计算

问题在于:DWS表用的失效数据如何处理。

问题解决方式

方式1

删除过期数据

优点:

  • 对BI友好,没有历史数据的混淆,直接取就是最新的
  • 表中的数据清晰

缺点:

  • 执行复杂
  • 破坏了数仓设计中尽量不删除的原则

方式2:

增加新列,表名当前数据计算的时间

用的时候,取最新时间的即可

优点:

  • 表中存储了历史结果的变更
  • 不会执行删除,不破坏数仓的原则

缺点:

  • 对BI分析来说,需要先过滤最新的数据(稍微不友好)
  • 修改表结构(全量操作要重来)

方式3:

增加新表

每一天的结果都生成一个表(一天一个表)

优点:

  • 每个表都是清晰的,指向了具体某天的结果
  • 通过多个表的方式也记录的历史结果的变更

缺点:

  • 数据冗余过多(只要业务需要,冗余不是事)
  • 对BI不友好(换一天,换一个表,如果BI不支持动态规则配置自动变更表,就要手动变更)

这三种方式,没有绝对的谁好谁坏

合适的需求下,选择合适的方式即可。

过期数据的删除

# 删除过期年的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='-1', monthinfo='-1', dayinfo='-1');
# 删除过期季度的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='当前季度', monthinfo='-1', dayinfo='-1');
# 删除过期月的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='当前年', quarterinfo='当前季度', monthinfo='当前月', dayinfo='-1');

示例脚本

# DWD层数据聚合到DWS层的操作脚本(增量)



# 需求:能够手动传入参数指定采集的日期。  如果不传入参数要求能够自动的识别前一天的日期
if [ $# -eq 0 ]
then
    DATESTR=`date -d "-1 day" +%Y-%m-%d`
else
    DATESTR=$1
fi

HIVE_HOME="/usr/bin/hive"
YEARSTR=`date -d "${DATESTR}" +%Y`
MONTHSTR=`date -d "${DATESTR}" +%m`
DAYSTR=`date -d "${DATESTR}" +%d`

# 计算季度
if [ ${MONTHSTR} -ge 1 ] && [ ${MONTHSTR} -le 3 ]
then
    QUARTERSTR=1
elif [ ${MONTHSTR} -ge 4 ] && [ ${MONTHSTR} -le 6 ]
then
    QUARTERSTR=2
elif [ ${MONTHSTR} -ge 7 ] && [ ${MONTHSTR} -le 9 ]
then
    QUARTERSTR=3
else
    QUARTERSTR=4
fi


echo "先删除过期的${YEARSTR}年, ${QUARTERSTR}季度, ${MONTHSTR}月数据"
$HIVE_HOME -e "
# 删除过期年的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='-1', monthinfo='-1', dayinfo='-1');
# 删除过期季度的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='${QUARTERSTR}', monthinfo='-1', dayinfo='-1');
# 删除过期月的数据
ALTER TABLE itcast_dws.visit_dws DROP PARTITION(yearinfo='${YEARSTR}', quarterinfo='${QUARTERSTR}', monthinfo='${MONTHSTR}', dayinfo='-1');"


echo "执行新数据的计算,计算的日期是:${DATESTR}"
$HIVE_HOME -e "
/*
指标:访问量
维度:时间(年、季度、月、天、小时)、来源渠道、受访页面、搜索渠道、区域维度
*/
-- 时间维度表开发
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计某年的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    yearinfo AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '5' AS time_type,
    yearinfo,
    '-1' AS quarterinfo,
    '-1' AS monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}'
GROUP BY yearinfo;
-- 统计某年某季度的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-Q', quarterinfo) AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '4' AS time_type,
    yearinfo,
    quarterinfo,
    '-1' AS monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}'
GROUP BY yearinfo, quarterinfo;
-- 统计某年某季度某月的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-', monthinfo) AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '3' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo;
-- 统计某年某季度某月某日的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-', monthinfo, '-', dayinfo) AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '2' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo;
-- 统计某年某季度某月某日某小时的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    '-1' AS country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    hourinfo,
    CONCAT(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) AS time_str,
    '-1' AS  from_url,
    '5' AS groupType,
    '1' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo;




-- 按照时间和其它维度汇合进行聚合
-- 按照时间和国家进行组合计算
-- 动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
-- 写入时ORC压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计某年某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    yearinfo AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '5' AS time_type,
    yearinfo,
    '-1' AS quarterinfo,
    '-1' AS monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}'
GROUP BY yearinfo,country;
-- 统计某年某季度某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-Q', quarterinfo) AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '4' AS time_type,
    yearinfo,
    quarterinfo,
    '-1' AS monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}'
GROUP BY yearinfo, quarterinfo, country;
-- 统计某年某季度某月某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-', monthinfo) AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '3' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    '-1' AS dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, country;
-- 统计某年某季度某月某天某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    '-1' AS hourinfo,
    CONCAT(yearinfo, '-', monthinfo, '-', dayinfo) AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '2' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, country;
-- 统计某年某季度某月某天某小时某国家的访问量
INSERT INTO TABLE itcast_dws.visit_dws PARTITION(yearinfo, quarterinfo, monthinfo, dayinfo)
SELECT
    COUNT(DISTINCT sid) AS sid_total,
    COUNT(DISTINCT session_id) AS sessionid_total,
    COUNT(DISTINCT ip) AS ip_total,
    country,
    '-1' AS province,
    '-1' AS city,
    '-1' AS seo_source,
    '-1' AS origin_channel,
    hourinfo,
    CONCAT(yearinfo, '-', monthinfo, '-', dayinfo, ' ', hourinfo) AS time_str,
    '-1' AS  from_url,
    '1' AS groupType,
    '1' AS time_type,
    yearinfo,
    quarterinfo,
    monthinfo,
    dayinfo
FROM itcast_dwd.visit_consult_dwd
WHERE yearinfo = '${YEARSTR}' AND quarterinfo = '${QUARTERSTR}' AND monthinfo = '${MONTHSTR}' AND dayinfo = '${DAYSTR}'
GROUP BY yearinfo, quarterinfo, monthinfo, dayinfo, hourinfo, country;
"

DWS -> MySQL

如果DWS用的是方式1,MySQL中也要删除过期

导出脚本执行之前需要在MySQL中删除即可

mysql -uroot -p123456 scrm_bi -e "
# 删除过期年
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='-1' AND monthinfo='-1' AND dayinfo='-1';
# 删除过期季度
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='${QUARTERSTR}' AND monthinfo='-1' AND dayinfo='-1';
# 删除过期月
delete from itcast_visit WHERE yearinfo='${YEARSTR}' AND quarterinfo='${QUARTERSTR}' AND monthinfo='${MONTHSTR}' AND dayinfo='-1';
"

sqoop export \
--connect "jdbc:mysql://192.168.52.150:3306/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root --password 123456 \
--table itcast_visit \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
--hcatalog-partition-keys yearinfo \
--hcatalog-partition-values ${YEARSTR} \
-m 1

标签:dayinfo,quarterinfo,itcast,采集,time,monthinfo,访问,yearinfo,增量
来源: https://blog.51cto.com/u_15277063/2929914

本站声明: 1. iCode9 技术分享网(下文简称本站)提供的所有内容,仅供技术学习、探讨和分享;
2. 关于本站的所有留言、评论、转载及引用,纯属内容发起人的个人观点,与本站观点和立场无关;
3. 关于本站的所有言论和文字,纯属内容发起人的个人观点,与本站观点和立场无关;
4. 本站文章均是网友提供,不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属;如您发现该文章侵犯了您的权益,可联系我们第一时间进行删除;
5. 本站为非盈利性的个人网站,所有内容不会用来进行牟利,也不会利用任何形式的广告来间接获益,纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

专注分享技术,共同学习,共同进步。侵权联系[81616952@qq.com]

Copyright (C)ICode9.com, All Rights Reserved.

ICode9版权所有