Finding N consecutive login days with SQL
Building the test data
create table dwd.login_log as
select 1 as user_id, "2020-01-01" as login_date
union all
select 1 as user_id, "2020-01-02" as login_date
union all
select 1 as user_id, "2020-01-07" as login_date
union all
select 1 as user_id, "2020-01-08" as login_date
union all
select 1 as user_id, "2020-01-09" as login_date
union all
select 1 as user_id, "2020-01-10" as login_date
union all
select 2 as user_id, "2020-01-01" as login_date
union all
select 2 as user_id, "2020-01-02" as login_date
union all
select 2 as user_id, "2020-01-04" as login_date
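If the raw dates are not in a standard format, build the table from the raw data instead and convert them to the standard format:
create table dwd.login_log as
select user_id,to_date(from_unixtime(UNIX_TIMESTAMP(login_date,'yyyy-MM-dd'))) as login_date
from tmp.login_log; -- the tmp database holds the raw data; the pattern must match the raw date format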
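1. Using the lag & lead window functions with datediff
- For three consecutive days, fetch the previous row and the next row for each login, then require now - lag = lead - now = 1.
- For longer streaks, fetch more neighbouring rows, or use only lag (or only lead) with larger offsets.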
datediff(date1, date2) - Returns the number of days between date1 and date2
-- assumes at most one row per user per day
select user_id
from
  (select user_id
   from
     (select user_id,
             lag(login_date,1) over(partition by user_id order by login_date) as lag_login_date,
             login_date,
             lead(login_date,1) over(partition by user_id order by login_date) as lead_login_date
      from dwd.login_log) t1
   where datediff(login_date,lag_login_date)=1 and datediff(lead_login_date,login_date)=1) t2
group by user_id;
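The lag/lead condition can also be checked outside Hive. The following is a minimal sketch in plain Scala, assuming the same nine rows as the test table; the names `LagLeadSketch`, `sampleLogins`, `dateDiff`, `hasThreeDayStreak` and `usersWithThreeDayStreak` are made up for this illustration and are not part of the original post. A sliding window of three dates plays the role of (lag, now, lead).

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Illustrative sketch only: all names here are hypothetical.
object LagLeadSketch {
  // Assumption: same rows as dwd.login_log, as (user_id, login_date)
  val sampleLogins: Seq[(Int, String)] = Seq(
    (1, "2020-01-01"), (1, "2020-01-02"), (1, "2020-01-07"),
    (1, "2020-01-08"), (1, "2020-01-09"), (1, "2020-01-10"),
    (2, "2020-01-01"), (2, "2020-01-02"), (2, "2020-01-04")
  )

  // Mirrors datediff(end, start): whole days from start to end
  def dateDiff(end: LocalDate, start: LocalDate): Long =
    ChronoUnit.DAYS.between(start, end)

  // True if some date has a previous and a next login exactly one day away.
  // distinct guards against several logins on the same day.
  def hasThreeDayStreak(dates: Seq[LocalDate]): Boolean =
    dates.distinct.sortBy(_.toEpochDay).sliding(3).exists {
      case Seq(lag, now, lead) =>
        dateDiff(now, lag) == 1 && dateDiff(lead, now) == 1
      case _ => false // fewer than three dates: no (lag, now, lead) triple
    }

  // Equivalent of the outer "group by user_id": one entry per qualifying user
  def usersWithThreeDayStreak(logins: Seq[(Int, String)]): Set[Int] =
    logins
      .groupBy(_._1)
      .filter { case (_, rows) => hasThreeDayStreak(rows.map(r => LocalDate.parse(r._2))) }
      .keySet

  def main(args: Array[String]): Unit =
    println(usersWithThreeDayStreak(sampleLogins)) // prints Set(1)
}
```

Only user 1 qualifies, through 2020-01-07, 2020-01-08 and 2020-01-09; user 2 never has two neighbours that are each one day away.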
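2. Using the date_add function
- The general approach: rank each user's logins by date, subtract the rank (in days) from the login date, and count how many rows share the same result.
- The advantage is that it reports how many days each streak lasted and which logins belong to it.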
date_add(start_date, num_days) - Returns the date that is num_days after start_date
-- assumes at most one row per user per day; otherwise rank() ties distort the counts
select user_id,con_login_date,count(*) nums
from
  (select user_id,login_date,rk,date_add(login_date,1 - rk) as con_login_date
   from
     (select user_id,login_date,rank() over(partition by user_id order by login_date) rk
      from dwd.login_log) t1
  ) t2
group by user_id,con_login_date
having count(*) >= 3;
- Result of subquery t1
user_id | login_date | rk (rank by login date within the user) |
---|---|---|
1 | 2020-01-01 | 1 |
1 | 2020-01-02 | 2 |
1 | 2020-01-07 | 3 |
1 | 2020-01-08 | 4 |
1 | 2020-01-09 | 5 |
1 | 2020-01-10 | 6 |
2 | 2020-01-01 | 1 |
2 | 2020-01-02 | 2 |
2 | 2020-01-04 | 3 |
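- Result of subquery t2. The offset used for the normalized date (1 - rk above) is arbitrary: any constant minus rk works, because only equality within a streak matters.
user_id | login_date | con_login_date (normalized streak date) |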
---|---|---|
1 | 2020-01-01 | 2020-01-01 |
1 | 2020-01-02 | 2020-01-01 |
1 | 2020-01-07 | 2020-01-05 |
1 | 2020-01-08 | 2020-01-05 |
1 | 2020-01-09 | 2020-01-05 |
1 | 2020-01-10 | 2020-01-05 |
2 | 2020-01-01 | 2020-01-01 |
2 | 2020-01-02 | 2020-01-01 |
2 | 2020-01-04 | 2020-01-02 |
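- Result after group by, before the having filter is applied. The third column can be read as a per-session count: the number of days in that run of consecutive logins. With having count(*) >= 3, only the row for user 1 with 4 days remains.
user_id | con_login_date (normalized streak date) | nums (days in this streak) |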
---|---|---|
1 | 2020-01-01 | 2 |
1 | 2020-01-05 | 4 |
2 | 2020-01-01 | 2 |
2 | 2020-01-02 | 1 |
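The three tables above can be reproduced with ordinary Scala collections. This is a rough sketch under the same assumptions as the test data, not the author's code; `DateMinusRankSketch`, `sampleLogins` and `streaks` are hypothetical names chosen for illustration. The zero-based index from `zipWithIndex` equals rk - 1, so subtracting it matches date_add(login_date, 1 - rk).

```scala
import java.time.LocalDate

// Illustrative sketch only: all names here are hypothetical.
object DateMinusRankSketch {
  // Assumption: same rows as dwd.login_log, as (user_id, login_date)
  val sampleLogins: Seq[(Int, String)] = Seq(
    (1, "2020-01-01"), (1, "2020-01-02"), (1, "2020-01-07"),
    (1, "2020-01-08"), (1, "2020-01-09"), (1, "2020-01-10"),
    (2, "2020-01-01"), (2, "2020-01-02"), (2, "2020-01-04")
  )

  // (user_id, con_login_date) -> number of days in that streak
  def streaks(logins: Seq[(Int, String)]): Map[(Int, LocalDate), Int] =
    logins
      .groupBy(_._1) // partition by user_id
      .toSeq
      .flatMap { case (userId, rows) =>
        rows
          .map(r => LocalDate.parse(r._2))
          .distinct              // one row per user per day
          .sortBy(_.toEpochDay)  // order by login_date
          .zipWithIndex          // index = rk - 1
          .map { case (date, idx) => (userId, date.minusDays(idx.toLong)) }
      }
      .groupBy(identity) // group by user_id, con_login_date
      .map { case (key, rows) => key -> rows.size }

  def main(args: Array[String]): Unit = {
    val all = streaks(sampleLogins)
    // Equivalent of "having count(*) >= 3"
    val atLeastThree = all.filter { case (_, nums) => nums >= 3 }
    atLeastThree.foreach(println) // prints ((1,2020-01-05),4)
  }
}
```

Before filtering, `streaks(sampleLogins)` holds the same four rows as the group by table: 2 and 4 days for user 1, 2 and 1 days for user 2.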
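Implementing it in code
- To find N consecutive login days in code, the core logic is: sort each user's logins by date; if a new date is exactly one day after the previous one, keep it in the current run (a HashMap, or simply a counter); once the run reaches N days, output the user_id; if there is a gap, clear the run and start again.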
package cn.lang.spark_core

import java.text.SimpleDateFormat
import java.util.Calendar

import org.apache.spark.sql.SparkSession

object ContinuousLoginDays {

  // Defined at object level rather than inside main so that Spark can serialize it; Kryo can be used as well
  case class Login(uid: Int, loginTime: String)

  /** Get the date `last_n` days before biz_date (a negative `last_n` gives a later date).
    * Example: biz_date = 2020-01-01, last_n = 1 returns 2019-12-31.
    * An unparseable biz_date throws a ParseException instead of being silently ignored. */
  def getLastNDate(biz_date: String,
                   date_format: String = "yyyy-MM-dd",
                   last_n: Int = 1): String = {
    val calendar: Calendar = Calendar.getInstance()
    val sdf = new SimpleDateFormat(date_format)
    calendar.setTime(sdf.parse(biz_date))
    calendar.add(Calendar.DATE, -last_n)
    sdf.format(calendar.getTime)
  }

  def main(args: Array[String]): Unit = {
    // required number of consecutive login days
    val N = 3
    // env
    val spark: SparkSession = SparkSession
      .builder()
      .appName("ContinuousLoginDays")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // source: could also be loaded from Hive (with Hive support enabled) or from Parquet files (with a defined schema)
    // assumes tab-separated text lines of the form user_id<TAB>login_date
    val source = sc.textFile("/user/hive/warehouse/dwd/login_log")

    // transform
    val result = source
      .map(_.split("\t"))
      .map(item => Login(item(0).toInt, item(1)))
      .groupBy(_.uid) // RDD[(Int, Iterable[Login])]
      .map { case (uid, logins) =>
        // distinct dates in ascending order; yyyy-MM-dd strings sort chronologically
        val dates = logins.map(_.loginTime).toSeq.distinct.sorted
        var lastLoginTime: String = ""
        var loginDays: Int = 0    // length of the current streak
        var maxLoginDays: Int = 0 // longest streak seen so far
        dates.foreach { date =>
          if (lastLoginTime != "" && getLastNDate(date) == lastLoginTime) {
            loginDays += 1 // exactly one day after the previous login: extend the streak
          } else {
            loginDays = 1 // first login or a gap: start a new streak
          }
          lastLoginTime = date
          maxLoginDays = math.max(maxLoginDays, loginDays)
        }
        /** A collection could be used here to keep every streak;
          * this version only marks whether the user has a streak of at least N days. */
        (uid, maxLoginDays >= N)
      }
      .filter(_._2)
      .map(_._1)

    // sink: collect to the driver before printing
    result.collect().foreach(println)
    spark.stop()
  }
}
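The comment near the end of the job notes that a collection could keep the streaks themselves rather than a single flag. One possible way to do that, sketched without Spark so it can be tried in isolation, is a single fold over one user's sorted dates. `StreakScanSketch`, `Streak`, `streaksOf` and `hasStreakOfAtLeast` are hypothetical names, and the sketch assumes dates in yyyy-MM-dd format.

```scala
import java.time.LocalDate

// Illustrative sketch only: all names here are hypothetical.
object StreakScanSketch {
  // One run of consecutive login days
  final case class Streak(start: LocalDate, days: Int)

  // Single pass over one user's login dates (assumed yyyy-MM-dd),
  // returning every streak in chronological order
  def streaksOf(loginDates: Seq[String]): List[Streak] = {
    val sorted = loginDates.map(s => LocalDate.parse(s)).distinct.sortBy(_.toEpochDay)
    sorted.foldLeft(List.empty[Streak]) {
      // the date directly follows the last day of the newest streak: extend it
      case (Streak(start, days) :: rest, date) if start.plusDays(days.toLong) == date =>
        Streak(start, days + 1) :: rest
      // first date or a gap: open a new streak
      case (acc, date) =>
        Streak(date, 1) :: acc
    }.reverse
  }

  // True if any streak lasts at least n days
  def hasStreakOfAtLeast(loginDates: Seq[String], n: Int): Boolean =
    streaksOf(loginDates).exists(_.days >= n)

  def main(args: Array[String]): Unit = {
    val user1 = Seq("2020-01-01", "2020-01-02", "2020-01-07",
                    "2020-01-08", "2020-01-09", "2020-01-10")
    streaksOf(user1).foreach(println)
    println(hasStreakOfAtLeast(user1, 3)) // prints true
  }
}
```

Inside the Spark job, a function like `streaksOf` could be applied to each user's dates after the groupBy, which would keep the start date and length of every streak instead of only the longest one.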
Source: https://blog.csdn.net/Eden_lang/article/details/112438884