首页 > 数据库> 文章详细

[ Practice ] 日志分析：某答题系统（SQL）

2021-02-12 16:31:03 阅读：243 来源： 互联网

标签：exam 答题 Practice question ex student SQL id accuracy

某答题系统日志分析

一、数据描述
二、准备工作
- 1.准备数据
- 2.准备环境
三、功能需求
四、查询工作
- 1.使用 ex_exam_record 表中的数据统计每个学员总分、答题的试题数和正确率，并保存到 ex_exam_anlysis 表中。
- 2.使用 ex_exam_record 表中的数据统计每个作对，做错，半对的题目列表。

一、数据描述

这是一份来自于某在线考试系统的学员答题批改日志，日志中记录了日志生成时间，题目，难度系数，题目所属的知识点 ID，做题的学生 ID，题目 ID 以及作答批改结果。日志的结构如下：

在这里插入图片描述

二、准备工作

1.准备数据

answer_question.log	提取码：ezy6

2.准备环境

#启动服务：
start-dfs.sh

start-yarn.sh

nohup hive –service metastore>/dev/null 2>&1 &
nohup hive –service hiveserver2>/dev/null 2>&1 &

zkServer.sh start

start-hbase.sh

三、功能需求

1.上传数据至HDFS

在 HDFS 中创建目录 /app/data/exam，并将 answer_question.log 传到该目录。
提示：HDFS 创建目录时要一层层的创建

hdfs dfs -mkdir /app
hdfs dfs -mkdir /app/data
hdfs dfs -mkdir /app/data/exam
hdfs dfs -put answer_question.log /app/data/exam

2.使用 SQL 完成分析

①提取日志中的知识点 ID，学生 ID，题目 ID，作答结果 4 个字段的值。

详细步骤：

在 Hive 中创建临时表 ex_exam7_tmp

create database exam;
use exam;
create external table ex_exam7_tmp(
req string,
resp string
)
row format delimited
fields terminated by ','
location '/app/data/exam'

在 Hive 中创建考试记录表 ex_exam_record ，表结构如下：

字段名称	字段类型	字段含义
topic_id	string	知识点ID
student_id	string	学生ID
question_id	string	题目ID
score	float	作答结果

create external table ex_exam_record(
topic_id string,
student_id string,
question_id string,
score float
)
row format delimited
fields terminated by ','

向 ex_exam_record 表中插入数据

insert into ex_exam_record(topic_id,student_id,question_id,score)
select * from(
  select
    regexp_extract(req,'.*?_(.*?)_(.*?)_(.*?)r.*?',1) topic_id,
    regexp_extract(req,'.*?_(.*?)_(.*?)_(.*?)r.*?',2) student_id,
    regexp_extract(req,'.*?_(.*?)_(.*?)_(.*?)r.*?',3) question_id,
    cast(regexp_extract(req,'.*?_(.*?)_(.*?)_(.*?)r (\\d(.\\d)?)',4) as float) score
  from ex_exam7_tmp
)T

②将提取后的结果保存到 HDFS 的 /app/data/result 目录下。

sqoop import \
--connect jdbc:hive2://single:10000/exam \
--driver org.apache.hive.jdbc.HiveDriver \
--table ex_exam_record \
--username root \
--password 123456 \
--target-dir /exam/data/result \
-m 3

3.创建HBase数据表

在 HBase 中创建命名空间 (namespace) exam ，在该命名空间下创建 analysis 表，使用 学生 ID 作为 RowKey，该表下有 2 个列族 accuracy、question 。

accuracy 列族	question 列族
总分 accuracy:total_score	正确 question:right
答题的试题数 accuracy:question_count	错误 question:error
正确率 accuracy:accuracy	半对 question:half

创建命名空间

create_namespace 'exam'

创建表 analysis

create 'exam:analysis','accuracy','question'

查看命名空间 exam 中的表

list_namespace_tables 'exam'

4.创建Hive数据表

在 Hive 中创建数据库 exam，在该数据库中创建外部表 ex_exam_record 指向 /app/data/result 下 Spark 处理后的日志数据 ;创建外部表 ex_exam_anlysis 映射至 HBase 中的 analysis 表的 accuracy 列族;创建外部表 ex_exam_question 映射至 HBase 中的 analysis 表的 question 列族

ex_exam_anlysis 表结构如下：

字段名称	字段类型	字段含义
student_id	string	学生ID
total_score	float	总分
question_count	int	答题的试题数
accuracy	float	正确率

create external table if not exists ex_exam_anlysis(
student_id string,
total_score float,
question_count int,
accuracy float
)
stored by 
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
with serdeproperties ("hbase.columns.mapping"=
":key,accuracy:total_score,accuracy:question_count,accuracy:accuracy")
tblproperties("hbase.table.name"="exam:analysis");

ex_exam_question 表结构如下：

字段名称	字段类型	字段含义
student_id	string	学生ID
right	string	所有做对的题目的ID列表
half	string	所有半对的题目的ID列表
error	string	所有做错的题目的ID列表

create external table if not exists ex_exam_question(
student_id string,
right string,
half string,
error string
)
stored by 
'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
with serdeproperties  ("hbase.columns.mapping"=
":key,question:right,question:half,question:error")
tblproperties("hbase.table.name"="exam:analysis");

四、查询工作

1.使用 ex_exam_record 表中的数据统计每个学员总分、答题的试题数和正确率，并保存到 ex_exam_anlysis 表中。

提示：正确率=总分/答题的试题数

查询并插入数据至表 ex_exam_anlysis

insert into ex_exam_anlysis 
select student_id, sum(score) as total_score, count(1) as question_count, 
sum(score)/count(1) as accuracy 
from ex_exam_record group by student_id;

查询 ex_exam_anlysis 表中是否有数据

select * from ex_exam_anlysis;

在 hbase 中查询数据(列簇 accuracy )

scan 'exam:analysis',{COLUMNS=>'accuracy'}

2.使用 ex_exam_record 表中的数据统计每个作对，做错，半对的题目列表。

①题目 id 以逗号分割，并保存到 ex_exam_question 表中。

插入数据：

insert into ex_exam_question(student_id,error,half,right)
select student_id,concat_ws(',',error) error,concat_ws(',',half) half,concat_ws(',',right) right
from(
  select student_id,info error,
  lead(info,1,NULL) over(distribute by student_id sort by score) as half,
  lead(info,2,NULL) over(distribute by student_id sort by score) as right,
  row_number() over(distribute by student_id sort by score) as rn
  from(
    select student_id,score,collect_set(question_id) info
    from ex_exam_record
    group by student_id,score
  )T
)T
where rn=1

②完成统计后，在 HBase Shell 中遍历 exam:analysis 表并只显示 question 列族中的数据。

提示：由于 hive 中外部表 ex_exam_question 已经映射至 hbase:exam:analysis 表 question 列簇，这里直接在 habse shell 命令行查询即可

scan 'exam:analysis',{COLUMNS=>'question'}