
Big Data Learning Tutorial, SD Edition, Part 9: Flume

2021-12-26 14:58:16


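Flume is a log collection tool. Since it is a tool, this article focuses on how to use it.

It is a distributed framework for collecting, processing, and aggregating streaming data.

Flume collects data according to a collection plan, that is, a configuration file. The available configuration options are described in the official documentation.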

1. Flume Architecture


Agent: a JVM process

Source: receives data
Channel: buffer
Sink: outputs data

Event: the unit of transfer
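
To make the roles concrete, here is a minimal sketch of one agent as a source, a channel, and a sink passing events along. It is a dependency-free model, not Flume's real classes: AgentSketch, SimpleEvent, receive, and deliver are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical, dependency-free model of one agent; not Flume's real classes.
class AgentSketch {
    // An event is the unit of transfer: optional headers plus a byte[] body.
    static final class SimpleEvent {
        final Map<String, String> headers = new HashMap<>();
        final byte[] body;
        SimpleEvent(String text) { this.body = text.getBytes(StandardCharsets.UTF_8); }
    }

    // The channel buffers events between the source and the sink.
    final Queue<SimpleEvent> channel = new ArrayDeque<>();

    // Source side: wrap incoming data in an event and put it into the channel.
    void receive(String line) { channel.add(new SimpleEvent(line)); }

    // Sink side: take the next event from the channel and return its text,
    // or null when the channel is empty.
    String deliver() {
        SimpleEvent e = channel.poll();
        return e == null ? null : new String(e.body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        AgentSketch agent = new AgentSketch();
        agent.receive("hello");
        agent.receive("flume");
        System.out.println(agent.deliver()); // hello
        System.out.println(agent.deliver()); // flume
    }
}
```

In a real agent the source and the sink run on separate threads and the channel is configured independently; the sketch only shows the direction of the data flow.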

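2. Flume Installation

Configure the Java and Hadoop environment variables in advance. After that, extract the archive and Flume is ready to use.

3. Flume Official Example

The official documentation includes example configurations for the different sources, channels, and sinks.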

# example.conf : port -> console
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start command

bin/flume-ng agent -c conf -f jobs/example.conf -n a1 -Dflume.root.logger=INFO,console

Send data

# yum install -y nc
nc localhost 44444

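4. Flume Examples

4.1 New File Content -> HDFS

Collects content newly appended to a file and writes it to HDFS. It cannot resume from where it left off after a restart.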

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /data/test.log

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%Y%m%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 24
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.fileType = DataStream

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start

bin/flume-ng agent -c conf -f jobs/log2hdfs.conf -n a1

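4.2 New Files in a Directory -> HDFS

Collects new files that appear in a directory and writes them to HDFS. It cannot monitor changes to the content of a file.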

a1.sources = src-1
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = c1
a1.sources.src-1.spoolDir = /data/data1
a1.sources.src-1.fileHeader = true

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream

Start

bin/flume-ng agent -c conf -f jobs/file2hdfs.conf -n a1

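4.3 New Files and File Content in Directories -> HDFS

Monitors files, and changes to their content, under multiple directories and writes them to HDFS. It can resume from where it left off. Note that log4j renames log files when it rolls them, and a renamed file is uploaded again.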

a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /data/data2/.*file.*
a1.sources.r1.filegroups.f2 = /data/data3/.*log.*
a1.sources.r1.maxBatchCount = 1000

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events2/%Y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.fileType = DataStream

Start

bin/flume-ng agent -c conf -f jobs/dir2hdfs.conf -n a1

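An entry in the position file looks like this:

[{"inode":786450,"pos":1501,"file":"/data/data2/file1.txt"}]

The source code uses the inode and the file path together to identify a file.

To handle renamed files, modify TailFile.java (line 123) and ReliableTaildirEventReader.java (line 256), rebuild, and replace the taildir source jar under the lib directory.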
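
The effect of identifying a file by inode and path together can be illustrated with a small model. This is a sketch, not the TAILDIR source code: TailPositionSketch, record, and resumePos are hypothetical names, and the renamed path in the demo is made up. The text does not say what the patch changes; the inodeOnly flag reflects the assumption that the comparison is relaxed to the inode alone.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of position tracking; not the actual TAILDIR source code.
class TailPositionSketch {
    static final class Tracked {
        final String path;
        final long pos;
        Tracked(String path, long pos) { this.path = path; this.pos = pos; }
    }

    // Last known path and read offset, keyed by inode.
    final Map<Long, Tracked> byInode = new HashMap<>();

    void record(long inode, String path, long pos) {
        byInode.put(inode, new Tracked(path, pos));
    }

    // Offset to resume from. With inodeOnly=false the path must match as well,
    // so a renamed file is treated as new and read again from offset 0.
    long resumePos(long inode, String path, boolean inodeOnly) {
        Tracked t = byInode.get(inode);
        if (t == null) return 0L;
        if (!inodeOnly && !t.path.equals(path)) return 0L;
        return t.pos;
    }

    public static void main(String[] args) {
        TailPositionSketch s = new TailPositionSketch();
        s.record(786450L, "/data/data2/file1.txt", 1501L);
        // "/data/data2/file1.txt.1" is a made-up name for the rolled file.
        System.out.println(s.resumePos(786450L, "/data/data2/file1.txt.1", false)); // 0
        System.out.println(s.resumePos(786450L, "/data/data2/file1.txt.1", true));  // 1501
    }
}
```

A rename keeps the inode, so matching on the inode alone lets the reader continue from the recorded offset instead of uploading the whole file again.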

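5. Flume Transactions

The source pushes events to the channel and the sink pulls events from the channel. In both directions the events first go into a temporary buffer.

Source -> Channel: doPut writes to putList. A rollback simply clears the buffered events, so data can be lost unless the source keeps a position record.
Channel -> Sink: doTake reads into takeList. A rollback writes the taken events back to the channel queue in reverse order, so data can be duplicated.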
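
The put and take semantics can be sketched with three queues. This is a simplified, hypothetical model (ChannelTxSketch is not a Flume class, and it folds put and take into one object, whereas a real transaction handles one direction at a time). It assumes that a put rollback discards the staged events and that a take rollback returns the taken events to the head of the queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified, hypothetical model of a memory channel's put/take transactions.
// Not Flume's real classes; the field names mirror the terms used in the text.
class ChannelTxSketch {
    final Deque<String> queue = new ArrayDeque<>();    // the channel queue
    final Deque<String> putList = new ArrayDeque<>();  // buffer for Source -> Channel
    final Deque<String> takeList = new ArrayDeque<>(); // buffer for Channel -> Sink

    // doPut: stage an event; it is not visible in the queue yet.
    void doPut(String event) { putList.addLast(event); }

    // doTake: move the head of the queue into takeList and hand it to the sink.
    String doTake() {
        String event = queue.pollFirst();
        if (event != null) takeList.addLast(event);
        return event;
    }

    // Commit: staged puts enter the queue; taken events are gone for good.
    void commit() {
        queue.addAll(putList);
        putList.clear();
        takeList.clear();
    }

    // Rollback: staged puts are dropped; taken events return to the head of
    // the queue in their original order.
    void rollback() {
        putList.clear();
        while (!takeList.isEmpty()) queue.addFirst(takeList.removeLast());
    }

    public static void main(String[] args) {
        ChannelTxSketch ch = new ChannelTxSketch();
        ch.doPut("e1");
        ch.doPut("e2");
        ch.commit();                  // queue: e1, e2
        ch.doPut("e3");
        ch.rollback();                // e3 is dropped unless the source re-reads it
        ch.doTake();                  // the sink receives e1
        ch.rollback();                // e1 goes back; the sink will see it again
        System.out.println(ch.queue); // [e1, e2]
    }
}
```

The demo shows both risks named above: e3 is lost on the put side, and e1 is delivered twice on the take side if the sink had already written it before the rollback.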

6. How a Flume Agent Works

Source: receives data
Source -> Channel Processor: handles the events
Channel Processor -> Interceptor: intercepts and filters events
Channel Processor -> Channel Selector: replicating by default; multiplexing is also available
Channel Processor -> Channel n: writes the event to the channel
Channel -> Sink Processor: three kinds: Default (a single sink, the default), LoadBalancing (load balancing), and Failover (failover)
Sink Processor -> Sink: writes to the sink
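
The difference between the two channel selectors can be sketched as two small functions. SelectorSketch and its method names are hypothetical stand-ins, not Flume's classes; the fallback list for events whose header has no mapping is an assumption added for the example.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the two selector strategies; not Flume's classes.
class SelectorSketch {
    // Replicating: every event goes to every channel.
    static List<String> replicating(List<String> allChannels) {
        return allChannels;
    }

    // Multiplexing: the value of one header picks the channels.
    // Events with a missing or unmapped header value go to defaultChannels.
    static List<String> multiplexing(Map<String, String> headers, String headerName,
                                     Map<String, List<String>> mapping,
                                     List<String> defaultChannels) {
        String value = headers.get(headerName);
        if (value == null) return defaultChannels;
        return mapping.getOrDefault(value, defaultChannels);
    }

    public static void main(String[] args) {
        Map<String, List<String>> mapping =
                Map.of("error", List.of("c1"), "normal", List.of("c2"));
        System.out.println(replicating(List.of("c1", "c2")));                           // [c1, c2]
        System.out.println(multiplexing(Map.of("type", "error"), "type", mapping, List.of())); // [c1]
    }
}
```

With the replicating selector the header is ignored; with the multiplexing selector an interceptor usually sets the header first so that the selector has something to route on.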

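7. Flume Topologies

Multiple Flume agents are connected to each other through Avro.

Polling strategy: when a sink pulls no data, the next sink takes its turn.

Simple chaining: Sink -> Source
Replication and multiplexing: multiple Channels -> multiple Sinks
Load balancing and failover: Channel -> multiple Sinks
Aggregation: multiple Sinks -> Source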
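
For the load balancing and failover topology, the choice of sink can be sketched as follows. SinkProcessorSketch is a hypothetical model, not Flume's sink processor code: roundRobin assumes sinks take turns in a fixed order, and failover assumes the sinks are listed from highest to lowest priority.

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for two sink-selection strategies; not Flume's classes.
class SinkProcessorSketch {
    private int next = 0;

    // Load balancing (round robin): each call moves on to the next sink.
    String roundRobin(List<String> sinks) {
        String chosen = sinks.get(next % sinks.size());
        next++;
        return chosen;
    }

    // Failover: pick the first sink, in priority order, that has not failed.
    // Returns null when every sink is down.
    static String failover(List<String> sinksByPriority, Set<String> failed) {
        for (String sink : sinksByPriority) {
            if (!failed.contains(sink)) return sink;
        }
        return null;
    }

    public static void main(String[] args) {
        SinkProcessorSketch p = new SinkProcessorSketch();
        List<String> sinks = List.of("k1", "k2");
        System.out.println(p.roundRobin(sinks));           // k1
        System.out.println(p.roundRobin(sinks));           // k2
        System.out.println(p.roundRobin(sinks));           // k1
        System.out.println(failover(sinks, Set.of("k1"))); // k2
    }
}
```

Load balancing spreads work across all healthy sinks, while failover keeps using one sink and switches only when that sink fails.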

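8. Flume Custom Interceptor

Use a custom interceptor to implement multiplexing:

Events are routed to different channels according to their header values.

Collected messages that contain Error or Exception go to one channel; all other messages go to another channel.

The sink behind each channel prints to the console.

Write the custom interceptor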

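package com.ipinyou.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Tags every event with a "type" header so that a multiplexing selector can route it.
public class TypeInterceptor implements Interceptor {

    // Reused for every batch instead of allocating a new list per call
    private List<Event> eventList;

    @Override
    public void initialize() {
        eventList = new ArrayList<>();
    }

    // Single event: type=error if the body contains "Error" or "Exception", otherwise type=normal
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        // Decode with an explicit charset (assumes the data is UTF-8) rather than the platform default
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        if (body.contains("Error") || body.contains("Exception")) {
            headers.put("type", "error");
        } else {
            headers.put("type", "normal");
        }
        return event;
    }

    // Batch: apply the single-event logic to each event in turn
    @Override
    public List<Event> intercept(List<Event> list) {
        eventList.clear();
        for (Event event : list) {
            eventList.add(intercept(event));
        }
        return eventList;
    }

    @Override
    public void close() {
        // No resources to release
    }

    // Flume instantiates the interceptor through this builder; see the "type" line in the config
    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {
            // No configuration parameters are needed
        }
    }
}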

Package the class and upload the jar to Flume's lib directory
Write the collection plans

flume-s1-s2.conf

a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 6666
a1.sources.r1.channels = c1 c2
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.ipinyou.flume.interceptor.TypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.error = c1
a1.sources.r1.selector.mapping.normal = c2

a1.channels = c1 c2
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 10000
a1.channels.c2.byteCapacityBufferPercentage = 20
a1.channels.c2.byteCapacity = 800000


a1.sinks = k1 k2
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 7771
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 7772

flume-console1.conf

a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 7771

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

flume-console2.conf

a1.sources = r1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 7772

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Start

# Start in this order: hadoop103, hadoop104, hadoop102
bin/flume-ng agent -c conf -f jobs/flume-console1.conf -n a1 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f jobs/flume-console2.conf -n a1 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f jobs/flume-s1-s2.conf -n a1

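9. Flume Custom Source

Implementation

The custom class extends AbstractSource and implements Configurable and PollableSource
Implement configure(): read the configuration file
Implement process(): receive external data, wrap it in an Event, and write it to the Channel

Package it into the lib directory
Write the configuration file

source type: the fully qualified class name

Start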
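
The configure() and process() steps can be sketched without the Flume dependency. The Status enum mirrors the READY and BACKOFF values of Flume's PollableSource; everything else (PollingSourceSketch, the external queue, the prefix setting) is a hypothetical stand-in, so treat this as a model of the contract rather than a class that can be deployed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Dependency-free model of a pollable source. Status mirrors the READY/BACKOFF
// values of Flume's PollableSource; every other name here is hypothetical.
class PollingSourceSketch {
    enum Status { READY, BACKOFF }

    final Queue<String> external = new ArrayDeque<>(); // stands in for the outside data
    final List<String> channel = new ArrayList<>();    // stands in for the channel
    String prefix = "";

    // configure(): read settings from the configuration (here a single value).
    void configure(String prefix) { this.prefix = prefix; }

    // process(): fetch one record, wrap it as an event, write it to the channel.
    // BACKOFF tells the caller there was nothing to do, so it should wait a bit.
    Status process() {
        String record = external.poll();
        if (record == null) return Status.BACKOFF;
        channel.add(prefix + record);
        return Status.READY;
    }

    public static void main(String[] args) {
        PollingSourceSketch src = new PollingSourceSketch();
        src.configure("log-");
        src.external.add("line1");
        System.out.println(src.process()); // READY
        System.out.println(src.process()); // BACKOFF
        System.out.println(src.channel);   // [log-line1]
    }
}
```

In a real source, process() is called repeatedly by Flume, and the event is handed to the channel processor rather than appended to a list.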

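10. Flume Custom Sink

Implementation

The custom class extends AbstractSink and implements Configurable
Implement configure(): read the configuration file
Implement process(): begin a transaction, take data from the Channel, and write it to the destination

The remaining steps are the same as above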
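
The shape of a sink's process() step can be modelled in the same dependency-free way. SinkProcessSketch and its fields are hypothetical; the point is only the order of operations: take an event, try to write it, keep it removed on success, and put it back on failure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Dependency-free model of a sink's process() step; all names are hypothetical.
class SinkProcessSketch {
    final Deque<String> channel = new ArrayDeque<>(); // stands in for the channel
    final List<String> output = new ArrayList<>();    // stands in for the destination
    boolean failNextWrite = false;                     // lets the demo force a failure

    // process(): take one event, write it out, and report whether it was delivered.
    // If the write fails, the event is returned to the head of the channel.
    boolean process() {
        String event = channel.pollFirst();            // begin transaction + take
        if (event == null) return false;               // nothing to deliver
        try {
            if (failNextWrite) {
                failNextWrite = false;
                throw new IllegalStateException("write failed");
            }
            output.add(event);                         // write to the destination
            return true;                               // commit: the event stays removed
        } catch (RuntimeException e) {
            channel.addFirst(event);                   // rollback: put the event back
            return false;
        }
    }

    public static void main(String[] args) {
        SinkProcessSketch sink = new SinkProcessSketch();
        sink.channel.add("e1");
        sink.failNextWrite = true;
        System.out.println(sink.process()); // false
        System.out.println(sink.channel);   // [e1]
        System.out.println(sink.process()); // true
        System.out.println(sink.output);    // [e1]
    }
}
```

A real sink obtains the transaction from its channel and also closes it when the step ends; those calls are left out here because the sketch has no Flume classes to call.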

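11. Flume Monitoring

Monitoring relies on Ganglia, a third-party open-source tool.

Ganglia components: web displays the data, gmetad stores the data, and gmond collects the data.

11.1 Installing Ganglia

Install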

# 102 103 104
yum install -y epel-release
# 102
yum install -y ganglia-gmetad
yum install -y ganglia-web
yum install -y ganglia-gmond
# 103 104
yum install -y ganglia-gmond

Edit the configuration files

/etc/httpd/conf.d/ganglia.conf

# Inside the Location block, allow the IP of your Windows host
Require ip 192.168.xxx.xxx

/etc/ganglia/gmetad.conf

data_source "my cluster" hadoop102

/etc/ganglia/gmond.conf (distribute to hadoop102, hadoop103, and hadoop104)

# Change the following settings
name = "my cluster"
host = hadoop102
bind = 0.0.0.0

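Disable SELinux in /etc/selinux/config. The change takes effect only after a reboot; setenforce applies it temporarily right away.

SELINUX=disabled

# Temporary, until the next reboot
setenforce 0

11.2 Starting Ganglia

# If permissions are insufficient, change them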
chmod -R 777 /var/lib/ganglia
# hadoop102
systemctl start gmond
systemctl start httpd
systemctl start gmetad

# hadoop103 hadoop104
systemctl start gmond

Open the web UI in a browser:

http://hadoop102/ganglia

11.3 Starting Flume

bin/flume-ng agent -n a1 -c conf -f jobs/xxx \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=ganglia \
-Dflume.monitoring.hosts=hadoop102:8649

Source: https://blog.csdn.net/qq_41200768/article/details/122155427
