1. Upload the test data to HDFS; here it is placed directly under the root directory as /test.txt.
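Assuming the data also exists locally as a file named test.txt (a placeholder name), the upload can be done with the standard `hadoop fs` commands; this sketch requires a running cluster:

```shell
hadoop fs -put -f test.txt /test.txt   # copy the local file to the HDFS root (-f overwrites)
hadoop fs -ls /test.txt                # confirm the file landed in HDFS
```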
2. In a working directory on the master node, create mapper.py, reducer.py, and run.sh:
- mapper.py
import sys

# Read lines from stdin and emit one "<word>\t1" pair per word.
for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print("%s\t%s" % (word, 1))
- reducer.py
import sys

current_word = None
current_count = 0
word = None

# The shuffle phase sorts by key, so all counts for the same word
# arrive contiguously and can be summed with a single running total.
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip malformed lines whose count is not an integer.
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# Flush the final word.
if word == current_word:
    print("%s\t%s" % (current_word, current_count))
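The map → shuffle (sort) → reduce flow above can be exercised in-process without a cluster. A minimal sketch (the function names `map_lines` and `reduce_sorted` are illustrative, not part of Hadoop):

```python
def map_lines(lines):
    # Mapper: emit one "<word>\t1" pair per word, mirroring mapper.py.
    for line in lines:
        for word in line.strip().split():
            yield "%s\t%s" % (word, 1)

def reduce_sorted(pairs):
    # Reducer: sum contiguous counts per key, mirroring reducer.py.
    # Assumes the input is already sorted by key (Hadoop's shuffle does this).
    current_word, current_count = None, 0
    for line in pairs:
        word, count = line.split('\t', 1)
        count = int(count)
        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                yield (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield (current_word, current_count)

text = ["hello world", "hello hadoop"]
result = dict(reduce_sorted(sorted(map_lines(text))))
print(result)  # {'hadoop': 1, 'hello': 2, 'world': 1}
```

The `sorted()` call stands in for Hadoop's shuffle phase; everything else is the same logic the two streaming scripts run line by line.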
- run.sh
#!/usr/bin/env bash
streaming_jar="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.2.jar"
input="/test.txt"
output="/output"

# Remove any previous output directory (the old -rmr form is deprecated in Hadoop 3.x).
hadoop fs -rm -r -f $output

# Generic -D options must come before the streaming-specific options.
# The mapred.* names are deprecated; their mapreduce.* equivalents are used here.
hadoop jar ${streaming_jar} \
    -D mapreduce.job.priority="VERY_HIGH" \
    -D mapreduce.job.maps=5 \
    -D mapreduce.job.name="streaming_wordcount" \
    -files mapper.py,reducer.py \
    -input $input \
    -output $output \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer.py"

if [ $? -ne 0 ]; then
    echo "streaming_wordcount job failed"
    exit 1
fi
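Because Hadoop Streaming only pipes lines through stdin/stdout, the whole job can be dry-run locally with plain Unix pipes before submitting to the cluster. A self-contained sketch (the here-docs recreate the two scripts from step 2 in the current directory, and `sort` stands in for the shuffle):

```shell
# Recreate mapper.py (same logic as step 2).
cat > mapper.py <<'EOF'
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))
EOF

# Recreate reducer.py (same logic as step 2).
cat > reducer.py <<'EOF'
import sys
current_word, current_count, word = None, 0, None
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word
if word == current_word:
    print("%s\t%s" % (current_word, current_count))
EOF

# Emulate map -> shuffle -> reduce: mapper | sort | reducer.
printf 'hello world\nhello hadoop\n' | python3 mapper.py | sort -k1,1 | python3 reducer.py
```

If the local pipeline produces the expected counts, the same scripts should behave identically under the streaming jar.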
3. Run it with: sh run.sh
.....
2022-09-11 03:06:09,869 INFO mapreduce.Job: map 0% reduce 0%
2022-09-11 03:06:15,931 INFO mapreduce.Job: map 14% reduce 0%
2022-09-11 03:06:20,971 INFO mapreduce.Job: map 100% reduce 0%
2022-09-11 03:06:21,979 INFO mapreduce.Job: map 100% reduce 100%
2022-09-11 03:06:21,994 INFO mapreduce.Job: Job job_1662694559814_0004 completed successfully
.....
In the HDFS web UI (Utilities → Browse the file system), you can see that the word-count results have been written to /output under the HDFS root.
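The results can also be inspected from the command line instead of the web UI; again a sketch that requires the running cluster:

```shell
hadoop fs -ls /output          # shows the _SUCCESS marker and part-* files
hadoop fs -cat /output/part-*  # prints the word counts themselves
```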
Source: https://www.cnblogs.com/liuliang1999/p/16683721.html