Counting how often elements occur in a sequence and getting the top K

2021-08-23 02:00:45


Convert the sequence into a counting dictionary {element: frequency}, then sort by frequency.

1. Build the counting dictionary with dict.fromkeys()

from random import randint

# Create a list of 30 random integers between 0 and 20
L = [randint(0, 20) for _ in range(30)]
print(L)

# Create a dict with one key per distinct element, every value initialised to 0
d = dict.fromkeys(L, 0)
print(d)
# {20: 0, 3: 0, 9: 0, 7: 0, 6: 0, 14: 0, 8: 0, 19: 0, 15: 0, 18: 0, 12: 0, 4: 0, 17: 0, 5: 0, 1: 0, 0: 0, 2: 0}

# Count the occurrences
for i in L:
    d[i] += 1

print(d)
# {20: 2, 3: 3, 9: 2, 7: 3, 6: 1, 14: 2, 8: 2, 19: 2, 15: 2, 18: 1, 12: 1, 4: 2, 17: 2, 5: 2, 1: 1, 0: 1, 2: 1}
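
Because the list above is random, its output changes on every run. The following is a minimal sketch of the same dict.fromkeys() approach on a fixed input, so the result can be checked by hand. The function name count_with_fromkeys and the sample data are made up for illustration.

```python
def count_with_fromkeys(items):
    """Count occurrences using a dict pre-filled with zeros."""
    # One key per distinct element, in first-seen order, all starting at 0
    counts = dict.fromkeys(items, 0)
    for item in items:
        counts[item] += 1
    return counts


sample = [3, 1, 3, 2, 1, 3]  # hypothetical sample data
counts = count_with_fromkeys(sample)
print(counts)  # {3: 3, 1: 2, 2: 1}
```

Note that this approach iterates over the input twice (once to create the keys, once to count), so it needs a re-iterable sequence such as a list rather than a one-shot generator.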

2. Build the counting dictionary with dict.setdefault()

from random import randint

L = [randint(0, 20) for _ in range(30)]

d = {}
for i in L:
    d[i] = d.setdefault(i, 0) + 1

print(d)
# {13: 2, 14: 2, 9: 2, 8: 3, 1: 2, 5: 2, 7: 1, 20: 2, 10: 3, 0: 2, 18: 1, 4: 1, 3: 1, 17: 2, 16: 1, 12: 2, 11: 1}
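
Since the assignment overwrites the value on every step, dict.get() gives the same counts as dict.setdefault() in this pattern. The sketch below compares the two on a fixed list; the function names count_setdefault and count_get are hypothetical, chosen only for this example.

```python
def count_setdefault(iterable):
    """Single-pass count using dict.setdefault(), as in the snippet above."""
    counts = {}
    for item in iterable:
        counts[item] = counts.setdefault(item, 0) + 1
    return counts


def count_get(iterable):
    """Same single-pass count using dict.get() with a default of 0."""
    counts = {}
    for item in iterable:
        counts[item] = counts.get(item, 0) + 1
    return counts


words = ["b", "a", "b", "c", "b", "a"]  # hypothetical sample data
by_setdefault = count_setdefault(words)
by_get = count_get(iter(words))  # a one-shot iterator works, only one pass is needed
print(by_setdefault)  # {'b': 3, 'a': 2, 'c': 1}
print(by_setdefault == by_get)  # True
```

Both variants make a single pass over the input, so unlike the dict.fromkeys() approach they also work on generators.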

3. Get the top N by frequency with heapq.nlargest(n, iterable, key=None)

heapq.nlargest(n, iterable, key=key) is equivalent to: sorted(iterable, key=key, reverse=True)[:n]

from random import randint
import heapq

# Sort by frequency and take the top 3
L = [randint(0, 20) for _ in range(30)]
d = {}
for i in L:
    d[i] = d.setdefault(i, 0) + 1

s = sorted(d.items(), key=lambda x: x[1], reverse=True)[:3]
print(s)
# [(16, 4), (19, 3), (13, 3)]

# Use a heap to take the top 3
r = heapq.nlargest(3, d.items(), key=lambda x: x[1])
print(r)
# [(16, 4), (19, 3), (13, 3)]
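
The two approaches above can be wrapped in a small helper and checked against each other on fixed data. This is only a sketch: top_k is a hypothetical name, and the counts are chosen without ties so the expected order is unambiguous.

```python
import heapq
from operator import itemgetter


def top_k(counts, k):
    """Return the k (element, frequency) pairs with the highest frequency."""
    # itemgetter(1) picks the frequency out of each (element, frequency) pair
    return heapq.nlargest(k, counts.items(), key=itemgetter(1))


counts = {"a": 5, "b": 3, "c": 1, "d": 2}  # hypothetical counting dict
top2 = top_k(counts, 2)
by_sort = sorted(counts.items(), key=itemgetter(1), reverse=True)[:2]
print(top2)             # [('a', 5), ('b', 3)]
print(top2 == by_sort)  # True
```

According to the heapq documentation, nlargest() performs best for small n; for larger n a full sorted() is more efficient, and for n == 1 the built-in max() is the better choice.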

4. Count frequencies with Counter

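A Counter is a dict subclass for counting hashable objects.
Elements are stored as dictionary keys, and their counts are stored as the values.
The counts are not kept sorted internally: iteration follows insertion order. Only the printed representation and most_common() list elements from the most common to the least.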

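This is probably the simplest way to count frequencies: there is no need to build the counting dictionary by hand, because a Counter works directly on an iterable.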
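
A short sketch of how a Counter behaves compared with a plain dict, using made-up sample data: looking up a missing element returns 0 instead of raising KeyError, iteration follows first-seen order, and most_common() is what returns the elements in descending order of count.

```python
from collections import Counter

c = Counter(["x", "y", "y", "z", "z", "z"])  # hypothetical sample data
print(c["z"])           # 3
print(c["q"])           # 0: a missing element counts as zero, no KeyError
print("q" in c)         # False: the lookup above did not add the key
print(list(c))          # ['x', 'y', 'z'], first-seen order, not sorted by count
print(c.most_common())  # [('z', 3), ('y', 2), ('x', 1)]
```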

from collections import Counter

c1 = Counter()                           # a new, empty counter
c2 = Counter('gallahad')                 # a new counter from an iterable
c3 = Counter({'red': 4, 'blue': 2})      # a new counter from a mapping
c4 = Counter(cats=4, dogs=8)             # a new counter from keyword args

from random import randint
from collections import Counter

L = [randint(0, 20) for _ in range(30)]
c = Counter(L)
print(c)
# Counter({16: 4, 19: 3, 13: 3, 3: 3, 1: 2, 18: 2, 14: 2, 10: 2, 9: 2, 4: 2, 7: 1, 20: 1, 15: 1, 5: 1, 8: 1})

# Use most_common() to get the top N; in CPython, most_common(n) is implemented with heapq.nlargest
r = c.most_common(3)
print(r)
# [(16, 4), (19, 3), (13, 3)]

# Update the Counter in place, adding the counts from another Counter
c2 = Counter(L)
c.update(c2)
print(c)
# Counter({16: 8, 19: 6, 13: 6, 3: 6, 1: 4, 18: 4, 14: 4, 10: 4, 9: 4, 4: 4, 7: 2, 20: 2, 15: 2, 5: 2, 8: 2})
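
Unlike dict.update(), which replaces values, Counter.update() adds counts. The sketch below, with hypothetical sample data, shows the in-place update() next to the + operator, which returns a new Counter and leaves both operands unchanged.

```python
from collections import Counter

morning = Counter({"apple": 3, "pear": 1})  # hypothetical sample data
evening = Counter({"apple": 2, "plum": 4})

# + builds a new Counter; morning and evening are not modified
merged = morning + evening

# update() modifies the Counter in place, so work on a copy here
total = Counter(morning)
total.update(evening)

# update() also accepts a plain iterable of elements
total.update(["pear", "pear"])

print(merged)  # Counter({'apple': 5, 'plum': 4, 'pear': 1})
print(total)   # Counter({'apple': 5, 'plum': 4, 'pear': 3})
```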

from collections import Counter
import re

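# Word-frequency count: take the top 5
with open('example.txt') as f:
    txt = f.read()

# A raw string avoids the invalid-escape warning for '\W'; findall(r'\w+')
# also skips the empty strings that re.split can yield at the text boundaries
w = re.findall(r'\w+', txt)
print(w)
c2 = Counter(w)
r = c2.most_common(5)
print(r)
# [('a', 21), ('the', 16), ('to', 15), ('and', 12), ('Service', 8)]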
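
The snippet above depends on an example.txt file that is not included here. The following self-contained sketch runs the same word count on an inline string instead. The function name top_words, the sample sentence, and the choice to lowercase the text before counting are all assumptions made for this illustration.

```python
import re
from collections import Counter


def top_words(text, n):
    """Return the n most frequent words in text, ignoring case."""
    # \w+ matches runs of word characters, so punctuation and spaces are skipped
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(n)


text = "The cat and the dog, and the bird."  # hypothetical sample text
top = top_words(text, 2)
print(top)  # [('the', 3), ('and', 2)]
```

Lowercasing merges "The" and "the" into one entry; drop the .lower() call if case should be preserved, as in the original snippet.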

Source: https://www.cnblogs.com/keithtt/p/15174267.html
