013 Discretizing Data

2021-05-02 22:05:16

Tags: count ... df list discretization print 013 genre data

Discretizing data with pandas

Task: given a movie dataset in which each movie can carry several genre labels, count how many movies fall into each genre.

Data format:

   Rank                    Title                     Genre  \
0     1  Guardians of the Galaxy   Action,Adventure,Sci-Fi   
1     2               Prometheus  Adventure,Mystery,Sci-Fi   
2     3                    Split           Horror,Thriller   
3     4                     Sing   Animation,Comedy,Family   
4     5            Suicide Squad  Action,Adventure,Fantasy   

                                         Description              Director  \
0  A group of intergalactic criminals are forced ...            James Gunn   
1  Following clues to the origin of mankind, a te...          Ridley Scott   
2  Three girls are kidnapped by a man with a diag...    M. Night Shyamalan   
3  In a city of humanoid animals, a hustling thea...  Christophe Lourdelet   
4  A secret government agency recruits some of th...            David Ayer   

                                              Actors  Year  Runtime (Minutes)  \
0  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...  2014                121   
1  Noomi Rapace, Logan Marshall-Green, Michael Fa...  2012                124   
2  James McAvoy, Anya Taylor-Joy, Haley Lu Richar...  2016                117   
3  Matthew McConaughey,Reese Witherspoon, Seth Ma...  2016                108   
4  Will Smith, Jared Leto, Margot Robbie, Viola D...  2016                123
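
Before the pandas solution, the counting itself can be sketched in plain Python. The snippet below is only an illustration, not the author's method: it hard-codes the Genre strings of the five sample rows above, and the names sample_genres and genre_counter are made up for this example.

```python
from collections import Counter

# Genre strings copied from the five sample rows shown above.
sample_genres = [
    "Action,Adventure,Sci-Fi",
    "Adventure,Mystery,Sci-Fi",
    "Horror,Thriller",
    "Animation,Comedy,Family",
    "Action,Adventure,Fantasy",
]

# Split each string on commas and tally every label that appears.
genre_counter = Counter(
    genre for row in sample_genres for genre in row.split(",")
)

# Print labels from least to most frequent, similar in spirit to sort_values().
for genre, count in sorted(genre_counter.items(), key=lambda kv: kv[1]):
    print(genre, count)
```

A movie with three labels contributes to three tallies, so the per-genre counts add up to more than the number of movies.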

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

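# Path to the CSV file (placeholder; replace with the actual path)
file_path = 'data file'
# Display settings: show up to 20 columns when printing
pd.set_option('display.max_columns', 20)
# Read the data
df = pd.read_csv(file_path)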

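# Split each Genre string into a list of labels, giving a nested list of the form [[...], [...], ...]
temp_list = df['Genre'].str.split(',').tolist()
# print(temp_list)
# Flatten the nested list and deduplicate it to get the unique genre labels
genre_list = list(set([i for j in temp_list for i in j]))

# Build an all-zero DataFrame: one row per movie, one column per genre
zeros_df = pd.DataFrame(np.zeros((df.shape[0], len(genre_list))), columns=genre_list)
# print(zeros_df)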

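# Walk through every row and set the columns of that movie's genres to 1
for i in range(df.shape[0]):
    # temp_list[i] holds the genre labels of movie i; these are exactly column names of zeros_df,
    # so loc can set all of them to 1 in a single assignment
    zeros_df.loc[i, temp_list[i]] = 1
# print(zeros_df.head(3))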

# Sum each column to get the number of movies per genre
genre_count = zeros_df.sum(axis=0)
print(genre_count.sort_values())
# Sort (optional, needed only for the plot below)
# genre_count = genre_count.sort_values()
# _x = genre_count.index
# _y = genre_count.values
# # print(_x)
# # print("*"*100)
# # print(_y)
#
# # Plot a bar chart of the counts
# plt.figure(figsize=(20, 8), dpi=80)
# plt.bar(range(len(_x)), _y)
# plt.xticks(range(len(_x)), _x)
# plt.show()

Result:

Musical        5.0
Western        7.0
War           13.0
Music         16.0
Sport         18.0
History       29.0
Animation     49.0
Family        51.0
Biography     81.0
Fantasy      101.0
Mystery      106.0
Horror       119.0
Sci-Fi       120.0
Romance      141.0
Crime        150.0
Thriller     195.0
Adventure    259.0
Comedy       279.0
Action       303.0
Drama        513.0
dtype: float64

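
For comparison, pandas can build the same 0/1 matrix without an explicit loop. The following is a minimal sketch under stated assumptions, not the approach used above: it runs on a toy DataFrame rebuilt from the five sample rows (toy_df and the counts_* names are hypothetical), uses Series.str.get_dummies to one-hot encode the Genre column, and cross-checks the result with explode plus value_counts.

```python
import pandas as pd

# Toy frame rebuilt from the five sample rows; real data would come from pd.read_csv.
toy_df = pd.DataFrame({
    "Title": ["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"],
    "Genre": [
        "Action,Adventure,Sci-Fi",
        "Adventure,Mystery,Sci-Fi",
        "Horror,Thriller",
        "Animation,Comedy,Family",
        "Action,Adventure,Fantasy",
    ],
})

# One-hot encode: one indicator column per genre label, playing the role
# of the zero matrix that the loop fills in.
one_hot = toy_df["Genre"].str.get_dummies(sep=",")
counts_from_dummies = one_hot.sum(axis=0).sort_values()

# Alternative: one row per (movie, genre) pair, then count each label.
counts_from_explode = toy_df["Genre"].str.split(",").explode().value_counts()

print(counts_from_dummies)
print(counts_from_explode)
```

Because the zero matrix in the original code is created with np.zeros, its column sums are floats (5.0, 7.0, and so on); value_counts returns integer counts instead.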

Source: https://blog.csdn.net/weixin_47326735/article/details/116357417
