Natural Language Processing (NLP) Programming Practice 1.1: Sentiment Classification with Logistic Regression

2021-05-02 23:01:24

Tags: NLP, 1.1, tweet, print, freqs, np, test, theta, natural language

Content index: https://blog.csdn.net/weixin_43093481/article/details/114989382?spm=1001.2014.3001.5501
Code: https://github.com/Ogmx/Natural-Language-Processing-Specialization
------------------------------------------------------------

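Assignment 1: Logistic Regression

Learning objectives:
In this assignment you will learn logistic regression and use it for sentiment analysis on tweets. Given a tweet, you decide whether its sentiment is positive or negative.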

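Specifically, you will learn to:

Extract features for logistic regression from a piece of text
Implement logistic regression from scratch
Apply logistic regression to an NLP task
Test the logistic regression model
Perform error analysis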

We will work with a dataset of tweets. By the end, your model should be able to reach 99% accuracy.

Import functions and data

# run this cell to import nltk
import nltk
from os import getcwd

Imported functions

Download the data needed for this lab; see the documentation for the twitter_samples dataset.

twitter_samples: run the following command to download the data:

nltk.download('twitter_samples')

stopwords: run the following command to download the stop word list:

nltk.download('stopwords')

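Import the helper functions from utils.py:

process_tweet(): cleans the text, splits it into words, removes stop words, and applies stemming
build_freqs(): counts how many times each word in the corpus appears with label "1" or "0" (positive or negative sentiment), then builds the "freqs" dictionary, whose keys are (word, label) tuples and whose values are the occurrence counts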

# add folder, tmp2, from our local workspace containing pre-downloaded corpora files to nltk's data path
# this enables importing of these files without downloading it again when we refresh our workspace

filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)

import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples 

from utils import process_tweet, build_freqs

Prepare the data

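twitter_samples contains a set of 5,000 positive tweets, a set of 5,000 negative tweets, and a full set of 10,000 tweets
- Using all three sets directly would introduce duplicate tweets
- So only the positive and negative sets are used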

# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

Train/test split: 20% for the test set, 80% for the training set

# split the data into two pieces, one for training and one for testing (validation set) 
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]

train_x = train_pos + train_neg 
test_x = test_pos + test_neg

Build numpy arrays for the positive and negative labels

# combine positive and negative labels
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

# Print the shape train and test sets
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))

train_y.shape = (8000, 1)
test_y.shape = (2000, 1)
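The split and label construction above can be wrapped into one helper. The following is only a sketch: the function name `split_and_label`, its `test_fraction` parameter, and the toy tweets are assumptions introduced for illustration and are not part of the original notebook.

```python
import numpy as np

def split_and_label(pos, neg, test_fraction=0.2):
    """Split two class-wise lists into train/test sets and build label columns."""
    # Number of examples of each class held out for testing.
    n_test_pos = round(len(pos) * test_fraction)
    n_test_neg = round(len(neg) * test_fraction)
    cut_pos = len(pos) - n_test_pos
    cut_neg = len(neg) - n_test_neg

    # Positive examples come first, then negative ones, in both sets.
    train_x = pos[:cut_pos] + neg[:cut_neg]
    test_x = pos[cut_pos:] + neg[cut_neg:]

    # Labels are column vectors: 1 for positive, 0 for negative, in the same order as the texts.
    train_y = np.concatenate([np.ones((cut_pos, 1)), np.zeros((cut_neg, 1))], axis=0)
    test_y = np.concatenate([np.ones((n_test_pos, 1)), np.zeros((n_test_neg, 1))], axis=0)
    return train_x, test_x, train_y, test_y

# Toy data standing in for the real tweet lists.
toy_pos = [f"good {i}" for i in range(10)]
toy_neg = [f"bad {i}" for i in range(10)]
toy_train_x, toy_test_x, toy_train_y, toy_test_y = split_and_label(toy_pos, toy_neg)
print(toy_train_y.shape, toy_test_y.shape)
```

Because the texts and the labels are both built as "positives first, then negatives", row i of the label array always belongs to item i of the text list.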

Use the build_freqs() function to build the frequency dictionary.
- Reading the build_freqs() code in utils.py is strongly recommended to understand how it works

    for y,tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
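To make the loop above runnable on its own, here is a minimal sketch of a function built around it. The names `build_freqs_sketch` and `simple_tokenize` are hypothetical; the whitespace tokenizer is a stand-in for process_tweet(), and flattening the label array to a list of floats is an assumption about how labels become dictionary keys.

```python
import numpy as np

def simple_tokenize(tweet):
    # Stand-in for process_tweet(): lowercase and split on whitespace only.
    return tweet.lower().split()

def build_freqs_sketch(tweets, ys, tokenize=simple_tokenize):
    """Count how often each (word, label) pair occurs in the corpus."""
    # Flatten an (m, 1) label array into a plain list of floats.
    labels = np.ravel(ys).tolist()
    freqs = {}
    for y, tweet in zip(labels, tweets):
        for word in tokenize(tweet):
            pair = (word, y)
            # dict.get with a default replaces the explicit if/else branch.
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

# Toy corpus: one positive and one negative tweet.
toy_tweets = ["happy happy day", "sad day"]
toy_labels = np.array([[1.0], [0.0]])
toy_freqs = build_freqs_sketch(toy_tweets, toy_labels)
print(toy_freqs)
```

Note that the same word is counted separately per label: "day" appears once under 1.0 and once under 0.0.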

# create frequency dictionary
freqs = build_freqs(train_x, train_y)

# check the output
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))

type(freqs) = <class 'dict'>
len(freqs) = 11346
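Once the dictionary exists, a word's positive and negative counts can be read back with tuple keys. The sketch below assumes the labels are stored as 1.0 and 0.0; the helper `word_counts` and the toy counts are made up for illustration.

```python
def word_counts(freqs, word):
    """Return (positive_count, negative_count) for an already processed word."""
    # A (word, label) pair that never occurred is treated as a count of 0.
    return freqs.get((word, 1.0), 0), freqs.get((word, 0.0), 0)

# Toy dictionary in the same (word, label) -> count layout.
toy_freqs = {("happi", 1.0): 3, ("happi", 0.0): 1, ("sad", 0.0): 2}
for w in ["happi", "sad", "unseen"]:
    pos, neg = word_counts(toy_freqs, w)
    print(f"{w}: positive={pos}, negative={neg}")
```

Using `dict.get` with a default avoids a KeyError for words that never appeared with a given label.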

Process tweets

Use the process_tweet() function to tokenize a tweet into individual words, remove stop words, and apply stemming

# test the function below
print('This is an example of a positive tweet: \n', train_x[0])
print('\nThis is an example of the processed version of the tweet: \n', process_tweet(train_x[0]))

This is an example of a positive tweet:
#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week
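For intuition about what such preprocessing involves, the following dependency-free sketch walks through the same kinds of steps: cleaning, tokenizing, stop word removal, and stemming. It is not the process_tweet() from utils.py. The stop word set is a tiny hand-picked assumption, `crude_stem` is a naive suffix stripper standing in for a real stemmer, and punctuation-only tokens such as emoticons are simply discarded here.

```python
import re
import string

# Tiny illustrative stop word set (assumption); a real list would come from a stop word corpus.
TOY_STOPWORDS = {"a", "an", "the", "is", "are", "in", "for", "my", "this", "of", "to", "and", "being"}

def crude_stem(word):
    # Naive suffix stripping; a stand-in for a real stemmer such as Porter's.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def process_tweet_sketch(tweet):
    """Clean, tokenize, drop handles and stop words, then stem (simplified)."""
    tweet = re.sub(r"^RT\s+", "", tweet)          # drop an old-style retweet marker
    tweet = re.sub(r"https?://\S+", "", tweet)    # drop hyperlinks
    tweet = tweet.replace("#", "")                # keep hashtag text, drop the '#'
    cleaned = []
    for token in tweet.lower().split():
        if token.startswith("@"):                 # skip user handles
            continue
        token = token.strip(string.punctuation)   # trim surrounding punctuation
        if token and token not in TOY_STOPWORDS:
            cleaned.append(crude_stem(token))
    return cleaned

sample = "RT @user Loving the #NLP courses this week! https://example.com"
processed = process_tweet_sketch(sample)
print(processed)
```

The crude stemmer produces non-words such as "lov"; that is expected, since stemming only needs to map related word forms to a shared key for counting.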

Source: https://blog.csdn.net/weixin_43093481/article/details/116356879
