Data Science in Practice Series: Twitter Sentiment Analysis Project, Part 1

2024-10-18 Source: 威能网


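The Twitter Sentiment Analysis Project was originally written by Ricky Kim. The series comprises 11 articles, published on the author's pages on LinkedIn, Towards Data Science, and elsewhere. The articles in this series were translated, compiled, and proofread by Liu Yan (刘岩) and Chaolemen (朝乐门) of Renmin University of China.

This article is the first in the series. It describes the dataset and walks through the cleaning steps.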

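Choosing the dataset

The dataset is "Sentiment140", from Stanford University.

Download: http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip

Each record contains the following fields:

Tweet sentiment class: 0 = negative, 2 = neutral, 4 = positive
Tweet ID, e.g. 2087
Tweet date, e.g. Sat May 16 23:58:44 UTC 2009
Query string, e.g. lyx; if there is no query, the value is NO_QUERY
Tweet author, e.g. robotickilldozr
Tweet text, e.g. Lyx is cool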
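To make the record layout concrete, here is a minimal sketch that parses one row with Python 3's standard csv module. The sample row is assembled from the example values listed above (its sentiment value of 4 is made up), and the helper name parse_rows is hypothetical. It is an illustration only, not part of the original project.

```python
import csv
import io

# Column names for the six fields described above.
COLUMNS = ["sentiment", "id", "date", "query_string", "user", "text"]
LABELS = {0: "negative", 2: "neutral", 4: "positive"}

# Illustrative row built from the example values in the field list;
# the sentiment value 4 is an assumption made for this sketch.
sample = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'


def parse_rows(handle):
    """Yield one dict per CSV row, with the sentiment code mapped to a label."""
    for raw in csv.reader(handle):
        row = dict(zip(COLUMNS, raw))
        row["sentiment"] = int(row["sentiment"])
        row["label"] = LABELS[row["sentiment"]]
        yield row


rows = list(parse_rows(io.StringIO(sample)))
print(rows[0]["label"], "|", rows[0]["text"])
```

The same function accepts an open file handle, so it can be pointed at the downloaded CSV instead of the in-memory sample.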

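The data is loaded into Python below and then inspected. The original author used Python 2.7, so the snippets in this article keep Python 2 syntax such as the print statement and xrange.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    cols = ['sentiment','id','date','query_string','user','text']
    # Replace with the path and file name on your own machine
    df = pd.read_csv("./trainingandtestdata/training.1600000.processed.noemoticon.csv", header=None, names=cols)
    df.head()

Count the values of the sentiment class:

    df.sentiment.value_counts()

The dataset has 1.6 million rows, but the sentiment column contains only two classes, positive and negative, at 50% each. There are no neutral labels.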

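Next, drop the columns that are not needed:

    df.drop(['id','date','query_string','user'],axis=1,inplace=True)

View the negative tweets:

    df[df.sentiment == 0].head(10)

View the positive tweets:

    df[df.sentiment == 4].head(10)

Inspecting the data this way shows that the negative rows occupy indices 0 to 799999 and the positive rows start at index 800000.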

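Initial data preparation

Compute the length of each value in the text column as a sanity check:

    df['pre_clean_len'] = [len(t) for t in df.text]

Build a first version of the data dictionary, which will need updating once the data has been processed. Note that the description lists 1 for the positive class, while the labels in the data at this stage are still 0 and 4.

    from pprint import pprint
    data_dict = {
        'sentiment':{
            'type':df.sentiment.dtype,
            'description':'sentiment class - 0:negative, 1:positive'
        },
        'text':{
            'type':df.text.dtype,
            'description':'tweet text'
        },
        'pre_clean_len':{
            'type':df.pre_clean_len.dtype,
            'description':'Length of the tweet before cleaning'
        },
        'dataset_shape':df.shape
    }
    pprint(data_dict)

Draw a box plot to see the overall distribution of the text lengths:

    fig, ax = plt.subplots(figsize=(5, 5))
    plt.boxplot(df.pre_clean_len)
    plt.show()

Twitter limits a tweet to 140 characters, yet the box plot shows that some tweets exceed this limit, so we select those rows for inspection:

    df[df.pre_clean_len > 140].head(10)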
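For readers who want the numbers behind a box plot rather than the picture, the sketch below computes the quartiles and lists the lengths above the 140-character limit. It uses Python 3's standard statistics module. The lengths are made-up sample values, not figures from the dataset, and the 1.5 x IQR rule is the usual box plot convention, assumed here rather than taken from the original article.

```python
import statistics

TWEET_LIMIT = 140

# Made-up tweet lengths for illustration; the real values live in df.pre_clean_len.
lengths = [28, 44, 61, 75, 90, 103, 118, 137, 139, 152, 171]

# Quartiles: the box spans q1..q3 and the line inside the box is the median.
q1, median, q3 = statistics.quantiles(lengths, n=4)
iqr = q3 - q1

# Points beyond q3 + 1.5 * IQR are conventionally drawn as individual outliers.
upper_fence = q3 + 1.5 * iqr

# Independent of the plot: which lengths break the 140-character limit?
over_limit = [n for n in lengths if n > TWEET_LIMIT]

print("q1, median, q3:", q1, median, q3)
print("upper fence:", upper_fence)
print("over the limit:", over_limit)
```

A value can exceed 140 without being a statistical outlier, which is why the article filters on the limit directly instead of relying on the plot.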

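The following sections work through the cleaning steps one at a time and then combine them into a single cleaning function.

2.1 HTML decoding

Some tweets contain HTML-encoded entities such as &amp; and &quot;. These are decoded with the BeautifulSoup library.

    df.text[279]

    from bs4 import BeautifulSoup
    example1 = BeautifulSoup(df.text[279], 'lxml')
    print example1.get_text()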
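As a dependency-free illustration of what the decoding step does, the sketch below uses html.unescape from the Python 3 standard library on a made-up string. This is a stand-in for BeautifulSoup, not the author's method. Unlike BeautifulSoup's get_text(), it only converts entities and does not strip HTML tags.

```python
import html

# Made-up tweet text containing HTML entities.
raw = "Tom &amp; Jerry said &quot;hi&quot; &lt;3"

# Convert entities such as &amp; and &quot; back into plain characters.
decoded = html.unescape(raw)

print(decoded)
print(len(raw), "->", len(decoded))
```

The length comparison shows how encoded entities inflate the character count, which is one plausible way a stored tweet can appear longer than 140 characters.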

2.2 Handling @mentions

An @mention in a tweet usually refers to another Twitter user. It carries little value for sentiment analysis, so it is removed.

    df.text[343]

    import re
    re.sub(r'@[A-Za-z0-9]+','',df.text[343])
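The sketch below applies the same pattern to a made-up tweet in Python 3. It also shows one limitation worth knowing: Twitter handles may contain underscores, which the character class [A-Za-z0-9] does not cover, so part of such a handle is left behind. The second pattern, with an underscore added, is a hypothetical variant and not something the original article uses.

```python
import re

AUTHOR_PATTERN = r'@[A-Za-z0-9]+'       # pattern from the article
WITH_UNDERSCORE = r'@[A-Za-z0-9_]+'     # hypothetical variant, not from the article

# Made-up tweet with one handle that contains an underscore.
tweet = "@data_fan thanks for the link @bob123 !"

original_result = re.sub(AUTHOR_PATTERN, '', tweet)
variant_result = re.sub(WITH_UNDERSCORE, '', tweet)

print(repr(original_result))  # '_fan thanks for the link  !'
print(repr(variant_result))   # ' thanks for the link  !'
```

In the full pipeline the leftover fragment survives as an ordinary token, so whether the variant is worth adopting depends on how much such residue matters downstream.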

2.3 Handling URLs

Like @mentions, URLs contribute little to sentiment analysis, so they are removed.

    df.text[0]

    re.sub('https?://[A-Za-z0-9./]+','',df.text[0])
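To show what this pattern does and does not catch, the following Python 3 sketch runs it on a made-up tweet. The pattern stops at the first character outside letters, digits, dots and slashes, and it ignores links that start with www. The broader pattern is a hypothetical alternative added for comparison, not part of the original article.

```python
import re

AUTHOR_URL = r'https?://[A-Za-z0-9./]+'     # pattern from the article
BROADER_URL = r'https?://\S+|www\.\S+'      # hypothetical alternative

# Made-up tweet containing a URL with a hyphen and query string, plus a www link.
tweet = "new post http://example.com/a-b?x=1 and www.example.org too"

original_result = re.sub(AUTHOR_URL, '', tweet)
broader_result = re.sub(BROADER_URL, '', tweet)

print(repr(original_result))  # 'new post -b?x=1 and www.example.org too'
print(repr(broader_result))   # 'new post  and  too'
```

The later letters-only step turns the remaining punctuation into spaces, but the leftover fragments such as "b" and "x" still end up as tokens.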

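2.4 Handling UTF-8 BOM content

Some tweets contain byte sequences that display as \xef\xbf\xbd. Strictly speaking, these bytes are the UTF-8 encoding of the replacement character U+FFFD; the byte order mark itself is \xef\xbb\xbf. After decoding with utf-8-sig, this character is replaced with "?".

    df.text[226]

    testing = df.text[226].decode("utf-8-sig")
    testing

    testing.replace(u"\ufffd", "?")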
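The same step can be sketched in Python 3, where raw data is a bytes object and decoding returns a str. The byte strings below are made up for illustration. The sketch also shows what the utf-8-sig codec adds over plain utf-8: it drops a leading byte order mark if one is present.

```python
# Made-up bytes containing the sequence \xef\xbf\xbd (U+FFFD once decoded).
raw_bytes = b"caf\xef\xbf\xbd latte"

text = raw_bytes.decode("utf-8-sig")
cleaned = text.replace("\ufffd", "?")
print(cleaned)  # caf? latte

# A real byte order mark is \xef\xbb\xbf; utf-8-sig strips it, plain utf-8 keeps it.
bom_bytes = b"\xef\xbb\xbfhello"
without_bom = bom_bytes.decode("utf-8-sig")
with_bom = bom_bytes.decode("utf-8")
print(repr(without_bom), repr(with_bom))
```

In Python 3 a str has no decode method, so the article's call on df.text[226] would need the raw bytes instead of an already decoded string.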

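2.5 Handling hashtags

The text of a hashtag carries some information, so it cannot be removed wholesale. Only the symbol and other non-letter characters are removed, by replacing each of them with a space.

    df.text[175]

    re.sub("[^a-zA-Z]", " ", df.text[175])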
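A small Python 3 sketch on a made-up tweet shows the effect: the word inside the hashtag is kept, while the # sign, digits and punctuation each become a single space. That leaves runs of spaces behind, which is why the final cleaning function re-joins the tokens afterwards.

```python
import re

# Made-up tweet with a hashtag, digits and punctuation.
tweet = "Loving #DataScience 100% :)"

# Replace every character that is not an ASCII letter with one space.
letters_only = re.sub("[^a-zA-Z]", " ", tweet)
print(repr(letters_only))

# Collapse the resulting runs of spaces.
collapsed = " ".join(letters_only.split())
print(collapsed)  # Loving DataScience
```

Because each removed character is replaced one for one, the string keeps its length until the spaces are collapsed.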

Cleaning the data

The cleaning steps above are combined into one general-purpose cleaning function.

    from nltk.tokenize import WordPunctTokenizer
    tok = WordPunctTokenizer()
    pat1 = r'@[A-Za-z0-9]+'
    pat2 = r'https?://[A-Za-z0-9./]+'
    combined_pat = r'|'.join((pat1, pat2))

    def tweet_cleaner(text):
        soup = BeautifulSoup(text, 'lxml')
        souped = soup.get_text()
        stripped = re.sub(combined_pat, '', souped)
        try:
            clean = stripped.decode("utf-8-sig").replace(u"\ufffd", "?")
        except:
            clean = stripped
        letters_only = re.sub("[^a-zA-Z]", " ", clean)
        lower_case = letters_only.lower()
        # The letters_only step two lines above creates unnecessary white space,
        # so tokenize and join the words back together to remove it
        words = tok.tokenize(lower_case)
        return (" ".join(words)).strip()

Test the function on the first 100 tweets:

    testing = df.text[:100]
    test_result = []
    for t in testing:
        test_result.append(tweet_cleaner(t))
    test_result
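For readers on Python 3 without BeautifulSoup or NLTK installed, the following is a rough standard-library approximation of the function above. It is a sketch under stated assumptions, not the author's implementation: html.unescape stands in for BeautifulSoup's get_text() (so HTML tags are not stripped), str.split() stands in for WordPunctTokenizer (the two should agree here because only letters and spaces remain at that point), and the name tweet_cleaner_py3 is hypothetical.

```python
import html
import re

MENTION = r'@[A-Za-z0-9]+'
URL = r'https?://[A-Za-z0-9./]+'
COMBINED = re.compile('|'.join((MENTION, URL)))


def tweet_cleaner_py3(text):
    """Approximate the article's tweet_cleaner using only the standard library."""
    decoded = html.unescape(text)                   # stand-in for BeautifulSoup
    stripped = COMBINED.sub('', decoded)            # drop @mentions and URLs
    replaced = stripped.replace('\ufffd', '?')      # replacement character -> ?
    letters_only = re.sub('[^a-zA-Z]', ' ', replaced)
    # split() with no arguments collapses the runs of spaces created above
    return ' '.join(letters_only.lower().split())


# Made-up tweets for illustration.
example = tweet_cleaner_py3("@bob123 I &lt;3 #NLP! see http://example.com/x")
empty = tweet_cleaner_py3("@bob123 http://example.com")

print(repr(example))  # 'i nlp see'
print(repr(empty))    # ''
```

The second call shows that a tweet made up only of a mention and a link cleans down to an empty string, so empty rows are something to expect in the cleaned output.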

The dataset is split into four parts for cleaning. The loop below processes the first range, nums[0] to nums[1]; the remaining three ranges follow the same pattern.

    nums = [0,400000,800000,1200000,1600000]
    print "Cleaning and parsing the tweets...\n"
    clean_tweet_texts = []
    for i in xrange(nums[0],nums[1]):
        if( (i+1)%10000 == 0 ):
            print "Tweets %d of %d has been processed" % ( i+1, nums[1] )
        clean_tweet_texts.append(tweet_cleaner(df['text'][i]))
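One way to cover all four ranges without repeating the loop by hand is to pair each boundary with the next one. The Python 3 sketch below does this on a tiny made-up list. The function name clean_in_chunks and its parameters are hypothetical, and str.lower stands in for the real cleaning function.

```python
def clean_in_chunks(texts, boundaries, cleaner, report_every=10000):
    """Apply cleaner to every text, walking the index ranges given by boundaries."""
    cleaned = []
    # zip pairs each boundary with its successor: (0, 2), (2, 4), ...
    for start, stop in zip(boundaries, boundaries[1:]):
        for i in range(start, stop):
            if (i + 1) % report_every == 0:
                print("Tweets %d of %d have been processed" % (i + 1, stop))
            cleaned.append(cleaner(texts[i]))
    return cleaned


# Tiny made-up data set; the article's boundaries are [0, 400000, ..., 1600000].
texts = ["Tweet %d" % i for i in range(8)]
result = clean_in_chunks(texts, [0, 2, 4, 6, 8], str.lower, report_every=2)
print(result)
```

Keeping explicit boundaries makes it possible to run one range at a time and resume after an interruption, which seems to be the motivation for splitting the work in the first place.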

Save the cleaned data to a CSV file:

    clean_df = pd.DataFrame(clean_tweet_texts,columns=['text'])
    clean_df['target'] = df.sentiment
    clean_df.head()

Write the file and read it back to check the result:

    clean_df.to_csv('clean_tweet.csv',encoding='utf-8')
    csv = 'clean_tweet.csv'
    my_df = pd.read_csv(csv,index_col=0)
    my_df.head()

This concludes the first article in the Twitter sentiment analysis series.

Original article:
https://towardsdatascience.com/another-twitter-sentiment-analysis-bb5b01ebad90

GitHub source:
https://github.com/tthustla/twitter_sentiment_analysis_part1/blob/master/Capstone_part2.ipynb

Article download:
https://mp.weixin.qq.com/s/eAcxizfwxRRHTLCbP7eb-g
