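A complete course project
1. Choose a topic that interests you.
2. Scrape related data from the web.
3. Analyze the text and generate a word cloud.
4. Explain the results of the text analysis.
5. Write a complete blog post that includes the source code, the scraped data, and the analysis results, so that it forms a presentable piece of work.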
1. Choose a topic that interests you
The topic I chose is the short reviews of 《二手時間》 ("Secondhand Time") on Douban Books. The page scraped is: https://book.douban.com/subject/26704403/comments/

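2. Fetch the short reviews from the page and write them to subjects.txt. The code is as follows:
from os import path
import jieba.posseg as pseg      # part-of-speech tagging, used in extract_words()
import matplotlib.pyplot as plt  # displays the word cloud
import numpy as np
import requests
from bs4 import BeautifulSoup
from PIL import Image
from wordcloud import WordCloud

def imread(image_path):
    # scipy.misc.imread was removed in SciPy 1.2, so the image is loaded with Pillow instead.
    return np.array(Image.open(image_path))

def fetch_douban_comments():
    r = requests.get('https://book.douban.com/subject/26704403/comments/')
    soup = BeautifulSoup(r.text, 'lxml')
    # Each short review sits in a <p class="comment-content"> element.
    pattern = soup.find_all('p', 'comment-content')
    with open('subjects.txt', 'w', encoding='utf-8') as f:
        for s in pattern:
            # get_text() also works when the <p> has nested tags, where .string is None.
            f.write(s.get_text().strip() + '\n')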
Result: [figure showing the generated subjects.txt]
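To make the extraction step easier to check without hitting the network, here is a minimal sketch that pulls the text out of every <p class="comment-content"> element in an HTML string. It swaps BeautifulSoup for the standard-library html.parser so that it runs offline; the names CommentParser, extract_comments, and SAMPLE_HTML are made up for this illustration, and the sample markup is hypothetical, so the real Douban page may be structured differently.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "img", "hr", "input"}


class CommentParser(HTMLParser):
    """Collect the text of every <p class="comment-content"> element."""

    def __init__(self):
        super().__init__()
        self.comments = []  # one string per review
        self._depth = 0     # > 0 while inside a review paragraph
        self._parts = []    # text fragments of the current review

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0 and tag == "p" and "comment-content" in classes:
            self._depth = 1
            self._parts = []
        elif self._depth > 0 and tag not in VOID_TAGS:
            self._depth += 1  # nested tag such as <span>

    def handle_endtag(self, tag):
        if self._depth == 0 or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:  # the review paragraph just closed
            self.comments.append("".join(self._parts).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


def extract_comments(html):
    """Return the review texts found in an HTML string."""
    parser = CommentParser()
    parser.feed(html)
    parser.close()
    return parser.comments


# Hypothetical sample markup, not copied from the real page.
SAMPLE_HTML = """
<ul>
  <li><p class="comment-content"><span class="short">A heavy but honest book.</span></p></li>
  <li><p class="comment-content">Oral history at its best.</p></li>
  <li><p class="other">not a review</p></li>
</ul>
"""

comments = extract_comments(SAMPLE_HTML)
print(comments)
```

The first sample review is wrapped in a nested <span>. That is the case where BeautifulSoup's .string can return None, so collecting all text inside the paragraph is the safer choice.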

3. Analyze the text and generate the word cloud. The code is as follows:
def extract_words():
    with open('subjects.txt', 'r', encoding='utf-8') as f:
        comment_subjects = f.readlines()
    with open('stopwords.txt', encoding='utf-8') as f:
        stop_words = set(line.strip() for line in f)
    commentlist = []
    for subject in comment_subjects:
        if subject.isspace():
            continue
        word_list = pseg.cut(subject)  # word segmentation with part-of-speech tags
        for word, flag in word_list:
            if word not in stop_words and flag == 'n':  # keep nouns only
                commentlist.append(word)
Generate the word cloud (this code continues the body of extract_words()):
    d = path.dirname(__file__)
    mask_image = imread(path.join(d, "apple.jpg"))
    content = ' '.join(commentlist)
    wordcloud = WordCloud(font_path='simhei.ttf', background_color="white", mask=mask_image, max_words=60).generate(content)
    # Display the generated image:
    plt.imshow(wordcloud)
    plt.axis("off")
    wordcloud.to_file('wordcloud.jpg')
    plt.show()

if __name__ == "__main__":
    fetch_douban_comments()
    extract_words()
The generated word cloud: [figure showing the word cloud]
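When explaining the result, it helps to look at the counts behind the picture, since the most frequent nouns are drawn largest. The following is a minimal sketch of the noun filter used in extract_words() followed by a frequency count. The (word, flag) pairs are invented English stand-ins in the shape that pseg.cut() yields, and keep_nouns is a hypothetical helper name, so the numbers say nothing about the real reviews.

```python
from collections import Counter

# Hypothetical (word, flag) pairs in the shape produced by pseg.cut();
# 'n' marks a noun, 'v' a verb, 'a' an adjective.
tagged_words = [
    ("history", "n"), ("read", "v"), ("people", "n"), ("history", "n"),
    ("heavy", "a"), ("book", "n"), ("people", "n"), ("history", "n"),
]
stop_words = {"book"}  # illustrative stop-word list


def keep_nouns(pairs, stop_words):
    """Return the nouns that are not stop words, in their original order."""
    return [word for word, flag in pairs if flag == "n" and word not in stop_words]


nouns = keep_nouns(tagged_words, stop_words)
top_words = Counter(nouns).most_common(2)
print(top_words)  # [('history', 3), ('people', 2)]
```

Running the same Counter over commentlist would list the words that dominate the image. As far as I know, WordCloud.generate_from_frequencies() accepts such a word-to-count mapping directly, which avoids joining the words into one string first.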
