
A Complete Course Project

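1. Choose a topic that interests you.

2. Crawl related data from the web.

3. Analyze the text and generate a word cloud.

4. Explain the results of the text analysis.

5. Write a complete blog post that includes the source code, the crawled data, and the analysis results, so that the whole forms a presentable piece of work.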

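1. Choose a topic that interests you

The topic I chose is the short reviews of 《二手時間》 (Secondhand Time) on Douban Books. The page I crawled is: https://book.douban.com/subject/26704403/comments/

2. Fetch the short reviews from the page and write them to the file subjects.txt. The code is as follows: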

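from os import path
import jieba.posseg as pseg      # part-of-speech tagging, used in extract_words()
import matplotlib.pyplot as plt  # used to display the word cloud
import requests
# scipy.misc.imread was removed in SciPy 1.2; matplotlib's imread is used instead
from matplotlib.pyplot import imread
from wordcloud import WordCloud
from bs4 import BeautifulSoup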

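def fetch_douban_comments():
    # Douban may reject requests that lack a browser-like User-Agent;
    # the header value below is an assumption, adjust it as needed.
    headers = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get('https://book.douban.com/subject/26704403/comments/', headers=headers)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, 'lxml')
    comments = soup.find_all('p', 'comment-content')
    with open('subjects.txt', 'w', encoding='utf-8') as f:
        for s in comments:
            # get_text() still works when the <p> has child tags (s.string would be None);
            # one review is written per line so readlines() can split them later
            f.write(s.get_text(strip=True) + '\n')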

The result looks like this:
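Because extract_words() later reads subjects.txt line by line, it helps if every review occupies exactly one line. The following is a minimal sketch of that clean-up step using only the standard library. The helper names normalize_comment and to_lines and the sample strings are hypothetical and not part of the original script.

```python
import re


def normalize_comment(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r'\s+', ' ', text).strip()


def to_lines(comments):
    """Return one cleaned comment per line, skipping comments that end up empty."""
    cleaned = (normalize_comment(c) for c in comments)
    return ''.join(c + '\n' for c in cleaned if c)


# Hypothetical sample strings standing in for the scraped review text.
sample = ['  A review that\nspans two lines. ', '   ', 'Second review']
print(to_lines(sample), end='')
```

The string returned by to_lines could be passed to f.write() in place of writing each review separately.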

3. Analyze the text and generate the word cloud. The code is as follows:

def extract_words():
    with open('subjects.txt', 'r', encoding='utf-8') as f:
        comment_subjects = f.readlines()

    # Load the stop-word list, one word per line
    with open('stopwords.txt', encoding='utf-8') as f:
        stop_words = set(line.strip() for line in f)

    commentlist = []
    for subject in comment_subjects:
        if subject.isspace():
            continue
        word_list = pseg.cut(subject)  # segment into (word, part-of-speech) pairs
        for word, flag in word_list:
            if word not in stop_words and flag == 'n':  # keep nouns only
                commentlist.append(word)
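The filtering step above can be tried in isolation, without running jieba. In this minimal sketch the (word, flag) pairs are written by hand to stand in for the output of pseg.cut(), so the tags are illustrative rather than produced by the library. The function name filter_nouns is hypothetical.

```python
def filter_nouns(tagged_words, stop_words):
    """Keep words tagged as common nouns ('n') that are not stop words."""
    return [word for word, flag in tagged_words
            if flag == 'n' and word not in stop_words]


# Hand-written pairs standing in for pseg.cut() output (illustrative tags).
tagged = [('时间', 'n'), ('的', 'uj'), ('历史', 'n'), ('读', 'v'), ('东西', 'n')]
stop_words = {'东西'}
print(filter_nouns(tagged, stop_words))
```

Note that the test flag == 'n' matches only the plain noun tag. As far as I know, jieba gives person and place names their own tags (such as nr and ns), so those are left out by this filter.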

Generate the word cloud:

    # (continues the body of extract_words)
    d = path.dirname(__file__)
    mask_image = imread(path.join(d, "apple.jpg"))  # image whose shape masks the cloud
    content = ' '.join(commentlist)
    wordcloud = WordCloud(font_path='simhei.ttf', background_color="white", mask=mask_image, max_words=60).generate(content)
    # Display the generated image:
    plt.imshow(wordcloud)
    plt.axis("off")
    wordcloud.to_file('wordcloud.jpg')
    plt.show()


if __name__ == "__main__":
    fetch_douban_comments()
    extract_words()

The generated word cloud:
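To support step 4 of the assignment (explaining the results), it can help to list the raw counts behind the largest words in the cloud. This is a minimal sketch using collections.Counter. The function name top_words is hypothetical, and the sample list is made-up data standing in for commentlist, not actual results from the crawl.

```python
from collections import Counter


def top_words(words, n=10):
    """Return the n most frequent words with their counts, most frequent first."""
    return Counter(words).most_common(n)


# Made-up word list standing in for commentlist.
sample_words = ['历史', '时间', '历史', '故事', '时间', '历史']
for word, count in top_words(sample_words, n=3):
    print(word, count)
```

Calling top_words(commentlist) at the end of extract_words() would print the nouns that dominate the reviews, which gives concrete numbers to cite when describing the cloud.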

Posted on 2017-11-02 16:44 by 32-劉振威
