Writing a Simple Novel Crawler with Python requests + lxml

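I saw on the timetable that we will be studying Python next semester, which surprised me a little; I did not expect a computer science program to schedule this course. Since it is scheduled, though, I have to learn it. It took me about a week just to get through the basic syntax. Python really is much more concise than other languages, although not being able to end lines with semicolons feels awkward. Enough chatter: below is a simple crawler I wrote after learning the syntax, with a good dose of CSDN-oriented programming. I am writing it up as a keepsake, and it should be fun to look back on later.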

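File download:

Lanzou Cloud (藍奏云)

Usage:

Unzip the archive, then run 下載器.exe (the downloader).

I won't paste the UI code. There is not much to it, and you can piece it together from the documentation. Instead I will write down the approach and implementation of the core methods. Most of the variables belong to the UI (tkinter); here are the few important ones.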

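import threading
import requests
from lxml import etree

thread_end = False  # whether the user has asked to stop the crawl
thread_num = 3  # number of crawler threads
is_runing = False  # whether a crawl is currently in progress
lock = threading.Lock()  # thread lock that keeps threads from crawling the same chapter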

class Book:
    name = None
    words = None
    author = None
    url = None

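Thread method

Each thread keeps taking a chapter link from the set and calls the download function on it, until every chapter has been crawled or the user interrupts the crawl.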

def myThread():
    global catalogues_url, main_text, is_runing
    while True:
        if thread_end:
            main_text.insert(END, f"{threading.current_thread().name} stopped because the user cancelled the crawl.\n")
            main_text.yview_moveto(1)
            is_runing = False
            return
        # Check and pop under the same lock, so two threads cannot race for
        # the last link and hit a KeyError on an empty set.
        with lock:
            if not catalogues_url:
                break
            url = catalogues_url.pop()
        down_catalogue(url)
    main_text.insert(END, f"Crawl finished, {threading.current_thread().name} stopped.\n")
    main_text.yview_moveto(1)
    is_runing = False
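The post does not show how the worker threads are started, so here is a minimal, UI-free sketch of the same pattern: a shared set of links, a lock around every access to it, and a stop flag. The names `crawl_all`, `fetch`, `worker_count` and `stop` are hypothetical and not taken from the original program, and `threading.Event` stands in for the `thread_end` boolean.

```python
import threading


def crawl_all(urls, fetch, worker_count=3, stop=None):
    """Hypothetical helper: drain `urls` with several threads, each link once."""
    pending = set(urls)                 # shared pool of chapter links
    lock = threading.Lock()             # guards every access to `pending`
    stop = stop or threading.Event()    # set() it to ask the workers to quit

    def worker():
        while not stop.is_set():
            with lock:
                if not pending:
                    return              # nothing left to download
                url = pending.pop()
            fetch(url)                  # do the slow work outside the lock

    threads = [
        threading.Thread(target=worker, name=f"worker-{i}")
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                        # wait until every worker has returned


if __name__ == "__main__":
    done = []
    crawl_all([f"chapter-{i}" for i in range(10)], fetch=done.append)
    print(len(done), "chapters handled")
```

Because the emptiness check and the `pop()` happen under one lock, no link can be handed to two threads and no thread can pop from an empty set.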

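Download method

The url here is a single chapter link that was prepared earlier; the links are kept in a set. Once the page has been requested, it is parsed with lxml and the text is saved directly.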

def down_catalogue(url):
    global main_label_str, main_progressbar, main_progressbar_value, main_win, down_local_path_str
    session = requests.session()
    # session.proxies = {"https": "106.14.255.124:80", "http": "58.246.58.150:9002", }
    # Assigning session.keep_alive appears to be ignored by requests, so ask
    # the server to close the connection with a header instead.
    session.headers["Connection"] = "close"
    book_content_html = session.get(url)
    # Parse the page once and reuse the tree for all three queries.
    tree = etree.HTML(book_content_html.text)
    # xpath() returns a list of text nodes, so join them into plain strings.
    book_name = "".join(tree.xpath("/html/body/div[2]/div[3]/div[2]/a[3]/text()")).strip()
    book_catalogue_name = "".join(tree.xpath(
        "/html/body/div[2]/div[3]/div[3]/div/div[1]/div[2]/div[2]/text()")).strip()
    paragraphs = tree.xpath(
        "/html/body/div[2]/div[3]/div[3]/div/div[1]/div[5]/p/text()")
    book_content = "".join(line + "\n" for line in paragraphs)
    # Append the chapter text to its file; UTF-8 avoids encoding errors on Windows.
    with open(f"{down_local_path_str}\\{book_name}{book_catalogue_name}.txt", "a+", encoding="utf-8") as file:
        file.write(book_content)
    # Several threads update the counter, so guard the increment with the lock.
    with lock:
        main_progressbar_value += 1
        main_progressbar['value'] = main_progressbar_value
    main_win.update()
    print(book_content)
    main_text.insert(END, f"{threading.current_thread().name} finished downloading {book_name}{book_catalogue_name}\n")
    main_text.yview_moveto(1)
    session.close()
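One detail of the download step is easy to miss: lxml's `xpath()` returns a list of text nodes rather than a single string, so the results need to be joined before they are used in a file name or written to disk. The sketch below shows this on a made-up page; `SAMPLE_HTML`, its `id` attributes and `parse_chapter` are assumptions for illustration and do not match the real site's markup.

```python
from lxml import etree

# Hypothetical page standing in for a real chapter page.
SAMPLE_HTML = """
<html><body>
  <h1 id="book">Demo Book</h1>
  <h2 id="chapter">Chapter 1</h2>
  <div id="content"><p>First line.</p><p>Second line.</p></div>
</body></html>
"""


def parse_chapter(html):
    """Return (book name, chapter title, chapter text) from one page."""
    tree = etree.HTML(html)  # parse once, query several times
    # Each xpath() call yields a list of strings, even for a single match.
    book = "".join(tree.xpath('//h1[@id="book"]/text()')).strip()
    chapter = "".join(tree.xpath('//h2[@id="chapter"]/text()')).strip()
    paragraphs = tree.xpath('//div[@id="content"]/p/text()')
    return book, chapter, "\n".join(paragraphs)


if __name__ == "__main__":
    book, chapter, text = parse_chapter(SAMPLE_HTML)
    print(f"{book} {chapter}.txt")
    print(text)
```

Attribute-based expressions like these tend to survive small layout changes better than absolute paths such as `/html/body/div[2]/...`, though which attributes exist depends entirely on the target site.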

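Those are the two main methods. The rest is a pile of odds and ends; optimize or polish it however you see fit.

A main function that runs to more than a hundred lines... you could fairly call that xx.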

posted @ 2023-02-07 16:19 *RavE
