Debugging a crawl with the Scrapy shell
Reference: Scrapy shell — Scrapy 2.6.2 documentation
Use the scrapy.shell.inspect_response function to debug in the middle of a crawl:
Example: invoking the shell from inside a spider
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response  # import inspect_response here
            inspect_response(response, self)  # drop into the shell
        # Rest of parsing code.
When the spider runs, it drops into a shell session like this:
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x1e16b50>
...
>>> response.url
'http://example.org'
At this point, check whether the extraction you expect actually works:
>>> response.xpath('//h1[@class="fn"]')
[]
It does not; open the response in a browser to inspect it:
>>> view(response)
True
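Before resorting to the browser, a failing selector can also be reproduced offline against a saved snippet of the page. A minimal sketch using only the standard library (Scrapy's own selectors are lxml-based and support full XPath, whereas ElementTree below supports only a subset; the HTML snippet and the "title" class are hypothetical stand-ins):

```python
import xml.etree.ElementTree as ET

# A stand-in for the page body fetched during the crawl.
html = "<html><body><h1 class='title'>Example Domain</h1></body></html>"
root = ET.fromstring(html)

# The selector that returned [] in the shell: no <h1 class="fn"> exists here.
print(root.findall(".//h1[@class='fn']"))       # []

# Inspecting the markup shows the class is actually "title".
print(root.find(".//h1[@class='title']").text)  # Example Domain
```

Comparing the selector against the markup this way quickly shows whether the class name differs from what the spider assumed, or whether the element is simply absent from the HTML Scrapy received (which view(response) will also reveal).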
Finally, press Ctrl-D (Ctrl-Z on Windows) to exit the shell and let the crawl continue:
>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...
