[Web Scraping] The Basic Approach to Python Web Scraping
Posted on 2023-05-21 09:05 Charlie_ODD
Basics
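- In one sentence: over HTTP/HTTPS, use Python's requests package (a third-party library, not part of the standard library) to simulate the requests a real web browser sends, and save the information you need from what the browser would display, in structured form, through code.
- Equivalent tools:
  - curl
  - Browser packet capture / F12
- To scrape content from any website:
  - Visit the site in a browser, open the developer tools, and find the URL, the headers, and the response structure for the information you need
  - Reproduce the previous step with a curl command
  - Write the scraper script so that it reproduces the first two steps in code
  - Once you have the result, you typically normalize the character encoding, parse the response structure (json, or regular expressions via re), and finally present and save "the information we are interested in" in structured form
- Commonly used packages: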
import requests
import re
import pandas as pd
import json
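The post-processing step described above (normalize the encoding, parse the response, keep only the fields of interest) could look roughly like the sketch below. It is an illustration only: the function names, the `list_key` argument, and the sample field names are hypothetical, and it writes CSV with the standard library where the post would likely use a pandas DataFrame.

```python
import csv
import io
import json
import re


def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw response bytes to text; undecodable bytes are replaced."""
    return raw.decode(encoding, errors="replace")


def parse_json_items(text: str, list_key: str) -> list:
    """Parse a JSON object and return the list stored under list_key.

    list_key is a hypothetical name; read the real one off the response
    shown in the browser's developer tools.
    """
    data = json.loads(text)
    return data.get(list_key, [])


def extract_with_regex(text: str, pattern: str) -> list:
    """For non-JSON bodies: return all matches of pattern
    (the captured group, if the pattern has exactly one)."""
    return re.findall(pattern, text)


def to_csv(rows: list, fields: list) -> str:
    """Keep only the requested fields and serialize the rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

For example, `to_csv(parse_json_items(decode_body(raw), "items"), ["name", "city"])` turns a raw JSON body into a two-column CSV string, dropping every other key.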
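Example: Damai concert listings
Visit the official site
Official site: https://www.damai.cn/
Find the concert search page: https://search.damai.cn/search.htm?spm=a2oeg.home.category.ditem_0.591b23e1VeeW0U&ctl=演唱会&order=1&cty=北京
Developer tools
Find the URL of the information you want
- The filtered project-list query:
- Find the curl command (in the Network panel, right-click the request and copy it as cURL):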
curl 'https://search.damai.cn/searchajax.html?keyword=&cty=%E5%8C%97%E4%BA%AC&ctl=%E6%BC%94%E5%94%B1%E4%BC%9A&sctl=&tsg=0&st=&et=&order=1&pageSize=30&currPage=2&tn=' \
-H 'authority: search.damai.cn' \
-H 'accept: application/json, text/plain, */*' \
-H 'accept-language: zh-CN,zh;q=0.9' \
-H 'bx-v: 2.5.0' \
-H 'cookie: cna=oxcgGJ2s5zUCAd6ywe2CLo4A; xlly_s=1; XSRF-TOKEN=91bc667b-b388-4801-9af5-b6bb519781ae; isg=BP__g5cvp18oPKNCPa5V2QsYjtOJ5FOGR4JodJHN8a6ToB4imbfp1-x24nBe-Cv-; tfstk=cT5NBPAaZ5Fat_7Y2BAV45v7Pu9OaBRDw0TkIkfxpb9E4kpM4svmDejURL9t8mvG.; l=fBLrt9wgNFgHDrQWBO5Bhurza77TfIOb8sPzaNbMiIEGa6OdtaiODNC_fvp6SdtjgT5VqetzEHwxcdUWSz4U-tMot9HJfRskeX9w8eM3N7AN.' \
-H 'referer: https://search.damai.cn/search.htm?spm=a2oeg.home.category.ditem_0.591b23e1KckJWg&ctl=%E6%BC%94%E5%94%B1%E4%BC%9A&order=1&cty=%E5%8C%97%E4%BA%AC' \
-H 'sec-ch-ua: "Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'sec-ch-ua-platform: "Windows"' \
-H 'sec-fetch-dest: empty' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-site: same-origin' \
-H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36' \
-H 'x-xsrf-token: 91bc667b-b388-4801-9af5-b6bb519781ae' \
--compressed
- Simulate the curl command in code: URL + headers + payload
todo
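The original leaves this step as a todo. One possible way to fill it in with requests is sketched below, under several assumptions: the function names are hypothetical, only a subset of the curl headers is sent, the referer is shortened to the bare search page, and the session-specific `cookie` and `x-xsrf-token` values are deliberately left out (copy them from your own browser session if the site turns out to require them). Whether the endpoint answers with JSON without those values has not been verified here.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://search.damai.cn/searchajax.html"


def build_params(city: str, category: str, page: int = 1, page_size: int = 30) -> dict:
    """Query parameters taken from the curl command, as plain (unencoded) values."""
    return {
        "keyword": "",
        "cty": city,
        "ctl": category,
        "sctl": "",
        "tsg": 0,
        "st": "",
        "et": "",
        "order": 1,
        "pageSize": page_size,
        "currPage": page,
        "tn": "",
    }


def build_headers() -> dict:
    """A subset of the curl headers; cookie and x-xsrf-token are omitted on purpose."""
    return {
        "accept": "application/json, text/plain, */*",
        "accept-language": "zh-CN,zh;q=0.9",
        # Assumption: the shortened referer is enough; the curl command sends the full search URL.
        "referer": "https://search.damai.cn/search.htm",
        "user-agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
        ),
    }


def build_url(params: dict) -> str:
    """Full URL with percent-encoded query, handy for comparing against the curl command."""
    return f"{SEARCH_URL}?{urlencode(params)}"


def fetch_page(city: str, category: str, page: int = 1, timeout: float = 10.0) -> dict:
    """Send the GET request and return the decoded JSON body."""
    # Imported here so the helpers above work without the third-party dependency.
    import requests

    resp = requests.get(
        SEARCH_URL,
        params=build_params(city, category, page),
        headers=build_headers(),
        timeout=timeout,
    )
    resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return resp.json()
```

Calling `build_url(build_params("北京", "演唱会", page=2))` reproduces the URL from the curl command, which is a quick check that the parameters were transcribed correctly before `fetch_page("北京", "演唱会", page=2)` is tried against the live site.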
This article is from cnblogs (博客园), author: Charlie_ODD. When reposting, please credit the original link: http://www.rzrgm.cn/chihaoyuIsnotHere/p/17418206.html