Python Web Scraping Basics

Python web scraping

bs4: parsing web pages and extracting data

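Tag: a tag together with its contents. Any tag that exists in the HTML can be accessed as soup.<tag> (for example, soup.a).
When the HTML document contains several matching tags, soup.<tag> returns the first one.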

for sibling in soup.a.next_siblings:
    print(sibling)  # iterate over the sibling nodes that follow
for sibling in soup.a.previous_siblings:
    print(sibling)  # iterate over the sibling nodes that come before
NavigableString: the text inside a tag, as a string
BeautifulSoup: the whole document
Comment: a special kind of NavigableString; its output does not include the comment markers
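
The four object types above can be seen side by side in a small sketch. The HTML snippet is a made-up sample (not from any real page), and "html.parser" is chosen here only because it ships with Python.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample document, used only for illustration.
html = """
<html><head><title>Demo</title></head>
<body>
<a class="mnav" href="/news">News</a>
<a class="mnav" href="/map"><!--Map--></a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")   # BeautifulSoup: the whole document

first_link = soup.a                         # Tag: the first <a> in the document
title_text = soup.title.string              # NavigableString: text inside <title>
comment = soup.find_all("a")[1].string      # Comment: the comment inside the second <a>

# Sibling traversal: keep only the Tag nodes that follow the first <a>.
later_tags = [s for s in first_link.next_siblings if isinstance(s, Tag)]

print(type(soup).__name__)
print(type(first_link).__name__, first_link["href"])
print(type(title_text).__name__, title_text)
print(type(comment).__name__, comment)      # printed without <!-- and -->
print(len(later_tags))
```

Because Comment is a subclass of NavigableString, checking for Comment first is the usual way to tell real text from commented-out text.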

Searching the document

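find_all(): the filter can be a string, a list, a regular expression, or a function; keyword arguments filter on attributes, and limit caps the number of results returned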
```
t_list = bs.find_all("a")  # every <a> tag in the document
```

Regular expressions: a compiled pattern is matched against tag names using its search() method

t_list = bs.find_all(re.compile(r"\d"))  # tags whose name contains a digit
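
As a rough illustration of the filter kinds mentioned above, the sketch below runs each one against a small made-up snippet. The HTML and the helper name has_href are assumptions for the example only.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<div id="u1">
  <h1>Headline</h1>
  <a class="mnav" href="/news">News</a>
  <a class="mnav" href="/map">Map</a>
  <a class="bri" href="/more">More</a>
  <b>bold</b>
</div>
"""
bs = BeautifulSoup(html, "html.parser")


def has_href(tag):
    """Function filter: keep tags that carry an href attribute."""
    return tag.has_attr("href")


by_name = bs.find_all("a")                  # string: exact tag name
by_list = bs.find_all(["a", "b"])           # list: any of the listed names
by_regex = bs.find_all(re.compile(r"\d"))   # regex: tag name contains a digit (h1)
by_func = bs.find_all(has_href)             # function: called once per tag
by_attr = bs.find_all(class_="mnav")        # keyword argument: attribute filter
limited = bs.find_all("a", limit=2)         # limit: stop after two matches

print(len(by_name), len(by_list), len(by_regex))
print(len(by_func), len(by_attr), len(limited))
```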

CSS selectors
1. bs.select('title'): find by tag name
2. bs.select('.mnav'): find by class name
3. bs.select('#u1'): find by id
4. bs.select("a[class='bri']"): find by attribute
5. bs.select('head > title'): find a direct child tag
6. bs.select('.mnav ~ .bri'): find sibling tags
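
The six selector forms listed above can be tried on a small document. This is a sketch on made-up HTML; the class names and id simply mirror the ones used in the list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """
<html><head><title>Demo</title></head>
<body>
<div id="u1">
  <a class="mnav" href="/news">News</a>
  <a class="mnav" href="/map">Map</a>
  <a class="bri" href="/more">More</a>
</div>
</body></html>
"""
bs = BeautifulSoup(html, "html.parser")

by_tag = bs.select("title")               # tag name
by_class = bs.select(".mnav")             # class name
by_id = bs.select("#u1")                  # id
by_attr = bs.select("a[class='bri']")     # attribute value
by_child = bs.select("head > title")      # direct child
by_sibling = bs.select(".mnav ~ .bri")    # later sibling of a .mnav element

# select() always returns a list, even when only one element matches.
print(by_tag[0].get_text())
print([a["href"] for a in by_class])
print(by_sibling[0].get_text())
```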

re: regular expressions for text matching

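re.search() mainly checks whether a regular expression matches anywhere in a string
1. re.findall("pattern", "string to search"): returns every match as a list
2. re.sub("a", "b", "aacbs"): replaces every a in the string with b, giving "bbcbs"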
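
The three calls above behave as follows on a short made-up string. This is only a sketch; the sample text is an assumption for illustration.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Order 66 shipped on 2021-12-29, order 7 pending"

first_number = re.search(r"\d+", text)             # first match, or None
date = re.search(r"\d{4}-\d{2}-\d{2}", text)       # a YYYY-MM-DD shaped run
all_numbers = re.findall(r"\d+", text)             # every match, as a list
replaced = re.sub("a", "b", "aacbs")               # every "a" becomes "b"
no_match = re.search(r"xyz", text)                 # None when nothing matches

print(first_number.group())
print(date.group())
print(all_numbers)
print(replaced)
print(no_match is None)
```

Since re.search() returns None on failure, it is worth checking the result before calling .group() on it.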


urllib.request, urllib.error: fetching web page data from a given URL

import urllib.request
# GET request
response=urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))

httpbin.org: a site for testing requests

urllib.parse: the URL parsing module

import urllib.request
import urllib.parse
# POST request
data=bytes(urllib.parse.urlencode({"hello":"world"}),encoding="utf-8")
response=urllib.request.urlopen("http://httpbin.org/post",data=data)
print(response.read().decode("utf-8"))
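
To see what the POST body looks like before anything is sent, the encoding step can be run on its own. The sketch below only builds the request object and never opens a connection; the second form field is a made-up addition for illustration.

```python
import urllib.parse
import urllib.request

# "hello" comes from the example above; "q" is a hypothetical extra field.
form = {"hello": "world", "q": "web scraping"}

encoded = urllib.parse.urlencode(form)   # percent-encoded key=value pairs
data = encoded.encode("utf-8")           # urlopen() needs bytes, not str

# Building a Request does not send it; urlopen(req) would do that.
req = urllib.request.Request("http://httpbin.org/post", data=data)

print(encoded)
print(type(data).__name__)
print(req.get_method())                  # a request with data defaults to POST
```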

Pass timeout=<seconds> to urlopen() to set a timeout, so that slow requests can be caught and handled

response.status: the HTTP status code of the response

response.getheaders(): all response headers

response.getheader("Server"): the value of the Server header
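
One way to combine the timeout, the status code, and the header lookup is a small wrapper like the one below. The function name fetch and its return shape are assumptions made for this sketch, not part of urllib.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=5.0):
    """Hypothetical helper: return (status, server_header, text).

    On failure the status is the HTTP error code, or None when no
    response was received, and the text holds the reason.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, response.getheader("Server"), body
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 418.
        return err.code, err.headers.get("Server"), str(err.reason)
    except urllib.error.URLError as err:
        # No usable response: bad scheme, DNS failure, refused or timed-out connection.
        return None, None, str(err.reason)
    except socket.timeout:
        # A timeout while reading the body is raised without a URLError wrapper.
        return None, None, "timed out"
```

A call such as fetch("http://httpbin.org/get", timeout=0.01) would be expected to end up in one of the failure branches on most connections. HTTPError is caught before URLError because it is a subclass of it.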

# Disguising the crawler: mainly spoof the browser identifier (User-Agent)
req = urllib.request.Request(url=url, data=data, headers=headers, method="POST")

import urllib.request
import urllib.parse


url="https://movie.douban.com/"
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.53"}
req=urllib.request.Request(url=url,headers=headers)
response=urllib.request.urlopen(req)
print(response.read().decode("utf-8"))
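
Before sending a disguised request, it can help to check what the Request object actually holds. The sketch below builds the request without opening a connection; the shortened User-Agent string is an assumption for illustration.

```python
import urllib.request

url = "https://movie.douban.com/"
# Shortened, hypothetical User-Agent; a real browser string is longer.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

req = urllib.request.Request(url=url, headers=headers, method="GET")

# Request stores header names capitalized as "User-agent",
# so lookups must use that spelling.
stored = req.get_header("User-agent")
missing = req.get_header("User-Agent")

print(req.full_url)
print(req.get_method())
print(stored)
print(missing)
```

Note that method must be a string such as "GET" or "POST"; passing a bare name like post raises a NameError.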


xlwt: working with Excel files

sqlite3: working with SQLite databases
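
A minimal sketch of storing scraped rows with sqlite3 follows. The table name, the columns, and the two sample rows are all made up for illustration, and an in-memory database is used so that no file is created.

```python
import sqlite3

# Hypothetical scraped rows: (title, score).
rows = [("Movie A", 9.1), ("Movie B", 8.4)]

conn = sqlite3.connect(":memory:")   # use a file path to keep the data
conn.execute(
    "CREATE TABLE movie ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "title TEXT NOT NULL, "
    "score REAL)"
)

# "?" placeholders let sqlite3 handle quoting of the scraped values.
conn.executemany("INSERT INTO movie (title, score) VALUES (?, ?)", rows)
conn.commit()

saved = conn.execute(
    "SELECT title, score FROM movie ORDER BY score DESC"
).fetchall()
conn.close()

print(saved)
```

Placeholders are used instead of string formatting because scraped text often contains quotes that would otherwise break the SQL statement.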

Fetch the pages
Parse the data
Save the data
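
The three steps above can be laid out as three small functions. This is only a structural sketch: fetch_page returns canned HTML instead of calling urlopen(), parsing uses a simple regular expression, and the "database" is a plain list. All names here are hypothetical.

```python
import re

# Matches <a href="...">...</a> in the canned HTML below.
LINK = re.compile(r'<a href="(.*?)">(.*?)</a>')


def fetch_page(url):
    """Step 1 (stand-in): return canned HTML instead of downloading url."""
    return '<a href="/a">Movie A</a><a href="/b">Movie B</a>'


def parse_page(html):
    """Step 2: turn raw HTML into a list of row dictionaries."""
    return [{"link": link, "title": title} for link, title in LINK.findall(html)]


def save_rows(rows, store):
    """Step 3 (stand-in): append rows to a list and return its new size."""
    store.extend(rows)
    return len(store)


def crawl(url, store):
    """Run the three steps in order for one page."""
    html = fetch_page(url)
    rows = parse_page(html)
    return save_rows(rows, store)


store = []
total = crawl("http://example.invalid/", store)
print(total, store)
```

Keeping the steps separate means the stand-ins can later be swapped for urllib.request, bs4, and xlwt or sqlite3 without touching crawl().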

cv2: matching slider CAPTCHAs


posted @ 2021-12-29 17:50 菜菜蕪湖起飛
