bs4 - HTML manipulation
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
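from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
Navigating the document tree
Child nodes
A Tag may contain several strings or other Tags; all of these are the Tag's child nodes. Beautiful Soup provides many attributes for working with and iterating over child nodes.
Note: string nodes in Beautiful Soup do not support these attributes, because a string has no children.
The tag's name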
head_tag = soup.head
# <head><title>The Dormouse's story</title></head>
head_name = head_tag.name
# head
.contents and .children
A tag's .contents attribute returns the tag's child nodes as a list
head_tag = soup.head
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
A tag's .children generator lets you loop over its child nodes
for child in soup.title.children:
    print(child)
# The Dormouse's story
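The difference between the two attributes can be sketched as follows. This is a minimal illustration; the markup and variable names are invented for the example. .contents is a real list that supports indexing, while .children iterates over the same nodes one at a time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, chosen only to illustrate the two attributes.
demo = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = demo.ul

# .contents is a plain list, so it supports len() and indexing.
first_item = ul.contents[0]

# .children yields the same nodes lazily, one at a time.
child_names = [child.name for child in ul.children]

print(first_item)   # <li>one</li>
print(child_names)  # ['li', 'li']
```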
.descendants
The .descendants attribute iterates recursively over all of a tag's descendants
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
The BeautifulSoup object has only one direct child (the <html> tag), but many descendants (the exact counts depend on the parser and on the whitespace in the document)
len(list(soup.children))
# 1
len(list(soup.descendants))
# 25
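The gap between children and descendants is easier to verify by hand on a tiny document. The sketch below uses invented markup, not the document from this article, so that every node can be counted.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, small enough to count every node by hand.
demo = BeautifulSoup("<div><p>a<b>b</b></p></div>", "html.parser")

direct = list(demo.children)         # only the <div>
everything = list(demo.descendants)  # <div>, <p>, 'a', <b>, 'b'

print(len(direct))      # 1
print(len(everything))  # 5
```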
.string
If a tag has exactly one child and that child is a NavigableString, or exactly one child tag that itself resolves to a single string, .string returns that text
# The only child is a NavigableString
soup.title.string
# u'The Dormouse's story'
# The only child is another tag
soup.head.contents
# [<title>The Dormouse's story</title>]
soup.head.string
# u'The Dormouse's story'
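One case worth spelling out is a tag with more than one child: .string then cannot tell which string is meant and evaluates to None. A minimal sketch with invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> holds a string and a tag, <b> holds one string.
demo = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# <b> has a single string child, so .string returns it.
print(demo.b.string)      # world

# <p> has two children (a string and a tag), so .string is None.
print(demo.p.string)      # None

# get_text() concatenates every string beneath the tag instead.
print(demo.p.get_text())  # Hello world
```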
.strings and .stripped_strings
If a tag contains more than one string, use .strings to iterate over them
for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
Use .stripped_strings to skip whitespace-only strings and strip leading and trailing whitespace from the rest
for string in soup.stripped_strings:
    print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# u','
# u'Lacie'
# u'and'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'...'
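A side-by-side comparison of the two generators, as a small sketch on invented markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with padding spaces and a newline between the tags.
demo = BeautifulSoup("<p> a </p>\n<p> b </p>", "html.parser")

raw = list(demo.strings)             # whitespace kept as-is
clean = list(demo.stripped_strings)  # blank strings skipped, the rest stripped

print(raw)    # [' a ', '\n', ' b ']
print(clean)  # ['a', 'b']
```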
Parent nodes
.parent and .parents
Use the .parent attribute to get an element's parent
title_tag = soup.title
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
An element's .parents attribute iterates over all of its ancestors, from the direct parent up to the BeautifulSoup object
link = soup.a
# <a class="sister" id="link1">Elsie</a>
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]
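Collecting the ancestor names into a list is a convenient way to check where an element sits in the tree. A minimal sketch, with invented markup and variable names:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single, fully nested chain of tags.
demo = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

# Walk from the direct parent up to the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in demo.a.parents]

print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```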
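Sibling nodes
.next_sibling and .previous_sibling
Use the .next_sibling and .previous_sibling attributes to move between nodes on the same level of the tree
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
# <b> and <c> are siblings: both are direct children of <a>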
sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>
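In real documents the sibling of a tag is often a whitespace string rather than the next tag, because the text between two tags is a node too. A small sketch with invented markup that places a newline between the tags:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: note the newline between </b> and <c>.
demo = BeautifulSoup("<a><b>text1</b>\n<c>text2</c></a>", "html.parser")

whitespace = demo.b.next_sibling       # the newline string, not <c>
next_tag = whitespace.next_sibling     # one step further reaches <c>

print(repr(whitespace))  # '\n'
print(next_tag)          # <c>text2</c>
```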
.next_siblings and .previous_siblings
The .next_siblings and .previous_siblings attributes iterate over a node's following or preceding siblings
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# u',\n'
# <a class="sister" id="link2">Lacie</a>
# u' and\n'
# <a class="sister" id="link3">Tillie</a>
# u';\nand they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# u' and\n'
# <a class="sister" id="link2">Lacie</a>
# u',\n'
# <a class="sister" id="link1">Elsie</a>
# u'Once upon a time there were three little sisters; and their names were\n'
Going forward and back
.next_element and .previous_element
The .next_element attribute points to whatever was parsed immediately afterwards (a string or a tag). It may be the same as .next_sibling, but usually it is different
last_a_tag = soup.find("a", id="link3")
# <a class="sister" id="link3">Tillie</a>
last_a_tag.next_sibling
# u';\nand they lived at the bottom of a well.'
The .next_element of this <a> tag is whatever was parsed immediately after the opening <a> tag, not the rest of the sentence: it is the string Tillie. In the original markup, Tillie appears before the semicolon. The parser encountered the <a> tag, then the string Tillie, then the closing </a> tag, then the semicolon and the rest of the sentence. The semicolon is on the same level as the <a> tag, but the string Tillie was parsed first
last_a_tag.next_element
# u'Tillie'
The .previous_element attribute is the exact opposite of .next_element: it points to whatever was parsed immediately before the current object
last_a_tag.previous_element
# u' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" id="link3">Tillie</a>
.next_elements and .previous_elements
The .next_elements and .previous_elements iterators move forward or backward through the document in parse order, as if it were being parsed again
for element in last_a_tag.next_elements:
    print(repr(element))
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# <p class="story">...</p>
# u'...'
# u'\n'
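The contrast between sibling order and parse order fits in a few lines. This is a minimal sketch on invented markup: the tag's next sibling is the text after it, while its next element is the string inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link followed by trailing text.
demo = BeautifulSoup("<p><a>Tillie</a>; the end</p>", "html.parser")
link = demo.a

sibling = link.next_sibling           # same level as <a>
element = link.next_element           # first thing parsed after <a> opened
remaining = list(link.next_elements)  # everything parsed after <a>, in order

print(repr(sibling))  # '; the end'
print(repr(element))  # 'Tillie'
print(remaining)      # ['Tillie', '; the end']
```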
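Searching the document tree
find() and find_all()
find( name , attrs , recursive , string , **kwargs ) and find_all( name , attrs , recursive , string , limit , **kwargs ) are used in the same way. The only difference is the return value:
find() returns the first matching element, or None if nothing matches; find_all() returns a list of all matching elements, or an empty list if nothing matches
Both methods look through the descendants of the current tag and keep the ones that match the filters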
soup.find_all("title")
# [<title>The Dormouse's story</title>]
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]
soup.find_all("a")
# [<a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link3">Tillie</a>]
soup.find_all(id="link2")
# [<a class="sister" id="link2">Lacie</a>]
import re
soup.find(string=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'
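Because find() can return None, code that chains attribute access onto its result should guard against a missing element. A minimal sketch, with invented markup and variable names:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that contains no <table>.
demo = BeautifulSoup("<p>only a paragraph</p>", "html.parser")

missing = demo.find("table")          # no match: None
all_missing = demo.find_all("table")  # no match: empty list

# Guard before calling methods on the result of find().
caption = missing.get_text() if missing is not None else ""

print(missing)        # None
print(all_missing)    # []
print(repr(caption))  # ''
```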
Keyword arguments
1. Any keyword argument that is not one of the method's named parameters is treated as a filter on the tag attribute of that name
soup.find_all(id='link2')
# [<a class="sister" id="link2">Lacie</a>]
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" id="link1">Elsie</a>]
2. Find every tag that has an id attribute, whatever its value
soup.find_all(id=True)
# [<a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link3">Tillie</a>]
3. Pass several keyword arguments to filter on several attributes at once
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" id="link1">Elsie</a>]
4. Some attribute names cannot be used as keyword arguments, for example HTML5 data-* attributes, because they are not valid Python identifiers. Pass a dictionary through the attrs parameter of find_all() to search on such attributes
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(data-foo="value")
# SyntaxError (the exact message depends on the Python version)
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
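The attrs dictionary accepts the same kinds of value as keyword arguments, so it can mix an exact value with an existence test. A small sketch; the markup and the data-id attribute are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two cards, only one carries a data-id attribute.
demo = BeautifulSoup(
    '<div data-id="7" class="card">x</div><div class="card">y</div>',
    "html.parser",
)

# Exact match on an attribute whose name is not a valid Python identifier.
by_data = demo.find_all(attrs={"data-id": "7"})

# Combine a class filter with "has a data-id attribute, any value".
by_both = demo.find_all("div", attrs={"class": "card", "data-id": True})

print([d.get_text() for d in by_data])  # ['x']
print([d.get_text() for d in by_both])  # ['x']
```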
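Searching by CSS class
As of Beautiful Soup 4.1.2, the class_ keyword argument searches for tags with a given CSS class (class is a reserved word in Python, hence the trailing underscore)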
soup.find_all("a", class_="sister")
# [<a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link3">Tillie</a>]
1. The class_ argument accepts the same kinds of filter as other arguments: a string, a regular expression, a function, or True
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
soup.find_all(class_=has_six_characters)
# [<a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link3">Tillie</a>]
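2. A tag's class attribute is multi-valued. When you search by CSS class, you can match against any single class of the tag, or against the exact string value of the class attribute: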
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]
css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
Note: when matching the exact class string, the search finds nothing if the class names are in a different order from the markup:
css_soup.find_all("p", class_="strikeout body")
# []
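When the order of the class names should not matter, a CSS selector is one way around the exact-string limitation. A minimal sketch, assuming a Beautiful Soup version whose .select() support is installed (it relies on the soupsieve package in current releases):

```python
from bs4 import BeautifulSoup

css_demo = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Exact string match: fails because the order differs from the markup.
exact = css_demo.find_all("p", class_="strikeout body")

# CSS selector: requires both classes, in any order.
either_order = css_demo.select("p.strikeout.body")

print(len(exact))         # 0
print(len(either_order))  # 1
```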
The string argument
The string argument searches for strings in the document instead of tags. Like the name argument, it accepts a string, a regular expression, a list, a function, or True.
soup.find_all(string="Elsie")
# [u'Elsie']
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return s == s.parent.string
soup.find_all(string=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
string can also be combined with arguments that filter tags: Beautiful Soup then finds the tags whose .string attribute matches the string value
soup.find_all("a", string="Elsie")
# [<a class="sister" id="link1">Elsie</a>]
The limit argument
Limits the number of results returned; the search stops once that many matches have been found
soup.find_all("a", limit=2)
# [<a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a>]
The recursive argument
When you call find_all() on a tag, Beautiful Soup examines all of the tag's descendants. To search only the tag's direct children, pass recursive=False.
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []
The <title> tag is beneath the <html> tag, but it is not a direct child: the <head> tag is. Beautiful Soup finds <title> when it is allowed to look at all descendants of <html>, but with recursive=False the search is restricted to direct children, so <title> is not found
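The same behaviour can be checked on a smaller document. This is a minimal sketch with invented markup, showing that recursive=False still finds the direct child <head>:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <title> is a grandchild of <html>.
demo = BeautifulSoup("<html><head><title>t</title></head></html>", "html.parser")

grandchild = demo.html.find_all("title", recursive=False)  # not a direct child
direct_child = demo.html.find_all("head", recursive=False)  # direct child

print(grandchild)         # []
print(len(direct_child))  # 1
```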
Calling a tag is like calling find_all()
The BeautifulSoup object and Tag objects can be called like functions. Calling one gives the same result as calling its find_all() method, as shown below:
soup.find_all("a")
soup("a")
Or:
soup.title.find_all(string=True)
soup.title(string=True)
Aside: soup.head.title is shorthand for navigating by tag name. It works by calling find() repeatedly on the current tag
soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>
find_parents() and find_parent()
find_parents( name , attrs , string , limit , **kwargs )
a_string = soup.find(string="Lacie")
# u'Lacie'
a_string.find_parents("a")
# [<a class="sister" id="link2">Lacie</a>]
a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a> and
# <a class="sister" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
a_string.find_parents("p", class_="title")
# []
One of the <a> tags in the document is the direct parent of the string, so the search finds it. One of the <p> tags is an indirect parent of the string, so it is found as well. The <p> tag with the CSS class title is not an ancestor of the string, so find_parents() cannot find it.
find_next_siblings() and find_next_sibling()
find_next_siblings( name , attrs , string , limit , **kwargs )
find_next_siblings() returns all following siblings that match; find_next_sibling() returns only the first one
first_link = soup.a
# <a class="sister" id="link1">Elsie</a>
first_link.find_next_siblings("a")
# [<a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link3">Tillie</a>]
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_next_sibling("p")
# <p class="story">...</p>
find_previous_siblings() and find_previous_sibling()
find_previous_siblings( name , attrs , string , limit , **kwargs )
find_previous_siblings() returns all preceding siblings that match; find_previous_sibling() returns only the first one
last_link = soup.find("a", id="link3")
# <a class="sister" id="link3">Tillie</a>
last_link.find_previous_siblings("a")
# [<a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link1">Elsie</a>]
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")
# <p class="title"><b>The Dormouse's story</b></p>
find_all_next() and find_next()
find_all_next( name , attrs , string , limit , **kwargs )
These two methods use .next_elements to iterate over the tags and strings that come after the current tag. find_all_next() returns all matching nodes; find_next() returns only the first one
first_link = soup.a
# <a class="sister" id="link1">Elsie</a>
first_link.find_all_next(string=True)
# [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
# u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']
first_link.find_next("p")
# <p class="story">...</p>
find_all_previous() and find_previous()
find_all_previous( name , attrs , string , limit , **kwargs )
These two methods use .previous_elements to iterate over the tags and strings that come before the current node. find_all_previous() returns all matching nodes; find_previous() returns only the first one
first_link = soup.a
# <a class="sister" id="link1">Elsie</a>
first_link.find_all_previous("p")
# [<p class="story">Once upon a time there were three little sisters; ...</p>,
# <p class="title"><b>The Dormouse's story</b></p>]
first_link.find_previous("title")
# <title>The Dormouse's story</title>
CSS selectors
Beautiful Soup supports most CSS selectors (http://www.w3.org/TR/CSS2/selector.html). Pass a selector string to the .select() method of a Tag or of the BeautifulSoup object to find tags using CSS selector syntax
soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
1. Find tags nested beneath other tags
soup.select("body a")
# [<a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]
2. Find tags that are direct children of another tag
soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" id="link1">Elsie</a>]
soup.select("body > a")
# []
3. Find the siblings of a tag
soup.select("#link1 ~ .sister")
# [<a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" id="link2">Lacie</a>]
4. Find tags by CSS class
soup.select(".sister")
# [<a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link3">Tillie</a>]
5. Find tags by id
soup.select("#link1")
# [<a class="sister" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" id="link2">Lacie</a>]
6. Match any of several selectors at once
soup.select("#link1,#link2")
# [<a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a>]
7. Test for the existence of an attribute
soup.select('a[href]')
# [<a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link3">Tillie</a>]
8. Find tags by attribute value
soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" id="link1">Elsie</a>]
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" id="link1">Elsie</a>,
# <a class="sister" id="link2">Lacie</a>,
# <a class="sister" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')
# [<a class="sister" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')
# [<a class="sister" id="link1">Elsie</a>]
9. Match language codes
multilingual_markup = """
<p lang="en">Hello</p>
<p lang="en-us">Howdy, y'all</p>
<p lang="en-gb">Pip-pip, old fruit</p>
<p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup, 'html.parser')
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
# <p lang="en-us">Howdy, y'all</p>,
# <p lang="en-gb">Pip-pip, old fruit</p>]
10. Return only the first matching element
soup.select_one(".sister")
# <a class="sister" id="link1">Elsie</a>
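select() and select_one() differ in the same way as find_all() and find() when nothing matches. A minimal sketch on invented markup; the .brother class is deliberately absent:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links sharing one class.
demo = BeautifulSoup(
    '<p><a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a></p>',
    "html.parser",
)

first = demo.select_one(".sister")     # first match only
nothing = demo.select_one(".brother")  # no match: None
empty = demo.select(".brother")        # no match: empty list

print(first["id"])  # link1
print(nothing)      # None
print(empty)        # []
```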
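Modifying the document tree
Changing tag names and attributes
You can rename a tag, change the value of an attribute, and add or delete attributes
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')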
tag = soup.b
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
Modifying .string
Assigning to a tag's .string attribute replaces the tag's contents with the string you give
markup = '<a>I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.string = "New link text."
tag
# <a>New link text.</a>
Note: if the tag contains other tags, assigning to .string overwrites all of its existing contents, including those child tags
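The loss of child tags can be observed directly by looking up the child before and after the assignment. A minimal sketch with invented variable names:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<a>I linked to <i>example.com</i></a>", "html.parser")
tag = demo.a

child_before = tag.i           # the <i> tag exists
tag.string = "New link text."  # replaces every child of <a>
child_after = tag.i            # the <i> tag is gone

print(child_before)  # <i>example.com</i>
print(tag)           # <a>New link text.</a>
print(child_after)   # None
```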
append()
Tag.append() adds content to the end of a tag
soup = BeautifulSoup("<a>Foo</a>", 'html.parser')
soup.a.append("Bar")
soup
# <a>FooBar</a>
soup.a.contents
# [u'Foo', u'Bar']
insert()
Tag.insert() is like Tag.append(), except that the new element is not necessarily added at the end of its parent's .contents; it is inserted at the position you specify
markup = '<a>I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
tag = soup.a
tag.insert(1, "but did not endorse ")
tag
# <a>I linked to but did not endorse <i>example.com</i></a>
tag.contents
# [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
insert_before() and insert_after()
insert_before() inserts content immediately before a tag or text node
soup = BeautifulSoup("<b>stop</b>", 'html.parser')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b
# <b><i>Don't</i>stop</b>
insert_after() inserts content immediately after a tag or text node
soup.b.i.insert_after(soup.new_string(" ever "))
soup.b
# <b><i>Don't</i> ever stop</b>
soup.b.contents
# [<i>Don't</i>, u' ever ', u'stop']
replace_with()
PageElement.replace_with() removes a piece of content from the tree and puts a new tag or text node in its place
markup = '<a>I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)
a_tag
# <a>I linked to <b>example.net</b></a>
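replace_with() also returns the element it removed, so the old content can be inspected or reinserted elsewhere. A minimal sketch; the variable names are invented for illustration:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<a>I linked to <i>example.com</i></a>", "html.parser")

new_tag = demo.new_tag("b")
new_tag.string = "example.net"

# The return value is the element that was taken out of the tree.
old_tag = demo.a.i.replace_with(new_tag)

print(demo.a)          # <a>I linked to <b>example.net</b></a>
print(old_tag)         # <i>example.com</i>
print(old_tag.parent)  # None
```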
This article comes from cnblogs and is for reference and study only. Corrections are welcome and appreciated. When reposting, please credit the original link: http://www.rzrgm.cn/rong-z/p/17879908.html
Author: cnblogs user