Python Beautiful Soup Library

Beautiful Soup (literally "delicious soup") is a handy library for web scraping.

Installation

pip3 install beautifulsoup4

Verifying the installation

from bs4 import BeautifulSoup
# or
import bs4

Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/latest/

Usage

from bs4 import BeautifulSoup

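# Parse from an open file handle (closed automatically by the with block)
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")
# Parse from a string
soup = BeautifulSoup("<html>data</html>", "html.parser")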
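
The constructor accepts either a string or a file-like object. As a minimal sketch, the snippet below uses io.StringIO as a stand-in for a real file (an assumption made so the example runs without an index.html on disk) and shows that both inputs produce the same tree.

```python
import io
from bs4 import BeautifulSoup

MARKUP = "<html><p>data</p></html>"

# io.StringIO stands in for a real file such as open("index.html")
fake_file = io.StringIO(MARKUP)
soup_from_file = BeautifulSoup(fake_file, "html.parser")

# The same markup passed directly as a string
soup_from_text = BeautifulSoup(MARKUP, "html.parser")

print(soup_from_file.p.string)
print(str(soup_from_file) == str(soup_from_text))
```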

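Beautiful Soup parsers

The official documentation covers the pros and cons of each parser in detail; the table below is only a brief overview. Here mk stands for the markup to parse.

Parser	Usage	Requirement
Python's built-in HTML parser	BeautifulSoup(mk, 'html.parser')	Install the bs4 library
lxml HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib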
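
Since lxml and html5lib are optional installs, one possible approach is to try the preferred parsers in order and fall back to html.parser. This is only a sketch: make_soup and its preferred parameter are hypothetical names introduced here, not part of bs4. It relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("no usable parser found")


# Deliberately unclosed tags: every parser listed above repairs them
soup, used = make_soup("<p>Hello<b>world")
print(used)
print(soup.b.string)
```

Which parser gets used depends on what is installed, so the first printed line can differ between machines.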

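Basic elements of the BeautifulSoup class

Element	Description
Tag	A tag, the basic unit of information; its start and end are marked with <> and </>
Name	The name of a tag; accessed as Tag.name
Attributes	The attributes of a tag, organized as a dictionary; accessed as Tag.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>; accessed as Tag.string
Comment	A comment inside a tag; a special subclass of NavigableString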

Tag

A Tag object corresponds to a tag in the original XML or HTML document.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(tag) # <b class="boldest">Extremely bold</b>
print(type(tag)) # <class 'bs4.element.Tag'>

Name

Every tag has a name, available through .name.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(tag.name) # b

Attributes

A tag may have any number of attributes, and they are accessed the same way as dictionary entries.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(tag.attrs) # {'class': ['boldest']}
print(tag['class']) # ['boldest']
print(tag.attrs['class']) # ['boldest']
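
Because attributes behave like dictionary entries, they can also be added, looked up with a default, and deleted. A small sketch of those operations follows; the id value "verybold" and the missing "title" attribute are just illustrative choices.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

# Add a new attribute with item assignment
tag["id"] = "verybold"
has_id_after_set = "id" in tag.attrs

# .get() returns None (or a supplied default) instead of raising KeyError
missing = tag.get("title")
fallback = tag.get("title", "n/a")

# Remove the attribute again
del tag["id"]

print(has_id_after_set, missing, fallback)
print(tag.attrs)
```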

NavigableString

The string contained in a tag.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(tag.string) # Extremely bold
print(type(tag.string)) # <class 'bs4.element.NavigableString'>
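
One thing worth knowing: .string only works when a tag has a single child. When a tag mixes text and nested tags, .string is None, and get_text() or .stripped_strings can be used instead. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>bold</b> world</p>", "html.parser")
p = soup.p

# <p> has three children (text, <b>, text), so .string is ambiguous
mixed_string = p.string

# get_text() concatenates every string under the tag
full_text = p.get_text()

# stripped_strings yields each string with surrounding whitespace removed
pieces = list(p.stripped_strings)

print(mixed_string)
print(full_text)
print(pieces)
```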

Comment

The comments in a document.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup,"html.parser")
comment = soup.b.string
print(comment) # Hey, buddy. Want to buy a used parser?
print(type(comment)) # <class 'bs4.element.Comment'>
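
Since .string returns a comment just like ordinary text, code that only wants visible text may need to tell the two apart. One way, sketched below with made-up markup, is an isinstance check against Comment.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<p>visible text<!-- hidden note --></p>", "html.parser")

# Comment is a subclass of NavigableString, so check for Comment first
comments = [n for n in soup.p.contents if isinstance(n, Comment)]
texts = [
    n for n in soup.p.contents
    if isinstance(n, NavigableString) and not isinstance(n, Comment)
]

print(comments)
print(texts)
```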

Traversing the document

Sample document

from bs4 import BeautifulSoup
html_doc = """
<html>
    <head><title>The Dormouse's story</title></head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>

        <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.</p>

        <p class="story">...</p>
    </body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.name) # [document]
print(soup.head) # <head><title>The Dormouse's story</title></head>
print(soup.title) # <title>The Dormouse's story</title>

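Traversing child nodes

Attribute	Description
.contents	A list of the tag's direct children
.children	An iterator over the direct children; like .contents, but meant for looping
.descendants	An iterator over all descendants (children, grandchildren, and so on), meant for looping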

# Continuing with the sample document above
print(soup.head.contents)  # [<title>The Dormouse's story</title>]

# Iterates over direct children only
for child in soup.body.children:
    print(child)

# Iterates over all descendants, including strings
for child in soup.body.descendants:
    print(child)
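
To make the difference between .children and .descendants concrete, here is a small sketch on a made-up fragment that collects what each one visits.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

# Direct children only: the two <p> tags
child_names = [child.name for child in div.children]

# All descendants, split into tags and strings, in document order
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]
descendant_strings = [str(d) for d in div.descendants if not isinstance(d, Tag)]

print(child_names)
print(descendant_tags)
print(descendant_strings)
```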

Traversing parent nodes

Attribute	Description
.parent	The node's parent tag
.parents	An iterator over all of the node's ancestors, meant for looping

print(soup.title.parent) # <head><title>The Dormouse's story</title></head>
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# p
# body
# html
# [document]
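
A common use of .parents is to build the path from the document root down to a tag. The helper below is a sketch; tag_path is a hypothetical name, not a bs4 function.

```python
from bs4 import BeautifulSoup


def tag_path(tag):
    """Hypothetical helper: names from the document root down to the given tag."""
    names = [tag.name] + [parent.name for parent in tag.parents]
    return " > ".join(reversed(names))


soup = BeautifulSoup("<html><body><p><a>link</a></p></body></html>", "html.parser")
path = tag_path(soup.a)
print(path)
```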

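Traversing sibling nodes

Attribute	Description
.next_sibling	The next sibling node in document order
.previous_sibling	The previous sibling node in document order
.next_siblings	An iterator over all following sibling nodes in document order
.previous_siblings	An iterator over all preceding sibling nodes in document order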

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
print(sibling_soup.b.next_sibling)  # <c>text2</c>
print(sibling_soup.c.previous_sibling)  # <b>text1</b>

for sibling in sibling_soup.b.next_siblings:
    print(sibling)  # iterates over all following siblings

for sibling in sibling_soup.c.previous_siblings:
    print(sibling)  # iterates over all preceding siblings
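
In real documents the sibling of a tag is often a string (punctuation or whitespace) rather than another tag. The sketch below, on made-up markup, shows that case and how find_next_sibling() skips straight to the next matching tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>', "html.parser"
)
first_link = soup.a

# The immediate sibling is the string ", " between the two links
raw_sibling = first_link.next_sibling

# Filtering by type keeps only the tag siblings
tag_siblings = [s for s in first_link.next_siblings if isinstance(s, Tag)]

# find_next_sibling() searches the following siblings for a matching tag
next_link = first_link.find_next_sibling("a")

print(repr(raw_sibling))
print(next_link["id"])
```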

Formatted HTML output with bs4

The prettify() method

soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
print(soup.prettify())
# <a>
#  <b>
#   text1
#  </b>
#  <c>
#   text2
#  </c>
# </a>
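
For comparison, str() renders the same tree without adding any line breaks, which may be preferable when the output is stored or compared rather than read. A brief sketch:

```python
from bs4 import BeautifulSoup

MARKUP = "<a><b>text1</b><c>text2</c></a>"
soup = BeautifulSoup(MARKUP, "html.parser")

# str() keeps the compact form; prettify() puts each node on its own line
compact = str(soup)
pretty = soup.prettify()

print(compact)
print(len(pretty.splitlines()))
```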

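Searching the document

find_all()

Returns a list of results (a ResultSet).

Parameters:

name: a filter on the tag name

soup.find_all('a') # all <a> tags
soup.find_all(['a', 'b']) # all <a> and <b> tags

attrs: a filter on attribute values; attributes can also be matched by keyword

import re

soup.find_all('p', 'course') # all <p> tags whose CSS class is course
soup.find_all(id='link') # all tags whose id is exactly link
soup.find_all(id=re.compile('link')) # all tags whose id matches the regular expression

recursive: whether to search all descendants; defaults to True

soup.find_all('a', recursive=False) # search direct children only

string: a filter on the text between <> and </>

soup.find_all(string="Python") # only the strings that are exactly Python
soup.find_all(string=re.compile('Python')) # all strings that contain Python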
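
The four parameters can be tried side by side on one small document. The markup, ids, and class names below are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

HTML = """
<div id="root">
  <a id="link1" class="sister" href="/elsie">Elsie</a>
  <p class="story">Learn <a id="link2" class="sister" href="/lacie">Python</a></p>
  <b>Python</b>
</div>
"""
root = BeautifulSoup(HTML, "html.parser").div

# name: a list matches any of the given tag names
by_name = root.find_all(["a", "b"])

# attrs: a plain string as the second argument matches the CSS class
by_class = root.find_all("a", "sister")

# keyword attribute filter with a regular expression
by_id = root.find_all(id=re.compile(r"^link"))

# recursive=False looks only at the direct children of root
direct_links = root.find_all("a", recursive=False)

# string: returns the matching text nodes, not the tags around them
exact_strings = root.find_all(string="Python")

print([t.name for t in by_name])
print([t["id"] for t in direct_links])
print(exact_strings)
```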

True

True matches any value. The code below finds every tag in the document but returns no string nodes.

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
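
Because find_all(True) yields every tag, it pairs naturally with collections.Counter to tally how often each tag occurs. A sketch on a made-up fragment:

```python
from collections import Counter
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li></ul><p>c</p>", "html.parser")

# Count occurrences of each tag name; string nodes are never included
tag_counts = Counter(tag.name for tag in soup.find_all(True))

print(tag_counts["li"], tag_counts["ul"], tag_counts["p"])
```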

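Extension methods

Tag(..) is equivalent to Tag.find_all(..)

soup(..) is equivalent to soup.find_all(..)

Method	Description
<>.find()	Searches and returns only the first result; same parameters as .find_all()
<>.find_parents()	Searches the ancestors and returns a list; same parameters as .find_all()
<>.find_parent()	Returns a single result from the ancestors; same parameters as .find()
<>.find_next_siblings()	Searches the following siblings and returns a list; same parameters as .find_all()
<>.find_next_sibling()	Returns a single result from the following siblings; same parameters as .find()
<>.find_previous_siblings()	Searches the preceding siblings and returns a list; same parameters as .find_all()
<>.find_previous_sibling()	Returns a single result from the preceding siblings; same parameters as .find()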
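
A short sketch of several of these methods on a made-up list follows. Note that find() returns None when nothing matches, so the result should be checked before use.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<ul><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>', "html.parser"
)
middle = soup.find("li", id="b")

parent_name = middle.find_parent("ul").name
next_id = middle.find_next_sibling("li")["id"]
previous_id = middle.find_previous_sibling("li")["id"]
following_ids = [li["id"] for li in middle.find_next_siblings("li")]

# find() returns None when there is no match
no_table = soup.find("table")

# Calling the soup directly is shorthand for find_all()
shorthand_same = soup("li") == soup.find_all("li")

print(parent_name, previous_id, next_id, following_ids, no_table, shorthand_same)
```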

Example

A targeted crawler for Chinese university rankings

import requests
from bs4 import BeautifulSoup
import bs4


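def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # treat HTTP error codes as failures
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""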


def fillUnivList(uList, html):
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # empty page or no ranking table found
        return
    for tr in tbody.children:
        # Skip the whitespace strings between rows; keep only <tr> tags
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            uList.append([tds[0].string, tds[1].string, tds[3].string])


def printUnivList(uList, num):
    # Chinese and Western characters have different display widths, so padding
    # Chinese text with ASCII spaces misaligns the columns. Padding with the
    # full-width space chr(12288) instead avoids the problem.
    template = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(template.format("排名", "学校名称", "总分", chr(12288)))
    for u in uList[:num]:
        print(template.format(u[0], u[1], u[2], chr(12288)))


def main():
    uInfo = []
    url = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html"
    html = getHTMLText(url)
    fillUnivList(uInfo, html)
    printUnivList(uInfo, 10)


main()

#     排名      　　　学校名称　　　      总分
#     1       　　　清华大学　　　     95.9
#     2       　　　北京大学　　　     82.6
#     3       　　　浙江大学　　　      80
#     4       　　上海交通大学　　     78.7
#     5       　　　复旦大学　　　     70.9
#     6       　　　南京大学　　　     66.1
#     7       　中国科学技术大学　     65.5
#     8       　哈尔滨工业大学　　     63.5
#     9       　　华中科技大学　　     62.9
#     10      　　　中山大学　　　     62.1
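
The ranking page may no longer be reachable, so here is an offline sketch of the same extraction and formatting steps. SAMPLE_HTML is a made-up stand-in for the real page (its third column is a placeholder), and extract_rows is a hypothetical helper that mirrors fillUnivList.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up stand-in for the ranking page; the third column is a placeholder
SAMPLE_HTML = """
<table><tbody>
<tr><td>1</td><td>清华大学</td><td>-</td><td>95.9</td></tr>
<tr><td>2</td><td>北京大学</td><td>-</td><td>82.6</td></tr>
</tbody></table>
"""

FULL_WIDTH_SPACE = chr(12288)  # U+3000, as wide as one CJK character
TEMPLATE = "{0:^10}\t{1:{3}^10}\t{2:^10}"


def extract_rows(html):
    """Hypothetical helper: return [rank, name, score] for each table row."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    rows = []
    if tbody is None:
        return rows
    for tr in tbody.children:
        # Whitespace between rows shows up as strings, so keep tags only
        if isinstance(tr, Tag):
            tds = tr("td")
            rows.append([tds[0].string, tds[1].string, tds[3].string])
    return rows


rows = extract_rows(SAMPLE_HTML)
lines = [TEMPLATE.format(r[0], r[1], r[2], FULL_WIDTH_SPACE) for r in rows]
for line in lines:
    print(line)
```

The middle field is padded to ten characters with full-width spaces, so names of different lengths still occupy the same display width.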