Python Web Scraping with bs4: A Very Detailed Guide

2022-08-17 13:00:08

Tags: title, Python, tags, web scraping, BeautifulSoup, bs4, html, res, print

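bs4 is the import name of Beautiful Soup 4 (BeautifulSoup), one of the most commonly used libraries for writing Python web scrapers. It is mainly used to parse HTML markup.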

1. Initialization

pip install beautifulsoup4 lxml  # lxml is needed for the 'lxml' parser used below

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html>A Html Text</html>", "html.parser")

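The constructor takes two arguments: the first is the HTML text to parse, and the second names the parser to use. For HTML, "html.parser" is the parser from Python's standard library, so it works without installing anything beyond bs4.

If an HTML or XML document is malformed, different parsers may return different results.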

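Parser	Usage	Advantages
Python standard library	BeautifulSoup(html, "html.parser")	1. Built into Python 2. Decent speed 3. Lenient with malformed documents
lxml HTML	BeautifulSoup(html, "lxml")	1. Very fast 2. Lenient with malformed documents
lxml XML	BeautifulSoup(html, "lxml-xml") or BeautifulSoup(html, "xml")	1. Very fast 2. The only XML parser supported
html5lib	BeautifulSoup(html, "html5lib")	1. Most lenient 2. Parses pages the same way a browser does 3. Produces valid HTML5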
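
Since lxml and html5lib are optional installs, a script may want to fall back to whichever parser is available. The following is a minimal sketch of that idea; make_soup is a hypothetical helper name, and the preference order is only an assumption, not something the library prescribes.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("no usable parser found")


# Malformed input: an <a> that is never closed, followed by a stray </p>.
soup, used = make_soup("<a></p>")
print(used, "->", soup)
```

Running this with different parsers installed is a quick way to see the repaired markup vary from one parser to another.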

Formatted output

print(soup.prettify())  # prettify is a method: call it with parentheses to get the indented string
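
A short check of what prettify gives back can help here. This is only a sketch on a made-up one-line document: calling the method returns a string, while the bare attribute is just a reference to the method.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hi <b>there</b></p>", "html.parser")

pretty = soup.prettify()      # calling the method returns a str
print(type(pretty).__name__)  # str
print(pretty)                 # one tag or text node per line, indented

method = soup.prettify        # no parentheses: a reference to the method itself
print(callable(method))       # True
```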

2. Basic usage

from bs4 import BeautifulSoup

# Build a sample HTML document
html_doc = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="title">
            <b>The Dormouse's story</b>
        </p>
        
        <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
        and they lived at the bottom of a well.</p>
        
        <p class="story">...</p>
    </body>
</html>
"""

2.1 Getting a tag (the first one with that name)

res = BeautifulSoup(html_doc, 'lxml')

print(res.a)

2.2 Getting the text inside a tag

print(res.a.text)

2.3 Getting a tag's attributes

print(res.a.attrs)

2.4 Getting the value of a specific attribute

print(res.a.attrs.get('href'))
print(res.a.get('href'))
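
To make sections 2.2 to 2.4 concrete, here is a small sketch on a single link copied from the sample document. It uses "html.parser" so that it runs without lxml; the "title" attribute is deliberately one the tag does not have.

```python
from bs4 import BeautifulSoup

snippet = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
a = BeautifulSoup(snippet, "html.parser").a

print(a.text)            # Elsie
print(a.attrs["class"])  # ['sister']  (class is multi-valued, so it is a list)
print(a.get("href"))     # http://example.com/elsie
print(a.get("title"))    # None: .get() does not raise for a missing attribute

try:
    a["title"]           # subscript access raises KeyError instead
except KeyError:
    print("no title attribute")
```

Using .get() is the safer choice when an attribute may be absent on some of the scraped tags.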

2.5 Iterating over a tag's direct children

for i in res.p.children:
    print(i)

2.6 Getting a tag's direct children as a list

print(res.p.contents)
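
The children of a tag include whitespace text nodes, not only tags, which often surprises people. The sketch below, on a fragment adapted from the sample document, compares .children with .contents and filters out everything that is not a Tag.

```python
from bs4 import BeautifulSoup, Tag

snippet = '<p class="title">\n<b>The Dormouse\'s story</b>\n</p>'
p = BeautifulSoup(snippet, "html.parser").p

print(len(p.contents))                 # 3: newline, <b>, newline
print(list(p.children) == p.contents)  # same nodes, one as an iterator, one as a list

# Keep only real tags, dropping the newline text nodes.
tags_only = [child for child in p.children if isinstance(child, Tag)]
print([t.name for t in tags_only])     # ['b']
```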

2.7 Getting a tag's parent

print(res.p.parent)

2.8 Iterating over all ancestors, up to the top-level node

for i in res.p.parents:
    print(i)
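
Printing each ancestor dumps its whole subtree, so the output gets long quickly. As a sketch on a tiny made-up document, printing only the names shows the chain that .parents walks, ending at the BeautifulSoup object itself.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>text</b></p></body></html>", "html.parser")

# Walk from the <b> tag up to the document root.
ancestor_names = [parent.name for parent in soup.b.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```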

3. Core bs4 methods

3.1 find

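find returns only the first element that matches. The result is a single Tag object, or None if nothing matches.

3.1.1 Find a tag by name (only the first match is returned)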

print(res.find(name='p'))

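3.1.2 Find a tag with a specific attribute (only the first match is returned)

# html_doc has no element with id="title", so this example looks up id="link1" instead
print(res.find(name='a', id='link1'))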

3.1.3 Add a trailing underscore (class_) to avoid clashing with the Python keyword class

print(res.find(name='p', class_='title'))

3.1.4 Use the attrs parameter to avoid the clash altogether

print(res.find(name='p', attrs={'class': 'title'}))
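
The two spellings in 3.1.3 and 3.1.4 locate the same element, and find returns None when nothing matches. The sketch below checks both points on a shortened fragment; the id "intro" is made up for illustration and does not appear in the sample document.

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>Story</b></p><p class="story" id="intro">Once...</p>'
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find("p", class_="title")
by_attrs = soup.find("p", attrs={"class": "title"})
print(by_keyword is by_attrs)            # True: both locate the same Tag

print(soup.find("p", id="intro").text)   # Once...
print(soup.find("p", id="missing"))      # None
```

Because of the None case, it is worth checking the result before chaining .text or .get() onto it.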

3.2 find_all

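find_all finds every tag that matches. The result is a list, which is empty if nothing matches.

3.2.1 Find all tags with a given name; the result is a list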

print(res.find_all('a'))
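
Since find_all returns a list, the usual next step is to loop over it and pull out an attribute. A minimal sketch, using the three links from the sample document and "html.parser" so it runs without lxml:

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by tag name and class, then collect each link target.
links = soup.find_all("a", class_="sister")
hrefs = [a.get("href") for a in links]
print(hrefs)

# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)
print([a.text for a in first_two])   # ['Elsie', 'Lacie']

print(soup.find_all("span"))         # []
```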

3.3 The select method

select takes a CSS selector. The result is a list.

3.3.1 Find tags whose class includes title

print(res.select('.title'))

3.3.2 Find all b descendants of tags whose class includes title

print(res.select('.title b'))

3.3.3 Find the tag whose id equals link1

# html_doc has no element with id="title"; '#title' would return an empty list
print(res.select('#link1'))
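
A few more selector forms are handy in practice. The sketch below runs on a shortened copy of the sample document and also shows select_one, which returns the first match or None rather than a list.

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

print([b.text for b in soup.select(".title b")])   # ["The Dormouse's story"]
print(soup.select_one("#link2").text)                # Lacie
print(soup.select('a[href$="elsie"]')[0]["id"])      # link1
print(soup.select("#title"))                         # []   (no such id in this fragment)
print(soup.select_one("#title"))                     # None
```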

4. Scraping the Douban movie Top 250 with bs4

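import re

import requests
from bs4 import BeautifulSoup

# Matches a 4-digit run such as a release year; compiled once and reused.
YEAR_RE = re.compile(r'\d{4}')


def main():
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    }

    # start is the offset of the first entry on the page; 0 requests the first page.
    url = "https://movie.douban.com/top250?start=0"

    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # stop on an HTTP error instead of parsing an error page

    soup = BeautifulSoup(response.text, 'lxml')

    movies = []  # not named "list", which would shadow the built-in

    # Each <li> under .grid_view holds one movie.
    for li in soup.select('.grid_view li'):
        movie = {
            "title": "",
            "year": "",
            "score": 0,
            "num": 0
        }

        # Concatenate the main titles, then the alternative titles;
        # \xa0 (non-breaking space) becomes a regular space.
        for item in li.select('.title'):
            movie['title'] += item.text.replace("\xa0", " ")

        for item in li.select('.other'):
            movie['title'] += item.text.replace("\xa0", " ")

        # Keep the last 4-digit run found in the info paragraphs as the year.
        for item in li.select(".bd p"):
            for match in YEAR_RE.finditer(item.text):
                movie['year'] = match.group()

        for item in li.select(".rating_num"):
            movie['score'] = item.text

        # The last <span> inside .star holds the vote count followed by "人评价".
        movie['num'] = li.select(".star span")[-1].text.replace("人评价", "")

        movies.append(movie)

    print(movies)


if __name__ == '__main__':
    main()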
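
The parsing logic can be tried without touching the network by separating it from the request. The following is a sketch under stated assumptions: SAMPLE is a hypothetical fragment written only to match the selectors used in the script (it is not copied from the real page), parse_movies is an illustrative helper name, and the year is taken from the first 4-digit run in the first info paragraph, which is a simplification.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment, shaped only to fit the selectors used in the script.
SAMPLE = """
<ol class="grid_view">
  <li>
    <div class="hd">
      <span class="title">Movie A</span>
      <span class="title">&nbsp;/&nbsp;Film A</span>
      <span class="other">&nbsp;/&nbsp;Alias A</span>
    </div>
    <div class="bd">
      <p>Director: Someone<br>1994&nbsp;/&nbsp;Country&nbsp;/&nbsp;Drama</p>
      <div class="star">
        <span class="rating_num">9.7</span>
        <span>12345人评价</span>
      </div>
    </div>
  </li>
</ol>
"""

YEAR_RE = re.compile(r"\d{4}")


def parse_movies(markup):
    """Extract title, year, score and vote count from each list item."""
    soup = BeautifulSoup(markup, "html.parser")
    movies = []
    for li in soup.select(".grid_view li"):
        # Titles come back in document order: main titles, then aliases.
        title = "".join(s.text for s in li.select(".title, .other"))
        info = li.select_one(".bd p")
        year = YEAR_RE.search(info.text) if info else None
        votes = li.select(".star span")[-1].text.replace("人评价", "")
        movies.append({
            "title": title.replace("\xa0", " "),
            "year": year.group() if year else "",
            "score": li.select_one(".rating_num").text,
            "num": votes,
        })
    return movies


print(parse_movies(SAMPLE))
```

Keeping the parser as a plain function of the markup means the same code can be fed response.text from a real request or a saved page during testing.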

Source: https://www.cnblogs.com/chenyangqit/p/16594745.html