Scraping a CSDN blog post with Selenium and extracting the data in 4 ways

2020-06-16 22:41:00  Views: 231  Source: Internet

Tags: title text selenium scraping csdn time article class

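To help susu learn Selenium, the code below uses Selenium to scrape the title and publication time of a blog post, then parses the page data in four ways: Selenium's own element lookup, lxml's etree, BeautifulSoup (bs4), and the Selector that ships with the Scrapy framework.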

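Of course, the request side could also be handled with urllib or requests; aiohttp can be used for asynchronous crawling, and Splash for fetching dynamically rendered pages.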
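As a rough illustration of the urllib alternative mentioned above, the sketch below builds a request that carries the same User-Agent string as the Selenium script and decodes the response body. The names `USER_AGENT`, `build_request`, and `fetch_html` are hypothetical helpers introduced for this example. Whether CSDN returns the full article to a non-browser client is an assumption that has not been verified here.

```python
import urllib.request

# Same User-Agent string as in the Selenium script; the constant name is hypothetical.
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36'
)


def build_request(url: str) -> urllib.request.Request:
    """Create a GET request that sends a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download the page and decode it with the charset the server declares."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the Content-Type header names no charset.
        charset = resp.headers.get_content_charset() or 'utf-8'
        return resp.read().decode(charset, errors='replace')
```

The string returned by `fetch_html` could be passed to etree, BeautifulSoup, or Selector in place of `browser.page_source`, provided the content of interest is present in the raw HTML and not injected by JavaScript.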

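# -*- encoding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

from lxml import etree
from bs4 import BeautifulSoup
from scrapy import Selector


def selenium_test(url):
    # Configure a headless browser, its language, and its user agent
    # to reduce the chance of being flagged by anti-crawler checks
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--lang=zh-CN')
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36'
    # Chrome's command-line switch for overriding the user agent is --user-agent
    chrome_options.add_argument('--user-agent=' + user_agent)
    # Pass the options via options=; the older chrome_options= keyword is deprecated
    browser = webdriver.Chrome(options=chrome_options)
    browser.get(url)

    print('Extracting the title and time in 4 ways:')
    # Method 1: Selenium's own element lookup
    # (find_element_by_xpath is gone in recent Selenium 4 releases; use find_element with By.XPATH)
    title = browser.find_element(By.XPATH, '//h1[@class="title-article"]').text
    publish_time = browser.find_element(By.XPATH, '//div[@class="bar-content"]/span[@class="time"]').text
    print(' Method 1, Selenium element lookup: ', title, publish_time)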

    # Method 2: lxml etree, parsing the rendered page source
    selector = etree.HTML(browser.page_source)
    title = selector.xpath('//h1[@class="title-article"]/text()')[0]
    publish_time = selector.xpath('//div[@class="bar-content"]/span[@class="time"]/text()')[0]
    print(' Method 2, etree: ', title, publish_time)

    # Method 3: BeautifulSoup with the lxml parser
    soup = BeautifulSoup(browser.page_source, 'lxml')
    title = soup.find('h1', {'class': 'title-article'}).text
    publish_time = soup.find('div', {'class': 'bar-content'}).find('span', {'class': 'time'}).text
    print(' Method 3, BeautifulSoup: ', title, publish_time)

    # Method 4: the Selector class from the Scrapy framework
    selector = Selector(text=browser.page_source)
    title = selector.xpath('//h1[@class="title-article"]/text()').extract_first()
    publish_time = selector.xpath('//div[@class="bar-content"]/span[@class="time"]/text()').extract_first()
    # Collect the article body as HTML; article is None when the XPath matches nothing
    article_list = selector.xpath('//div[@class="markdown_views prism-atom-one-dark"]').extract()
    article = ''.join(article_list) if len(article_list) > 0 else None
    print(' Method 4, Scrapy Selector: ', title, publish_time)

    # The article can be saved locally and opened in a browser;
    # its structure matches the article as shown on the web page
    # with open('article.html', 'w', encoding='utf-8') as f:
    #     f.write(article)

    # Close the browser so the headless Chrome process does not linger
    browser.quit()


selenium_test(url='https://blog.csdn.net/cui_yonghua/article/details/90512943')

The output of the run is shown in the figure below:
[Figure: screenshot of the script output]
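The commented-out save step at the end of the script can be made a little more defensive. The following is a minimal sketch, assuming a hypothetical helper named `save_article` that is not part of the original script: it skips the write when nothing was extracted (the script sets `article` to `None` in that case) and writes UTF-8 explicitly so that Chinese text does not depend on the platform's default encoding.

```python
from pathlib import Path
from typing import Optional


def save_article(article: Optional[str], path: str = 'article.html') -> bool:
    """Write the extracted article HTML to disk.

    Returns False without touching the file system when there is nothing to save.
    """
    if not article:
        # Covers both None (XPath matched nothing) and an empty string.
        return False
    # Explicit UTF-8 keeps non-ASCII text intact regardless of the OS default encoding.
    Path(path).write_text(article, encoding='utf-8')
    return True
```

Inside `selenium_test`, a call such as `save_article(article)` could stand in for the commented-out `with open(...)` block.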

Source: https://blog.csdn.net/cui_yonghua/article/details/106765193
