Tags: name, img, url, self, crawling, item, scrapy, images
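- First, create the project (the code below assumes it is named img_downloads):
--scrapy startproject projectname
- Then create the spider file, passing the spider name and the domain to crawl:
--scrapy genspider spidername pic.netbian.com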
Next, open the project's settings.py and set the request-related options:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'
We also need to set up a directory for storing the downloaded images.
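The post does not show how that directory is configured. Scrapy's ImagesPipeline reads it from the IMAGES_STORE setting, and the pipeline also needs the Pillow package installed. A minimal sketch of the settings.py entry, assuming the images should go into a folder called imgs (the folder name is a made-up example, not taken from the original):

```python
# settings.py (fragment)
# Directory where ImagesPipeline writes the downloaded files.
# './imgs' is an assumed example path; a relative path is resolved
# against the directory from which the crawl is started.
IMAGES_STORE = './imgs'
```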
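Start with the spider file under spiders/ (the modules used are basic, so their usage is not explained here). It imports ImgDownloadsItem from items.py, which has to declare the two fields the spider fills in, img_src and img_name, each as scrapy.Field().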
import scrapy

from img_downloads.items import ImgDownloadsItem


class ImgDownSpider(scrapy.Spider):
    name = 'img_down'
    # allowed_domains = ['pic.netbian.com']
    start_urls = ['http://pic.netbian.com/new/index.html']
    # URL template for the list pages after the first one
    url_model = 'http://pic.netbian.com/new/index_%s.html'
    page_num = 2

    def parse(self, response):
        # Collect the detail-page links on the current list page
        detail_list = response.xpath('//*[@id="main"]/div[3]/ul/li/a/@href').extract()
        for detail in detail_list:
            detail_url = 'http://pic.netbian.com' + detail
            item = ImgDownloadsItem()
            # print(detail_url)
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})
        # Follow the next list page; with page_num = 2 only page 2 is requested
        if self.page_num < 3:
            new_url = self.url_model % self.page_num
            print(new_url)
            self.page_num += 1
            yield scrapy.Request(new_url, callback=self.parse)

    def parse_detail(self, response):
        # Image path and title on the detail page
        upload_img_url_list = response.xpath('//*[@id="img"]/img/@src').extract()
        upload_img_name = response.xpath('//*[@id="img"]/img/@title').extract_first()
        item = response.meta['item']
        for upload_img_url in upload_img_url_list:
            img_url = 'http://pic.netbian.com' + upload_img_url
            # print(img_url)
            item['img_src'] = img_url
            item['img_name'] = upload_img_name + '.jpg'
            yield item
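The spider builds absolute URLs by gluing the host onto each extracted href, and builds list-page URLs by filling a page number into url_model. As a rough standalone sketch of those two steps using only the standard library (the helper names page_urls and absolute_url and the sample href are made up for illustration; inside a spider, response.urljoin(href) performs the same joining):

```python
from urllib.parse import urljoin

BASE_URL = 'http://pic.netbian.com/new/index.html'
URL_MODEL = 'http://pic.netbian.com/new/index_%s.html'


def page_urls(last_page):
    """Return list-page URLs for pages 1..last_page (page 1 has no number suffix)."""
    urls = [BASE_URL]
    for page_num in range(2, last_page + 1):
        urls.append(URL_MODEL % page_num)
    return urls


def absolute_url(page_url, href):
    """Resolve an href against the page it was found on (relative or absolute)."""
    return urljoin(page_url, href)


if __name__ == '__main__':
    print(page_urls(2))
    # '/tupian/0000.html' is a hypothetical href used only as sample input
    print(absolute_url(BASE_URL, '/tupian/0000.html'))
```

Joining against the page URL rather than a hard-coded host keeps the code working if the site ever emits absolute links or links to a different host.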
Next, open the pipelines file (pipelines.py):
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class ImgDownloadsPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image and carry the item along so file_path can read the name
        yield scrapy.Request(url=item['img_src'], meta={'item': item})

    # Only the image file name needs to be returned
    def file_path(self, request, response=None, info=None, *, item=None):
        # Scrapy 2.4+ also passes the item as a keyword argument
        item = request.meta['item']
        img_name = item['img_name']
        print('Saving image as', img_name)
        return img_name

    def item_completed(self, results, item, info):
        return item
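file_path returns the page title as the file name, with .jpg appended by the spider. Titles can contain characters that are awkward in file names, so one possible refinement is to clean the name first and take the extension from the image URL. A minimal sketch under those assumptions (safe_image_name is a hypothetical helper, not part of Scrapy or the original post):

```python
import posixpath
import re
from urllib.parse import urlparse


def safe_image_name(title, img_url, default_ext='.jpg'):
    """Build a file name from the image title, keeping the URL's extension."""
    # Replace path separators, whitespace and other characters file systems reject.
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', title or '').strip('_')
    if not cleaned:
        cleaned = 'untitled'
    # Take the extension from the URL path, falling back to default_ext.
    ext = posixpath.splitext(urlparse(img_url).path)[1] or default_ext
    return cleaned + ext
```

If the spider stored the bare title in img_name instead of appending .jpg itself, file_path could return safe_image_name(item['img_name'], request.url).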
Finally, go back to settings.py and enable the pipeline:
ITEM_PIPELINES = {
    'img_downloads.pipelines.ImgDownloadsPipeline': 300,
}
Source: https://www.cnblogs.com/ziweijun/p/13194310.html