Assignment 3
I. Assignment Content
-
Assignment ①:
-
Requirement: pick a website and crawl all of its images, e.g. the China Weather site (http://www.weather.com.cn). Implement both a single-threaded and a multi-threaded version. (The number of images crawled is capped at the last 3 digits of the student ID.)
-
Output:
Print the downloaded urls to the console, store the downloaded images in the result subfolder, and include screenshots.
-
Single-threaded implementation:
① Parse the page with BeautifulSoup:

```python
import urllib.request
from bs4 import BeautifulSoup, UnicodeDammit

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"}
req = urllib.request.Request(url, headers=headers)
data = urllib.request.urlopen(req)
data = data.read()
dammit = UnicodeDammit(data, ["utf-8", "gbk"])  # detect utf-8 / gbk encoding
data = dammit.unicode_markup
soup = BeautifulSoup(data, "lxml")
```
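The main program in step ④ calls a `getHTMLText` helper that is never shown in this write-up; presumably it wraps the request-and-decode steps above into one function. A minimal sketch under that assumption (the abbreviated User-Agent is a placeholder):

```python
import urllib.request
from bs4 import BeautifulSoup, UnicodeDammit

headers = {"User-Agent": "Mozilla/5.0"}  # abbreviated; use the full UA string in practice

def getHTMLText(url):
    # Fetch the page, guess the encoding (utf-8 or gbk), and return the parsed soup
    req = urllib.request.Request(url, headers=headers)
    raw = urllib.request.urlopen(req).read()
    dammit = UnicodeDammit(raw, ["utf-8", "gbk"])
    return BeautifulSoup(dammit.unicode_markup, "lxml")
```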
② Pagination function:
As the screenshot above shows, the city names and page links can be gathered as lists from the `dd a` nodes:

```python
def getNextPage(data, num):  # pagination
    urls = []   # urls of the city pages
    citys = []  # city names
    for i in range(num):
        city = data.select("dd a")[i].text    # city name
        url = data.select("dd a")[i]["href"]  # url that jumps to that city's page
        citys.append(city)
        urls.append(url)
    return citys, urls
```
③ Image-saving function:

```python
import requests

def save(path, url):
    """Save an image to disk.
    url:  image address
    path: destination file path
    """
    try:
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)
    except Exception as err:
        print(err)
```
④ Main program (the original checked `count <= 425` only once, before the loop; the cap is moved inside the loops so it actually stops at 425):

```python
Url = "http://www.weather.com.cn"  # home page
data = getHTMLText(Url)
num = 11
citys, urls = getNextPage(data, num)  # urls that jump to each city's page
count = 1
for i in range(len(urls)):  # visit every city page
    if count > 425:  # stop at 425 images (last 3 digits of the student ID)
        break
    url = urls[i]
    city = citys[i]
    print("Crawling weather images for " + str(city))
    data = getHTMLText(url)
    images = data.select("img")  # list of <img> nodes holding image urls
    for j in range(len(images)):
        if count > 425:
            break
        image_url = images[j]["src"]  # image url from the src attribute
        print(image_url)  # print the image url
        file = r"E:/PycharmProjects/DataCollecction/test/3/1.1_result/" + str(count) + r".jpg"  # destination path
        save(file, image_url)
        count += 1
print("Crawled %d images in total" % (count - 1))
⑤ Results:
-
Multi-threaded implementation:
① The imageSpider function (the original never appended to `urls`, so the duplicate check was a no-op; the append is added):

```python
def imageSpider(start_url):
    global threads
    global count
    try:
        urls = []
        req = urllib.request.Request(start_url, headers=headers)
        data = urllib.request.urlopen(req)
        data = data.read()
        dammit = UnicodeDammit(data, ["utf-8", "gbk"])
        data = dammit.unicode_markup
        soup = BeautifulSoup(data, "lxml")
        images = soup.select("img")  # list of <img> nodes holding image urls
        for image in images:
            try:
                src = image["src"]  # image url from the src attribute
                url = urllib.request.urljoin(start_url, src)
                if url not in urls:  # skip duplicates
                    urls.append(url)
                    print(url)
                    count = count + 1
                    T = threading.Thread(target=download, args=(url, count))
                    T.daemon = False
                    T.start()
                    threads.append(T)
            except Exception as err:
                print(err)
    except Exception as err:
        print(err)
```
② The download function:

```python
def download(url, count):
    try:
        if url[len(url) - 4] == ".":  # keep the extension if the url ends in one
            ext = url[len(url) - 4:]
        else:
            ext = ""
        req = urllib.request.Request(url, headers=headers)
        data = urllib.request.urlopen(req, timeout=100)
        data = data.read()
        fobj = open("E:/PycharmProjects/DataCollecction/test/3/1.2_result/" + str(count) + ext, "wb")  # target path and file name
        fobj.write(data)
        fobj.close()
        print("downloaded " + str(count) + ext)
    except Exception as err:
        print(err)
```
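The extension check above assumes a three-letter suffix, so it misses `.jpeg` and urls carrying query strings. A more robust stdlib alternative (a sketch, not the code used in the assignment; the example urls are made up) parses the url's path component:

```python
import os.path
from urllib.parse import urlparse

def guess_ext(url):
    # Take the extension from the path component only, ignoring any query string
    path = urlparse(url).path
    return os.path.splitext(path)[1]  # "" when there is no extension

print(guess_ext("http://www.weather.com.cn/i/pic/example.jpeg"))  # .jpeg
print(guess_ext("http://example.com/img.png?size=small"))         # .png
```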
③ Main program:

```python
start_url = "http://www.weather.com.cn"  # home page
# request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"}
count = 0  # number of images crawled
req = urllib.request.Request(start_url, headers=headers)
data = urllib.request.urlopen(req)
data = data.read()
dammit = UnicodeDammit(data, ["utf-8", "gbk"])
data = dammit.unicode_markup
soup = BeautifulSoup(data, "lxml")
num = 11
citys = soup.select("dd a")
urls = soup.select("dd a")
threads = []
for i in range(len(urls)):
    if count > 425:
        break
    url = urls[i]["href"]
    city = citys[i].text
    print("Crawling weather images for " + str(city))
    imageSpider(url)
for t in threads:
    t.join()
print("The End")
```
④ Results:
Because the downloads run on multiple threads, the images do not finish in crawl order.
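This out-of-order behavior can be reproduced without any network access. In the toy sketch below, each worker sleeps a random amount to mimic variable download times, then records its index; the completion order is usually a permutation of the start order:

```python
import random
import threading
import time

finished = []
lock = threading.Lock()

def download(i):
    time.sleep(random.random() * 0.05)  # simulate a variable network delay
    with lock:                          # serialize the shared-list append
        finished.append(i)

threads = [threading.Thread(target=download, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # same pattern as the crawler's final join loop
print(finished)  # usually not [0, 1, ..., 7]
```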
-
-
-
Assignment ②
-
Requirement: reproduce Assignment ① with the scrapy framework.
-
Output:
Same as Assignment ①.
-
Implementation:
① items.py:

```python
import scrapy

class Demo2Item(scrapy.Item):
    img_url = scrapy.Field()  # holds the image urls
```
② pipelines.py (plays the same role as the save function in Assignment ①):

```python
import urllib.request

class Demo2Pipeline:
    count = 0

    def process_item(self, item, spider):
        try:
            for url in item["img_url"]:
                print(url)
                if url[len(url) - 4] == ".":  # keep the extension if the url ends in one
                    ext = url[len(url) - 4:]
                else:
                    ext = ""
                req = urllib.request.Request(url)
                data = urllib.request.urlopen(req, timeout=100)
                data = data.read()
                fobj = open("E:/PycharmProjects/DataCollecction/test/3/demo_2/result/" + str(Demo2Pipeline.count) + ext, "wb")
                fobj.write(data)
                fobj.close()
                print("downloaded " + str(Demo2Pipeline.count) + ext)
                Demo2Pipeline.count += 1
        except Exception as err:
            print(err)
        return item
```
③ settings.py:

```python
DEFAULT_REQUEST_HEADERS = {  # request headers
    'accept': 'image/webp,*/*;q=0.8',
    'accept-language': 'zh-CN,zh;q=0.8',
    'referer': 'http://www.weather.com.cn',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"
}
```
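Note that the headers are only part of the configuration: the pipeline from step ② only runs if it is also registered in settings.py. Assuming the scrapy project is named demo_2, as the file paths above suggest, the registration would look like this (a config sketch):

```python
ITEM_PIPELINES = {
    'demo_2.pipelines.Demo2Pipeline': 300,  # lower number = earlier in the pipeline chain
}
```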
④ MySpider.py (plays the same role as the main program in Assignment ①, crawling the image information)
Extracting the image urls:

```python
dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
data = dammit.unicode_markup
selector = scrapy.Selector(text=data)
img_url = selector.xpath("//img/@src").extract()  # all image urls on the page
MySpider.count += len(img_url)
item = Demo2Item()
item["img_url"] = img_url
yield item
```
Pagination:

```python
if MySpider.count <= 425:
    url = selector.xpath("//dd/a/@href").extract()[MySpider.page]    # url of the next page
    city = selector.xpath("//dd/a/text()").extract()[MySpider.page]
    print("Crawling weather images for " + str(city))
    yield scrapy.Request(url=url, callback=self.parse)  # recurse via the callback
    MySpider.page += 1  # advance the page index
```
⑤ Results:
-
Reflections:
-
-
-
Assignment ③:
-
Requirement: use scrapy and XPath to crawl Douban movie data, store the content in a database, and save the cover images under the imgs directory.
-
Output:

| No. | Title | Director | Actors | Synopsis | Rating | Cover |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 肖申克的救赎 | 弗兰克·德拉邦特 | 蒂姆·罗宾斯 | 希望让人自由 | 9.7 | ./imgs/xsk.jpg |
| 2 | ... |  |  |  |  |  |
-
Implementation:
① items.py:

```python
import scrapy

class Demo3Item(scrapy.Item):
    num = scrapy.Field()           # index
    name = scrapy.Field()          # movie title
    director = scrapy.Field()      # director
    actor = scrapy.Field()         # actors
    introduction = scrapy.Field()  # synopsis
    score = scrapy.Field()         # rating
    cover = scrapy.Field()         # cover image url
```
② pipelines.py:
Storing the information in the database (the original snippet was cut off inside the try block; the closing except and return are restored):

```python
def open_spider(self, spider):
    print("opened")
    try:
        self.con = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                                   passwd="qwe1346790", db="mydb", charset="utf8")
        self.cursor = self.con.cursor(pymysql.cursors.DictCursor)
        self.cursor.execute("delete from movies")
        self.opened = True
        self.count = 0
    except Exception as err:
        print(err)
        self.opened = False

def close_spider(self, spider):
    if self.opened:
        self.con.commit()
        self.con.close()
        self.opened = False
    print("closed")
    print("Crawled", self.count, "movie records in total")

def process_item(self, item, spider):
    try:
        print("No.: ", end="")
        print(item["num"])
        print("Title: ", end="")
        print(item["name"])
        print("Director: ", end="")
        print(item["director"])
        print("Actors: ", end="")
        print(item["actor"])
        print("Synopsis: ", end="")
        print(item["introduction"])
        print("Rating: ", end="")
        print(item["score"])
        print("Cover: ", end="")
        print(item["cover"])
        print()
        if self.opened:
            self.cursor.execute(
                "insert into movies (num, name, director, actor, introduction, score, cover) "
                "values (%s, %s, %s, %s, %s, %s, %s)",
                (item["num"], item["name"], item["director"], item["actor"],
                 item["introduction"], item["score"], item["cover"]))
            self.count += 1
    except Exception as err:
        print(err)
    return item
```
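The insert logic can be exercised without a running MySQL server. This self-contained sketch replays the same statement against an in-memory sqlite3 database (sqlite uses `?` placeholders where pymysql uses `%s`), feeding it a sample item shaped like the expected output above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cursor = con.cursor()
cursor.execute("""create table movies (
    num text, name text, director text, actor text,
    introduction text, score text, cover text)""")

# A sample item shaped like Demo3Item, taken from the expected output table
item = {"num": "1", "name": "肖申克的救赎", "director": "弗兰克·德拉邦特",
        "actor": "蒂姆·罗宾斯", "introduction": "希望让人自由",
        "score": "9.7", "cover": "./imgs/xsk.jpg"}

cursor.execute("insert into movies (num, name, director, actor, introduction, score, cover) "
               "values (?, ?, ?, ?, ?, ?, ?)",
               (item["num"], item["name"], item["director"], item["actor"],
                item["introduction"], item["score"], item["cover"]))
con.commit()
row = cursor.execute("select name, score from movies").fetchone()
print(row)  # ('肖申克的救赎', '9.7')
```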
Saving the cover images locally (again with the truncated except/return restored):

```python
def process_item(self, item, spider):
    try:
        url = item["cover"]
        req = urllib.request.Request(url)
        data = urllib.request.urlopen(req, timeout=100)
        data = data.read()
        fobj = open("E:/PycharmProjects/DataCollecction/test/3/demo_3/result/" + str(self.count) + ".jpg", "wb")
        fobj.write(data)
        fobj.close()
        print("downloaded " + str(self.count) + ".jpg")
        self.count += 1
    except Exception as err:
        print(err)
    return item
```
③ settings.py:

```python
ITEM_PIPELINES = {
    'demo_3.pipelines.Demo3Pipeline': 300,
}
DEFAULT_REQUEST_HEADERS = {  # request headers
    'accept': 'image/webp,*/*;q=0.8',
    'accept-language': 'zh-CN,zh;q=0.8',
    'referer': 'https://movie.douban.com/top250',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
    'Cookie': 'll="108300"; bid=suhTDj9VAic; douban-fav-remind=1; __gads=ID=91b3b86e72bb1bdc-22f833a798cb00ff:T=1631273458:RT=1631273458:S=ALNI_Mb_chXR7p5DDTuCl3S6BX87GbcBTw; viewed="4006425"; gr_user_id=125cd979-ed5d-41e6-9bb0-f4e248d2faab; __utmz=30149280.1631273897.1.1.utmcsr=edu.cnblogs.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utmz=223695111.1635338543.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=30149280; __utmc=223695111; _pk_ses.100001.4cf6=*; __utma=30149280.2089450980.1631273897.1635427621.1635435215.5; __utmb=30149280.0.10.1635435215; __utma=223695111.1984107456.1635338543.1635427621.1635435215.4; __utmb=223695111.0.10.1635435215; ap_v=0,6.0; _pk_id.100001.4cf6=251801600b05f8b9.1635338543.4.1635437309.1635429350.',
}
```
④ MySpider.py:

```python
def start_requests(self):
    url = MySpider.source_url  # url of the first page
    yield scrapy.Request(url=url, callback=self.parse)  # hand off to parse
```
Parsing the page:

```python
dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
data = dammit.unicode_markup
selector = scrapy.Selector(text=data)
```
Finding the nodes that store the movies:

```python
lis = selector.xpath("//ol[@class='grid_view']/li")  # the nodes that hold each movie's info
```
Extracting the desired fields from each node:

```python
for li in lis:
    num = li.xpath("./div[@class='item']/div[@class='pic']/em/text()").extract_first()    # index
    name = li.xpath("./div[@class='item']/div[@class='pic']/a/img/@alt").extract_first()  # title
    # this single text node mixes the director and the actors
    source = li.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']/p[@class='']/text()").extract_first()
    if len(source.split(':')) > 2:  # more than 2 parts after splitting on ':' means the actor field is present
        director = source.split(':')[1].split('主演')[0].split(' ')[1].strip()  # director
        actor = source.split(':')[2].split('/')[0].split(' ')[1].strip()        # actors
    else:                           # 2 parts or fewer means the actor field is empty
        director = source.split(':')[1].split('/')[0].split(' ')[1].strip()
        actor = ' '
    # synopsis
    introduction = li.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']/p[@class='quote']/span/text()").extract_first()
    # rating
    score = li.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()").extract_first()
    # cover image url
    cover = li.xpath("./div[@class='item']/div[@class='pic']/a/img/@src").extract_first()
    item = Demo3Item()  # load the fields into the item
    item["num"] = num
    item["name"] = name
    item["director"] = director
    item["actor"] = actor
    item["introduction"] = introduction
    item["score"] = score
    item["cover"] = cover
    yield item
    MySpider.count += 1
```
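The node structure those XPath expressions assume can be checked outside the scrapy project with the stdlib's xml.etree.ElementTree, which supports the same `tag[@attr='value']` predicates. The document below is a hand-made stand-in mirroring that structure, not real Douban markup:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for one entry of the list page (structure assumed from the XPath above)
html = """
<div>
  <ol class="grid_view">
    <li>
      <div class="item">
        <div class="pic"><em>1</em>
          <a><img alt="肖申克的救赎" src="./imgs/xsk.jpg"/></a>
        </div>
      </div>
    </li>
  </ol>
</div>
"""
root = ET.fromstring(html)
lis = root.findall(".//ol[@class='grid_view']/li")  # same shape as the scrapy query
for li in lis:
    num = li.find("./div[@class='item']/div[@class='pic']/em").text
    img = li.find("./div[@class='item']/div[@class='pic']/a/img")
    print(num, img.get("alt"), img.get("src"))  # 1 肖申克的救赎 ./imgs/xsk.jpg
```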
Pagination:
As the screenshot above shows, concatenating the next page's href onto the home page url yields the full url (a stray comment marker in the original snippet disabled the `next_url` line; it is restored here):

```python
next_link = selector.xpath("//div[@class='article']/div[@class='paginator']/a/@href").extract()
next_url = MySpider.source_url + next_link[MySpider.page - 1]
if MySpider.page < 5:
    print("Now crawling page " + str(MySpider.page) + ".......")
    print(next_url)
    yield scrapy.Request(url=next_url, callback=self.parse)  # recurse via the callback
    MySpider.page += 1  # advance the page index
```
-
Reflections:
The main difficulty in Assignment ③ was that the director and actor information sit in a single text node, so the string has to be split up.
First, the string containing this text is located with an XPath query:

```python
source = li.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']/p[@class='']/text()").extract_first()
```

On closer inspection, some movies list their actors while others do not,
so the two cases need separate handling:
splitting source on ':' yields more than 2 parts when actors are present, and 2 or fewer when they are absent.

```python
if len(source.split(':')) > 2:  # more than 2 parts: the actor field is present
    director = source.split(':')[1].split('主演')[0].split(' ')[1].strip()  # director
    actor = source.split(':')[2].split('/')[0].split(' ')[1].strip()        # actors
else:                           # 2 parts or fewer: the actor field is empty
    director = source.split(':')[1].split('/')[0].split(' ')[1].strip()
    actor = ' '
```

Note: split(' ')[1] separates the Chinese name from the English name and keeps the Chinese one.
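The whole chain of splits can be traced on a hand-written sample line (the sample is an assumed approximation of the Douban text node's format, not captured from the site):

```python
# Assumed sample of the mixed director/actor text node
source = "导演: 弗兰克·德拉邦特 Frank Darabont   主演: 蒂姆·罗宾斯 Tim Robbins /..."

parts = source.split(':')
if len(parts) > 2:  # an actor section is present
    director = parts[1].split('主演')[0].split(' ')[1].strip()
    actor = parts[2].split('/')[0].split(' ')[1].strip()
else:               # no actor section
    director = parts[1].split('/')[0].split(' ')[1].strip()
    actor = ' '
print(director)  # 弗兰克·德拉邦特
print(actor)     # 蒂姆·罗宾斯
```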
Appendix: full code link
-
-
Source: https://www.cnblogs.com/chihiro20/p/15490688.html