你不知道的 Node.js 爬虫原来这么简单

2022-03-28 23:31:48 阅读：191 来源： 互联网

标签：Node function fs const res 爬虫 js html https

今天给大家带来的是node简单爬虫，对于前端小白也是非常好理解且会非常有成就感的小技能

爬虫的思路可以总结为：请求 url - > html（信息） -> 解析html

这篇文章呢，就带大家爬取豆瓣TOP250电影的信息

工具

爬虫必备工具：cheeriocheerio 简单介绍：cheerio 是 jquery 核心功能的一个快速灵活而又简洁的实现，主要是为了用在服务器端需要对 DOM 进行操作的地方。大家可以简单的理解为用来解析 html 非常方便的工具。使用之前只需要在终端安装即可 npm install cheerio

node爬虫步骤解析

一、选取网页url，使用http协议get到网页数据

豆瓣TOP250链接地址：https://movie.douban.com/top250

首先我们请求http协议，通过http来拿到网页的所有数据

const https = require('https');
https.get('https://movie.douban.com/top250',function(res){
    // 分段返回的 自己拼接
    let html = '';
    // 有数据产生的时候 拼接
    res.on('data',function(chunk){
        html += chunk;
    })
    // 拼接完成
    res.on('end',function(){
        console.log(html);
    })
})

上面代码呢，大家一定要注意我们请求数据时，拿到的数据是分段拿到的,我们需要通过自己把数据拼接起来

res.on('data',function(chunk){
        html += chunk;
    })

拼接完成时我们可以输出一下，看一下我们是否拿到了完整数据

res.on('end',function(){
        console.log(html);
    })

二、使用cheerio工具解析需要的内容

const cheerio = require('cheerio');
res.on('end',function(){
        console.log(html);
        const $ = cheerio.load(html);
        let allFilms = [];
        $('li .item').each(function(){
            // this 循环时 指向当前这个电影
            // 当前这个电影下面的title
            // 相当于this.querySelector 
            const title = $('.title', this).text();
            const star = $('.rating_num',this).text();
            const pic = $('.pic img',this).attr('src');
            // console.log(title,star,pic);
            // 存 数据库
            // 没有数据库存成一个json文件 fs
            allFilms.push({
                title,star,pic
            })
        })

可以通过检查网页源代码查看需要的内容在哪个标签下面，然后通过$符号来拿到需要的内容，这里我就拿了电影的名字、评分、电影图片

到了这时候，你会发现，node 爬虫实现是非常简单的，我们只需要认真分析一下我们拿到的 html 数据，将需要的内容拿出来保存在本地就基本完成了

保存数据

下面就是保存数据了，我将数据保存在 films.json 文件中将数据保存到文件中，我们引入一个fs模块，将数据写入文件中去

const fs = require('fs');
fs.writeFile('./films.json', JSON.stringify(allFilms),function(err){
            if(!err){
                console.log('文件写入完毕');
            }
        })

文件写入代码需要写在 res.on('end') 里面，数据读完->写入写入完成，可以查看一下 films.json，里面是有爬取的数据的。

下载图片

我们爬取的图片数据是图片地址，如果我们要将图片保存到本地呢？这时候只需要跟前面请求网页数据一样，把图片地址url请求回来，每一张图片写入到本地即可

function downloadImage(allFilms) {
    for (let i = 0; i < allFilms.length; i++) {
        const picUrl = allFilms[i].pic;
        // 请求 -> 拿到内容 // fs.writeFile('./xx.png','内容')
        https.get(picUrl, function (res) {
            res.setEncoding('binary');
            let str = '';
            res.on('data', function (chunk) {
                str += chunk;
            })
            res.on('end', function () {
                fs.writeFile(`./images/${i}.png`, str, 'binary', function (err) {
                    if (!err) {
                        console.log(`第${i}张图片下载成功`)
                    }
                })
            })
        })
    }
}

下载图片的步骤跟爬取网页数据的步骤是一模一样的，我们将图片的格式保存为.png写好了下载图片的函数，我们在 res.on('end') 里面调用一下函数就大功告成了

源码

// 请求 url - > html（信息）  -> 解析html
const https = require('https');
const cheerio = require('cheerio');
const fs = require('fs');
// 请求 top250
// 浏览器输入一个 url, get
https.get('https://movie.douban.com/top250',function(res){
    // console.log(res);
    // 分段返回的 自己拼接
    let html = '';
    // 有数据产生的时候 拼接
    res.on('data',function(chunk){
        html += chunk;
    })
    // 拼接完成
    res.on('end',function(){
        console.log(html);
        const $ = cheerio.load(html);
        let allFilms = [];
        $('li .item').each(function(){
            // this 循环时 指向当前这个电影
            // 当前这个电影下面的title
            // 相当于this.querySelector 
            const title = $('.title', this).text();
            const star = $('.rating_num',this).text();
            const pic = $('.pic img',this).attr('src');
            // console.log(title,star,pic);
            // 存 数据库
            // 没有数据库存成一个json文件 fs
            allFilms.push({
                title,star,pic
            })
        })
        // 把数组写入json里面
        fs.writeFile('./films.json', JSON.stringify(allFilms),function(err){
            if(!err){
                console.log('文件写入完毕');
            }
        })
        // 图片下载一下
        downloadImage(allFilms);
    })
})

function downloadImage(allFilms) {
    for(let i=0; i<allFilms.length; i++){
        const picUrl = allFilms[i].pic;
        // 请求 -> 拿到内容
        // fs.writeFile('./xx.png','内容')
        https.get(picUrl,function(res){
            res.setEncoding('binary');
            let str = '';
            res.on('data',function(chunk){
                str += chunk;
            })
            res.on('end',function(){
                fs.writeFile(`./images/${i}.png`,str,'binary',function(err){
                    if(!err){
                        console.log(`第${i}张图片下载成功`);
                    }
                })
            })
        })
    }
}

总结

　　爬虫不是只有 python 才行的，我们 node 也很方便简单，前端新手掌握一个小技能也是非常不错的，对自身的 node 学习有很大的帮助，本文的爬虫技巧只是入门，感兴趣小伙伴可以继续探究。

转自:微信公众号： Nodejs技术栈

https://mp.weixin.qq.com/s/HYISusLzIqdBWaBjojrpjg

标签：Node,function,fs,const,res,爬虫,js,html,https
来源： https://www.cnblogs.com/tjp40922/p/16069731.html

本站声明： 1. iCode9 技术分享网（下文简称本站）提供的所有内容，仅供技术学习、探讨和分享；
2. 关于本站的所有留言、评论、转载及引用，纯属内容发起人的个人观点，与本站观点和立场无关；
3. 关于本站的所有言论和文字，纯属内容发起人的个人观点，与本站观点和立场无关；
4. 本站文章均是网友提供，不完全保证技术分享内容的完整性、准确性、时效性、风险性和版权归属；如您发现该文章侵犯了您的权益，可联系我们第一时间进行删除；
5. 本站为非盈利性的个人网站，所有内容不会用来进行牟利，也不会利用任何形式的广告来间接获益，纯粹是为了广大技术爱好者提供技术内容和技术思想的分享性交流网站。

ICode9