Scraping Product Data from JD.com

2020-03-19 23:53:42



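Honestly, if you leave anti-scraping measures aside, plain web scraping is not technically demanding. This post is just a short memo of how I scraped the data this time. During the COVID-19 period I could not go to the office and worked remotely from home, doing research and preparation work. Part of that was product market research, and there the data has to do the talking.

I mainly scraped JD.com. Earlier I had also scraped Tmall and Taobao. (Alibaba's anti-scraping is quite strong: the slider verification started appearing frequently and I could not get around it. Even when the program detected the slider and I completed it by hand, the verification still failed. I have not had time to dig into it and do not know how to bypass it. If you have been through this, please share some pointers.)

My crawler is built on selenium + httpclient + mysql.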

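  • selenium is an automated testing tool, and here it works well for automating clicks on page buttons. To be fair, you could skip selenium and do everything with jsoup. Doing everything with selenium, on the other hand, gets awkward in some scenarios: when the page is fully asynchronous, a simulated click can fail with an element-not-found error whether you locate the element by cssSelector or by xpath.
  • jsoup alone can solve the problem, but the crawler becomes more complex and you have to write a lot more code yourself.
  • So I ended up scraping with selenium and httpclient. selenium handles paging, because JD.com's product list pages follow a clear pattern. Both parameter-based paging (loading the list URL with page parameters through driver.get(url)) and simulated clicks on the "next page" link are easy, and because a real browser window opens for every page being scraped, you get a rough visual check as well. httpclient is used to fetch price and comment data: for prices it is only a fallback, while comment data comes entirely from httpclient.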
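The "clear pattern" of the list pages can be sketched in isolation. The snippet below only restates the rule used by getFullPageProducts later in this post (page = 2n - 1, s = 60(n - 1) + 1); the class and method names are hypothetical, and JD may have changed its paging parameters since the post was written.

```java
/** Sketch of the list-page URL rule; class and method names are hypothetical. */
final class JdPagingSketch {

    /** Builds the URL of the n-th list page as the crawler sees it (n starts at 1). */
    static String pageUrl(String searchUrl, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("page index starts at 1");
        }
        int page = 2 * n - 1;          // odd page numbers: 1, 3, 5, ... (as in the crawler)
        int start = 60 * (n - 1) + 1;  // offset of the first item: 1, 61, 121, ...
        return searchUrl + "&page=" + page + "&s=" + start;
    }

    public static void main(String[] args) {
        // The base URL here is only a placeholder for a real search URL
        String base = "https://search.jd.com/Search?keyword=test&enc=utf-8";
        for (int n = 1; n <= 3; n++) {
            System.out.println(pageUrl(base, n));
        }
    }
}
```

Each generated URL can then be handed to driver.get(url), which is what the crawler does.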

 

First create a maven project for the crawler, mainly to make pulling in the dependencies easy.

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.6</version>
</dependency>
<dependency>
    <groupId>c3p0</groupId>
    <artifactId>c3p0</artifactId>
    <version>0.9.1.2</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-chrome-driver</artifactId>
    <version>3.141.59</version>
</dependency>

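Selenium here drives a real browser, so I chose the client-side mode with the Chrome driver. That means you also need to download chromedriver, the native executable that controls Chrome, and point the webdriver.chrome.driver system property at it in the Java program. Selenium then talks to this driver to open pages and simulate operations in the browser (the driver runs as a separate process controlled over the WebDriver protocol rather than through JNI).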
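As a rough sketch of that configuration step, the driver path can be resolved per operating system before any browser is started. The paths are the ones used in getDriver() later in this post; the class and method names are hypothetical, and the file check is an addition for illustration.

```java
import java.io.File;
import java.util.Locale;

/** Sketch: choose the chromedriver path by OS; class and method names are hypothetical. */
final class ChromeDriverPathSketch {

    /** Returns the chromedriver location for the given OS name and project directory. */
    static String resolveDriverPath(String osName, String projectDir) {
        if (osName.toLowerCase(Locale.ROOT).startsWith("win")) {
            // Windows: driver shipped in a folder inside the project
            return projectDir + "\\chromedriver_win32\\chromedriver.exe";
        }
        // Other systems: assume a system-wide install
        return "/usr/bin/chromedriver";
    }

    public static void main(String[] args) {
        String path = resolveDriverPath(
                System.getProperty("os.name"), System.getProperty("user.dir"));
        // Selenium reads this property when the ChromeDriver is created
        System.setProperty("webdriver.chrome.driver", path);
        System.out.println(path + " exists: " + new File(path).isFile());
    }
}
```

Checking that the file exists before creating the driver gives a clearer message than the exception Selenium raises when the executable is missing.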

 

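The whole Java project is a very basic main program in an ordinary maven project; you can turn it into a web application if that suits your needs. Let's look at the selenium configuration first.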

JDSeleniumFullProxy
package com.shihuc.up.spider.jd.comment;

import com.google.common.collect.Lists;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.util.List;
import java.util.concurrent.TimeUnit;

public class JDSeleniumFullProxy {

    public static ChromeDriver driver;

    static {
        try {
            // Start the browser
            getDriver();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
    public static void main(String[] args) throws InterruptedException {
        getProductsWithFullScenario();

        Thread.sleep(10000);
        System.out.println("!!!!!!!==========Well Done===========!!!!!!");

        // Shut down the browser
        driver.quit();
    }

    private static void getProductsWithFullScenario() {
        String urls[] = new String[] {
                /* Search keyword: 车载手机支架 (in-car phone holder) */
                "https://search.jd.com/Search?keyword=%E8%BD%A6%E8%BD%BD%E6%89%8B%E6%9C%BA%E6%94%AF%E6%9E%B6&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&click=0"
        };
        String products[][] = new String[][] {
                {"jd_info_czsjzj", "jd_comment_czsjzj"}
        };
        int hmp = 40;

        JDProductDao productDao = new JDProductDao();

        // Scrape the required data
        for (int i=0; i < urls.length; i++) {
            JDSeleniumFullCrawler.getAllProducts(driver, hmp, urls[i], productDao, products[i]);
        }

        // Normalize price and sales figures (price ranges, and counts containing '万' or '+', are converted to numbers)
        for (int i=0; i<products.length; i++) {
            productDao.updateProductForPriceSells(products[i][0]);
        }
    }

    /**
     * Creates the ChromeDriver.
     * @throws InterruptedException declared for the login wait loop, which is currently commented out
     */
    private static void getDriver() throws InterruptedException{
        String os = System.getProperty("os.name");
        if (os.toLowerCase().startsWith("win")) {
            System.setProperty("webdriver.chrome.driver",
                    System.getProperty("user.dir") + "\\chromedriver_win32\\chromedriver.exe");
        } else {
            System.setProperty("webdriver.chrome.driver", "/usr/bin/chromedriver");
        }
        ChromeOptions options = new ChromeOptions();
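        // Hide the "Chrome is being controlled by automated test software" infobar
        options.addArguments("--disable-infobars");
        // Disable the same-origin policy checks (this flag does not control redirects)
        options.addArguments("--disable-web-security");
        // Start maximized
        options.addArguments("--start-maximized");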
        options.addArguments("--no-sandbox");
        List<String> excludeSwitches = Lists.newArrayList("enable-automation");
        options.setExperimentalOption("excludeSwitches", excludeSwitches);

        driver = new ChromeDriver(options);
        driver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
        //driver.get("https://passport.jd.com/new/login.aspx");

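/**
 * None of the slider-simulation approaches below worked. The only way past Taobao's anti-scraping
 * slider was to open the Taobao login page, switch manually to the Alipay login page,
 * and scan the QR code with the Alipay mobile app.
 */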
//        while(true) {
//            if(currentIsLoginPage()){
//                System.out.println("============>>>>");
//            }else {
//                System.out.println(">>>>>>OOOOOOOOOOO");
//                break;
//            }
//            Thread.sleep(2000);
//        }
    }

    private static boolean currentIsLoginPage() {
        // True while the browser is still on the JD login page
        return driver.getCurrentUrl().contains("https://passport.jd.com/new/login.aspx");
    }
}

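The webdriver.chrome.driver setting in getDriver() (highlighted in red in the original post) is the path to my Chrome driver: chromedriver.exe sits in the chromedriver_win32 folder inside my project. Adjust it to wherever you put the file after downloading it.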

 

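The program above could also simulate a login, but that is not needed: JD.com never requires a login for browsing products, unlike the Alibaba sites, which guard against scraping even when you are only browsing and pop up a login prompt every now and then. I find that rather annoying.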

 

Next comes the part that actually drives selenium and httpclient to scrape the data.

JDSeleniumFullCrawler
package com.shihuc.up.spider.jd.comment;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.shihuc.up.spider.jd.opt.JDPhoneHolder;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Set;

public class JDSeleniumFullCrawler {

    // Labels as they appear on the JD page (currently unused)
    private static final String COMMENT_TOTAL = "评论总数";
    private static final String COMMENT_GOOD = "好评数量";
    private static final String COMMENT_GENERAL = "中评数量";
    private static final String COMMENT_POOR = "差评数量";
    private static final String COMMENT_VIDEO = "视频晒单";
    private static final String COMMENT_AFTER = "追评数量";

    public static void getAllProducts(ChromeDriver driver, int howManyPages, String url, JDProductDao pdao, String []pname) {

        for (int i = 1; i <= howManyPages; i++) {
            getFullPageProducts(driver, i, url, pdao, pname);
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    public static void getFullPageProducts(ChromeDriver driver, int i, String rawUrl, JDProductDao pdao, String []pname) {
//        WebElement pageNumInput = driver.findElement(By.xpath("//*[@id=\"J_bottomPage\"]/span[2]/input"));
//        pageNumInput.clear();
//        pageNumInput.sendKeys(i + "");
//        WebElement searchSubmit = driver.findElement(By.xpath("//*[@id=\"J_bottomPage\"]/span[2]/a"));
//        searchSubmit.click();
        String url = rawUrl + "&page=" + (2*i - 1) + "&s=" + (60*(i-1) + 1);
        driver.get(url);
        getProductsProcess(driver, pdao, pname);
    }
    private static void getProductsProcess(ChromeDriver driver, JDProductDao pdao, String []pname) {
        List<WebElement> itemElements = driver.findElements(By.cssSelector("#J_goodsList .gl-item"));
        System.out.println(itemElements.size());
        String mainHandle = driver.getWindowHandle();
        String href = null;
        for(WebElement we: itemElements) {
            try {
                String weId = we.getAttribute("data-pid");
                //WebElement weHref = we.findElement(By.cssSelector(".p-name a"));
                WebElement weHref = we.findElement(By.cssSelector(".p-img a"));
                //href = weHref.getAttribute("href");
                href = "https://item.jd.com/" + weId + ".html";

                // Price and comments usually cannot be read this way: the site renders them fully asynchronously
                String price = null;
                try {
                    WebElement wePrice = we.findElement(By.cssSelector(".p-price strong i"));
                    price = wePrice.getText();
                }catch (Exception ep) {
                    System.err.println("can not get the price information for pid " + weId + " ......");
                }
//                String sells = null;
//                try {
//                    WebElement weSells = we.findElement(By.cssSelector(".p-commit strong a"));
//                    sells = weSells.getText();
//                }catch (Exception ec) {
//                    System.err.println("can not get the comment information for pid " + weId + " ......");
//                }


                driver.executeScript("window.open(\"https://item.jd.com/" + weId + ".html\");");

                Set<String> handles = driver.getWindowHandles();
                String newHandle = "";
                for (String s : handles) {
                    if (s.equalsIgnoreCase(mainHandle)) {
                        continue;
                    }
                    newHandle = s;
                    break;
                }
                // Switch to the product detail window that was just opened
                driver.switchTo().window(newHandle);

                // Read the product details we care about from the current detail page
                JDProduct product = getJDProductInfos(driver);
                try {
                    if (price == null) {
                        price = JDPhoneHolder.getPrice(weId);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }

                product.setUrl(href);
                product.setPid(weId);
                product.setPrice(price);

                //JDComment comment = getCommentByCD(driver);
                JDComment comment = getCommentByPID(weId);
                product.setComment(comment);
                int rid = pdao.addProductInfoGenId(product, pname[0]);
                pdao.addProductComments2(product, rid, pname[1]);

                // Close the product detail window that was just processed
                closeAllOtherWindows(mainHandle, driver);
            }catch(Exception eal) {
                closeAllOtherWindows(mainHandle, driver);
                eal.printStackTrace();
                System.out.println(href);
            }
        }
    }

    public static JDProduct getJDProductInfoByUrl(WebDriver driver, String url, JDProduct product) {
        System.out.println("URL: " + url);
        driver.get(url);

        WebElement weComment = driver.findElement(By.cssSelector(".comment-count .count"));
        WebElement wePrice = driver.findElement(By.cssSelector(".summary-price .price"));

        String strComment = weComment.getText();
        if (strComment.equalsIgnoreCase("0")){
            try {
                strComment = JDPhoneHolder.getCommitCountNum(product.getPid()) + "";
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        String strPrice = wePrice.getText();
        product.setPrice(strPrice);

        return product;
    }

    public static JDProduct getJDProductInfos(WebDriver driver) {
        WebElement weTitle = driver.findElement(By.cssSelector(".w div.sku-name"));
        String title = weTitle.getText();

        /**
         * Read the product model information. I found XPath lookups to perform much better than cssSelector here.
         */
        WebElement weBrand = driver.findElement(By.xpath(".//*[@id=\"parameter-brand\"]/li/a"));
        String brand = weBrand.getText();

        WebElement weName = driver.findElement(By.xpath(".//*[@id=\"detail\"]/div[2]/div[1]/div[1]/ul[2]/li[1]"));
        String name = weName.getText();
        name = name.replace("商品名称:","").trim();

        JDProduct product = new JDProduct();
        product.setBrand(brand);
        product.setPname(name);
        product.setTitle(title);
        return product;
    }

    public static JDComment getCommentByPID(String pid) {
        JDComment comments = new JDComment();
        HashMap<String, Integer> groups = new HashMap<>();
        try {
            JSONObject commentJson =JDPhoneHolder.getComments(pid);
            JSONObject productCommentSummary = commentJson.getJSONObject("productCommentSummary");
            // Positive-review rate
            int goodRateShow = productCommentSummary.getInteger("goodRateShow");
            comments.setGoodRate(goodRateShow);

            // Total number of comments
            int commentCount = productCommentSummary.getInteger("commentCount");
            comments.setTotalc(commentCount);
            // Number of positive reviews
            int goodCount = productCommentSummary.getInteger("goodCount");
            comments.setGoodc(goodCount);
            // Number of neutral reviews
            int generalCount = productCommentSummary.getInteger("generalCount");
            comments.setGeneralc(generalCount);
            // Number of negative reviews
            int poorCount = productCommentSummary.getInteger("poorCount");
            comments.setPoorc(poorCount);
            // Number of video reviews
            int videoCount = productCommentSummary.getInteger("videoCount");
            comments.setVideoc(videoCount);
            // Number of follow-up reviews
            int afterCount = productCommentSummary.getInteger("afterCount");
            comments.setAfterc(afterCount);

            JSONArray hotCommentTagStatistics = commentJson.getJSONArray("hotCommentTagStatistics");
            for (int i=0; i<hotCommentTagStatistics.size(); i++){
                JSONObject hotComment = hotCommentTagStatistics.getJSONObject(i);
                String name = hotComment.getString("name");
                int count = hotComment.getInteger("count");
                groups.put(name, count);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        comments.setCommentGroups(groups);
        return comments;
    }

    public static JDComment getCommentByCD(ChromeDriver driver) {
        JDComment comment = new JDComment();
        WebElement weCommentTab = driver.findElement(By.xpath("//*[@id=\"detail\"]/div[1]/ul/li[5]"));
        weCommentTab.click();
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        WebElement weGoodRate = driver.findElement(By.cssSelector(".comment-percent .percent-con"));
        String goodRate = weGoodRate.getText();
        int len = goodRate.length();
        if (len > 1) {
            goodRate = goodRate.substring(0, len - 1);
        }
        int rate = Integer.valueOf(goodRate);

        List<WebElement> weGroupList = driver.findElements(By.cssSelector(".J-comment-info .percent-info .tag-list .tag-1"));
        HashMap<String, Integer> groups = new HashMap<>();
        for (WebElement we: weGroupList) {
            String rawGroup = we.getText();
            splitDescInfo(rawGroup, groups);
        }

        List<WebElement> weLevelList = driver.findElements(By.cssSelector(".J-comments-list .filter-list li"));
        HashMap<String, Integer> levels = new HashMap<>();
        for (WebElement we: weLevelList) {
            WebElement weLevel = we.findElement(By.cssSelector("a"));
            if (containsDatatab(weLevel)){
                //TODO
//                String rawLevel = weLevel.getText();
//                splitDescInfo(rawLevel, levels);
            }
        }
        comment.setGoodRate(rate);
        comment.setCommentGroups(groups);
        return comment;
    }

    private static boolean containsDatatab(WebElement we){
        // getAttribute returns null for a missing attribute instead of throwing
        try {
            return we.getAttribute("data-tab") != null;
        }catch (Exception e){
            return false;
        }
    }

    private static void splitDescInfo(String desc, HashMap<String, Integer> map) {
        // Expected format: "label(count)"; skip text that has no opening parenthesis
        int parenIdx = desc.indexOf("(");
        if (parenIdx < 0) {
            return;
        }
        String context = desc.substring(0, parenIdx);
        String strCount = desc.substring(parenIdx + 1, desc.length() - 1);
        float count = getRealCount(strCount);
        map.put(context, (int)count);
    }

    private static float getRealCount(String rawCount) {
        float realCount;
        if (rawCount.contains("万")){
            int wanIdx = rawCount.indexOf("万");
            String strRealCount = rawCount.substring(0, wanIdx);
            realCount = Float.valueOf(strRealCount) * 10000;
        }else if (rawCount.contains("+")){
            int plusIdx = rawCount.indexOf("+");
            String strRealCount = rawCount.substring(0, plusIdx);
            realCount = Integer.valueOf(strRealCount);
        }else{
            realCount = Integer.valueOf(rawCount);
        }
        return realCount;
    }

    private static void closeAllOtherWindows(String main, ChromeDriver driver) {
        Set<String> handles = driver.getWindowHandles();
        System.out.println("------->main: " + main);
        Object []hs = handles.toArray();
        for (int i = hs.length - 1; i>0; i--) {
            System.out.println("-------->child: " + hs[i]);
            driver.switchTo().window(hs[i].toString());
            driver.close();
        }
        driver.switchTo().window(main);
    }
}

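The key point in this class is the window-switching logic. If it is wrong, the page you want to work on and the page handle the driver currently points to may not be the same thing, which leads to element-not-found errors. This is a common mistake, so window handles must be managed carefully, and a page is best closed once it has been scraped. (When selenium opens pages, it records their handles in order in an ordered LinkedHashSet, so handles of pages opened later sit towards the end of the set. Converting the Set to an array gives a simple way to implement the closing logic.) This works because the scraping scenario is simple: it only switches between the list page and detail pages.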
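The handle bookkeeping can be sketched without a browser. The version below is a variation that does not rely on the main window being first in the set; it only assumes that handles are plain strings, as returned by driver.getWindowHandles(). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: decide which window handles to close; names are hypothetical. */
final class WindowHandleSketch {

    /** Returns every handle except the main one, most recently opened first. */
    static List<String> handlesToClose(Set<String> allHandles, String mainHandle) {
        List<String> toClose = new ArrayList<>();
        for (String handle : allHandles) {
            if (!handle.equals(mainHandle)) {
                toClose.add(handle);
            }
        }
        Collections.reverse(toClose); // iteration order is opening order, so reverse it
        return toClose;
    }

    public static void main(String[] args) {
        // Made-up handle values standing in for real window handles
        Set<String> handles = new LinkedHashSet<>();
        handles.add("list-page");
        handles.add("detail-1");
        handles.add("detail-2");
        System.out.println(handlesToClose(handles, "list-page"));
    }
}
```

With a real driver, each returned handle would be passed to driver.switchTo().window(handle) followed by driver.close(), and the driver would finally be switched back to the main handle.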

 

Next is writing the scraped data to the database. I use Spring's plain JdbcTemplate for this. It is not as powerful as MyBatis, but it is enough for storing some scraped data.

JDProductDao
package com.shihuc.up.spider.jd.comment;

import com.mchange.v2.c3p0.ComboPooledDataSource;
import org.openqa.selenium.chrome.ChromeDriver;
import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.PreparedStatementCreator;
import org.springframework.jdbc.core.RowCallbackHandler;
import org.springframework.jdbc.support.GeneratedKeyHolder;
import org.springframework.jdbc.support.KeyHolder;

import java.beans.PropertyVetoException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class JDProductDao extends JdbcTemplate{

    public JDProductDao(){

        // Set up the c3p0 connection pool
        ComboPooledDataSource ds = new ComboPooledDataSource();
        try {
            ds.setDriverClass("com.mysql.jdbc.Driver");
            ds.setUser("root");
            ds.setPassword("shihuc");
            ds.setJdbcUrl("jdbc:mysql://localhost:3306/nav?characterEncoding=utf-8");
        } catch (PropertyVetoException e) {
            e.printStackTrace();
        }
        super.setDataSource(ds);
    }

    public int addProductInfoGenId(JDProduct product, String shop) {
        KeyHolder keyHolder = new GeneratedKeyHolder();
        JDComment comment = product.getComment();
        super.update(new PreparedStatementCreator(){
            final String sql="insert into good_holder_" + shop +
                    " (pid,title,brand,pname,price,url, goodrate,totalc,goodc,generalc,poorc,videoc,afterc)" +
                    " values (?,?,?,?,?,?,?,?,?,?,?,?,?)";
            public PreparedStatement createPreparedStatement(java.sql.Connection conn) throws SQLException{
                PreparedStatement ps = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                ps.setString(1, product.getPid());
                ps.setString(2, product.getTitle());
                ps.setString(3, product.getBrand());
                ps.setString(4, product.getPname());
                ps.setString(5, product.getPrice());
                ps.setString(6, product.getUrl());

                ps.setInt(7, comment.getGoodRate());
                ps.setInt(8, comment.getTotalc());
                ps.setInt(9, comment.getGoodc());
                ps.setInt(10, comment.getGeneralc());
                ps.setInt(11, comment.getPoorc());
                ps.setInt(12, comment.getVideoc());
                ps.setInt(13, comment.getAfterc());
                return ps;
            }
        },keyHolder);
        return keyHolder.getKey().intValue();
    }

    public void addProductComments(JDProduct product, int rid, String shop) {
        final String sql="insert into good_holder_" + shop + " (rid,info,count) values (?,?,?)";
        List<Object []> comments = transformCommentsToObjects(rid, product.getComment());
        super.batchUpdate(sql, new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i)
                    throws SQLException {
                ps.setInt(1, (Integer) comments.get(i)[0]);
                ps.setString(2, (String)comments.get(i)[1]);
                ps.setInt(3, (Integer) comments.get(i)[2]);
            }
            @Override
            public int getBatchSize() {
                return comments.size();
            }
        });
    }

    public void addProductComments2(JDProduct product, int rid, String shop) {
        final String sql="insert into good_holder_" + shop + " (rid,info,count) values (?,?,?)";
        List<Object []> comments = transformCommentsToObjects(rid, product.getComment());
        super.batchUpdate(sql, comments);
    }
    private List<Object[]> transformCommentsToObjects(int rid, JDComment comments) {
        List<Object[]> list = new ArrayList<>();
        Object[] object = null;
        HashMap<String, Integer> groups = comments.getCommentGroups();
        for(String group: groups.keySet()){
            object = new Object[]{
                    rid,
                    group,
                    groups.get(group),
            };
            list.add(object);
        }
        return list ;
    }

    public List<JDProduct> updateProductForPriceSells(String tableIdx) {
        // Query the data and process the result set with a RowCallbackHandler
        String sql = "select id, pid, price from good_holder_" + tableIdx;
        final JDProduct product = new JDProduct();

        List<JDProduct> nokProducts = new ArrayList<>();

        // Copy the columns of each result row into the product object
        super.query(sql, new Object[]{}, new RowCallbackHandler() {
            public void processRow(ResultSet rs) throws SQLException {
                product.setId(rs.getInt("id"));
                product.setPid(rs.getString("pid"));
                product.setPrice(rs.getString("price"));

                dataProcess(product, tableIdx);
            }
        });
        return nokProducts;
    }

    public void updateNokProductForPriceSells(String tableIdx, ChromeDriver driver) {
        // Query the data and process the result set with a RowCallbackHandler
        String sql = "select id, url, price from good_holder_" + tableIdx;
        final JDProduct product = new JDProduct();

        // Copy the columns of each result row into the product object
        super.query(sql, new Object[]{}, new RowCallbackHandler() {
            public void processRow(ResultSet rs) throws SQLException {
                product.setId(rs.getInt("id"));
                product.setUrl(rs.getString("url"));
                product.setPrice(rs.getString("price"));

                if(isNokProduct(product, tableIdx)){
                    JDProduct pd = JDSeleniumFullCrawler.getJDProductInfoByUrl(driver, product.getUrl(), product);
                    reSetPriceOrSells(product.getId(), tableIdx, pd.getPrice());
                }
            }
        });
    }

    public boolean isNokProduct(JDProduct product, String tableIdx){
        String price = product.getPrice();
        String url = product.getUrl();
        if (price.equalsIgnoreCase("")) {
            System.out.println("good_holder_" + tableIdx + ", id=" + product.getId() + " data is not ok");

            if (url != null && !url.equalsIgnoreCase("")){
                return true;
            }
        }
        return false;
    }

    public void dataProcess(JDProduct product, String tableIdx) {
        String price = product.getPrice();
        double dlow = 0 , dhigh=0;
        if (price.equalsIgnoreCase("")) {
            System.out.println("good_holder_" + tableIdx + ", id=" + product.getId() + " data is not ok");
            return;
        }
        String low = "0", high = "0";
        if (price.contains("-")){
            int idx = price.indexOf("-");
            low = price.substring(0, idx);
            high = price.substring(idx+1);
        }else{
            low = price;
            high = price;
        }
        dlow = Double.valueOf(low);
        dhigh = Double.valueOf(high);

//        String countReg = "^[1-9][0-9]*";
//        Pattern p = Pattern.compile(countReg);
//        Matcher m = p.matcher(sells);
//        if (m.find()){
//            String sc = m.group();
//            sellCount = Integer.valueOf(sc);
//        }

        updateProductPriceSell(product.getId(), tableIdx, dlow, dhigh);
    }

    public void updateProductPriceSell(int id, String tableIdx, double priceLow, double priceHigh) {
        String sql = "update good_holder_" + tableIdx + " set priceLow=?,priceHigh=? where id=?";
        int rows = super.update(sql, priceLow, priceHigh,id);
        System.out.println(rows);
    }

    public void reSetPriceOrSells(int id, String tableIdx, String price) {
        String sql = "update good_holder_" + tableIdx + " set price=? where id=?";
        int rows = super.update(sql, price, id);
        System.out.println(rows);
    }
}
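The price normalization done by dataProcess above can be tried out in isolation. This is a minimal sketch under the assumption that a price string is either a single number or a "low-high" range; the class and method names are hypothetical. Treating a leading minus sign as a sign rather than as a range separator is an addition that is not in the original code.

```java
/** Sketch: split a price string into low and high values; names are hypothetical. */
final class PriceRangeSketch {

    /** Returns {low, high}, or null when the price text is missing or empty. */
    static double[] parseRange(String price) {
        if (price == null || price.trim().isEmpty()) {
            return null;
        }
        String p = price.trim();
        // Search from index 1 so that a leading '-' counts as a sign
        int idx = p.indexOf('-', 1);
        if (idx < 0) {
            double value = Double.parseDouble(p);
            return new double[] {value, value};
        }
        double low = Double.parseDouble(p.substring(0, idx).trim());
        double high = Double.parseDouble(p.substring(idx + 1).trim());
        return new double[] {low, high};
    }

    public static void main(String[] args) {
        // Made-up sample values
        for (String s : new String[] {"35.00", "12.50-19.90", ""}) {
            double[] r = parseRange(s);
            System.out.println("'" + s + "' -> "
                    + (r == null ? "no data" : r[0] + " .. " + r[1]));
        }
    }
}
```

The two values map directly onto the priceLow and priceHigh columns that updateProductPriceSell writes.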

 

Below are the model classes for product information and comment information.

JDProduct
package com.shihuc.up.spider.jd.comment;
public class JDProduct {

    private int id;
    private String pid;
    private String title;
    private String brand;
    private String pname;
    private String price;
    private String url;
    private double priceHigh;
    private double priceLow;

    private JDComment comment;

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getBrand() {
        return brand;
    }

    public void setBrand(String brand) {
        this.brand = brand;
    }

    public String getPname() {
        return pname;
    }

    public void setPname(String pname) {
        this.pname = pname;
    }

    public String getPrice() {
        return price;
    }

    public void setPrice(String price) {
        this.price = price;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public double getPriceHigh() {
        return priceHigh;
    }

    public void setPriceHigh(double priceHigh) {
        this.priceHigh = priceHigh;
    }

    public double getPriceLow() {
        return priceLow;
    }

    public void setPriceLow(double priceLow) {
        this.priceLow = priceLow;
    }

    public JDComment getComment() {
        return comment;
    }

    public void setComment(JDComment comment) {
        this.comment = comment;
    }


    @Override
    public String toString() {
        return "Product{" +
                "pid=" + pid +
                ", title='" + title + '\'' +
                ", brand='" + brand + '\'' +
                ", pname='" + pname + '\'' +
                ", price='" + price + '\'' +
                '}';
    }
}

 

JDComment
package com.shihuc.up.spider.jd.comment;

import java.util.HashMap;

public class JDComment {
    private Integer goodRate;
    /**
     * Comment categories (tags) and the number of comments in each
     */
    private HashMap<String, Integer> commentGroups;

    // Tmall gives sales figures; Taobao, like JD, gives cumulative comment counts
    private int totalc;
    private int goodc;
    private int generalc;
    private int poorc;
    private int videoc;
    private int afterc;

    public Integer getGoodRate() {
        return goodRate;
    }

    public void setGoodRate(Integer goodRate) {
        this.goodRate = goodRate;
    }

    public HashMap<String, Integer> getCommentGroups() {
        return commentGroups;
    }

    public void setCommentGroups(HashMap<String, Integer> commentGroups) {
        this.commentGroups = commentGroups;
    }

    public int getTotalc() {
        return totalc;
    }

    public void setTotalc(int totalc) {
        this.totalc = totalc;
    }

    public int getGoodc() {
        return goodc;
    }

    public void setGoodc(int goodc) {
        this.goodc = goodc;
    }

    public int getGeneralc() {
        return generalc;
    }

    public void setGeneralc(int generalc) {
        this.generalc = generalc;
    }

    public int getPoorc() {
        return poorc;
    }

    public void setPoorc(int poorc) {
        this.poorc = poorc;
    }

    public int getVideoc() {
        return videoc;
    }

    public void setVideoc(int videoc) {
        this.videoc = videoc;
    }

    public int getAfterc() {
        return afterc;
    }

    public void setAfterc(int afterc) {
        this.afterc = afterc;
    }
}

 

One more thing to add here: the utility class that uses httpclient to fetch pages for the price and comment data.

HttpClientUtils
package com.shihuc.up.spider;

import com.shihuc.up.spider.jd.opt.JDPhoneHolder;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HttpClientUtils {

    // Create the httpclient connection pool
    private static PoolingHttpClientConnectionManager connectionManager;
    static{
        connectionManager=new PoolingHttpClientConnectionManager();
        // Maximum number of connections in the pool
        connectionManager.setMaxTotal(200);
        // At most 20 connections per route (target host)
        connectionManager.setDefaultMaxPerRoute(20);
    }

    private static CloseableHttpClient getCloseableHttpClient(){
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(connectionManager).build();
        return httpClient;
    }

    private static String execute(HttpRequestBase httpRequestBase) throws IOException {
        httpRequestBase.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");

        // Set the timeouts (milliseconds)
        RequestConfig config = RequestConfig.custom().setConnectionRequestTimeout(10000).setConnectTimeout(10000).setSocketTimeout(15 * 1000).build();

        httpRequestBase.setConfig(config);
        CloseableHttpClient httpClient = getCloseableHttpClient();
        // Close the response so the connection is released even if reading the body fails
        try (CloseableHttpResponse response = httpClient.execute(httpRequestBase)) {
            return EntityUtils.toString(response.getEntity(), "utf-8");
        }
    }

    private static String executeReferer(HttpRequestBase httpRequestBase, String referer) throws IOException {
        httpRequestBase.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");
        httpRequestBase.setHeader("Referer", referer);
        httpRequestBase.setHeader("Sec-Fetch-Mode", "no-cors");

        // Set the timeouts (milliseconds)
        RequestConfig config = RequestConfig.custom().setConnectionRequestTimeout(60000).setConnectTimeout(60000).setSocketTimeout(10 * 10000).build();

        httpRequestBase.setConfig(config);
        CloseableHttpClient httpClient = getCloseableHttpClient();
        // Close the response so the connection is released even if reading the body fails
        try (CloseableHttpResponse response = httpClient.execute(httpRequestBase)) {
            return EntityUtils.toString(response.getEntity(), "utf-8");
        }
    }

    public static String doGet(String url) throws IOException {
        HttpGet httpGet = new HttpGet(url);
        String html = execute(httpGet);
        return html;
    }

    public static String doGetReferer(String url, String referer) throws IOException {
        HttpGet httpGet = new HttpGet(url);
        String html = executeReferer(httpGet, referer);
        return html;
    }

    public static String doPost(String url, Map<String,String> params) throws IOException {
        HttpPost httpPost = new HttpPost(url);

        List<BasicNameValuePair> list = new ArrayList<>();
        for (String key : params.keySet()) {
            list.add(new BasicNameValuePair(key,params.get(key)));
        }

        UrlEncodedFormEntity entity = new UrlEncodedFormEntity(list);
        httpPost.setEntity(entity);

        return execute(httpPost);
    }

    public static void main(String args[]) {
        String pid = "4310407";
//        try {
//            JDPhoneHolder.getCommitCount(pid);
//        } catch (IOException e) {
//            e.printStackTrace();
//        }

        try {
            int commitCountNum = JDPhoneHolder.getCommitCountNum(pid);
            System.out.println("Product: " + pid + ", comment count: " + commitCountNum);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

 

The table structures used are included here as well:

Product table:

CREATE TABLE `good_holder_jd_info_czsjzj` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `pid` varchar(32) NOT NULL COMMENT 'Product ID',
  `title` varchar(1024) NOT NULL COMMENT 'Product title',
  `brand` varchar(1024) NOT NULL COMMENT 'Product brand',
  `pname` varchar(1024) NOT NULL COMMENT 'Product name',
  `price` varchar(32) NOT NULL COMMENT 'Product price',
  `url` varchar(2048) NOT NULL COMMENT 'Product URL',
  `priceLow` double(16,2) DEFAULT NULL COMMENT 'Lowest price of the product',
  `priceHigh` double(16,2) DEFAULT NULL COMMENT 'Highest price of the product',
  `goodrate` int(11) DEFAULT NULL COMMENT 'Positive-review rate',
  `totalc` int(64) DEFAULT NULL COMMENT 'Total number of comments',
  `goodc` int(11) DEFAULT NULL COMMENT 'Number of positive reviews',
  `generalc` int(11) DEFAULT NULL COMMENT 'Number of neutral reviews',
  `poorc` int(11) DEFAULT NULL COMMENT 'Number of negative reviews',
  `videoc` int(11) DEFAULT NULL COMMENT 'Number of video reviews',
  `afterc` int(11) DEFAULT NULL COMMENT 'Number of follow-up reviews',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1134 DEFAULT CHARSET=utf8mb4;

Comment category table (I did not scrape the full comment text, only the comment categories and their counts):

CREATE TABLE `good_holder_jd_comment_czsjzj` (
  `rid` int(11) NOT NULL COMMENT 'Primary key ID of the product record the comments belong to',
  `info` varchar(256) DEFAULT NULL COMMENT 'Category description',
  `count` int(11) DEFAULT NULL COMMENT 'Number of comments in this category'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

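The data in this comment category table corresponds to the comment tags and their counts shown in the review section of a product page (the original post circled them in red in a screenshot, which is not reproduced here).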
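To make the info/count pairs concrete, here is a hedged sketch of turning tag text of the form "label(count)" into a name and a number, including the '万' (ten thousand) and '+' suffixes that getRealCount handles. The class and method names and the sample tag are hypothetical, and unlike the original it rounds instead of truncating.

```java
import java.util.AbstractMap;
import java.util.Map;

/** Sketch: parse "label(count)" tag text; names and sample data are hypothetical. */
final class TagCountSketch {

    private static final char WAN = '\u4e07'; // U+4E07, the "wan" character (ten thousand)

    /** Converts counts such as "37", "500+" or "1.2(wan)+" to an int. */
    static int parseCount(String raw) {
        String s = raw.trim();
        if (s.endsWith("+")) {
            s = s.substring(0, s.length() - 1);
        }
        if (!s.isEmpty() && s.charAt(s.length() - 1) == WAN) {
            double units = Double.parseDouble(s.substring(0, s.length() - 1));
            return (int) Math.round(units * 10000);
        }
        return Integer.parseInt(s);
    }

    /** Splits "label(count)" into the label and the numeric count. */
    static Map.Entry<String, Integer> parseTag(String text) {
        int open = text.lastIndexOf('(');
        int close = text.lastIndexOf(')');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("no count in: " + text);
        }
        String label = text.substring(0, open).trim();
        return new AbstractMap.SimpleEntry<>(label, parseCount(text.substring(open + 1, close)));
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> tag = parseTag("easy to install(1.2\u4e07)");
        System.out.println(tag.getKey() + " -> " + tag.getValue());
    }
}
```

Each resulting pair would become one (rid, info, count) row in the comment category table.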

 

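To close the post, here are the methods for fetching JD product prices and comment data (the JDPhoneHolder methods called above):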

// Get the price; only the product ID is needed.
// Requires com.google.gson.Gson and com.google.gson.GsonBuilder.
public static String getPrice(String pid) throws IOException {
    String priceUrl = "https://p.3.cn/prices/mgets?pduid=" + Math.random() + "&skuIds=J_" + pid;
    String priceJson = HttpClientUtils.doGet(priceUrl);
    System.out.println(priceJson);
    Gson gson = new GsonBuilder().create();
    // The response is parsed as a list of maps; the "p" entry holds the price
    List<Map<String,String>> list = gson.fromJson(priceJson, List.class);
    return list.get(0).get("p");
}

// Get the comment information of a product; only the product ID is needed.
// Requires com.alibaba.fastjson.JSON and com.alibaba.fastjson.JSONObject.
public static JSONObject getComments(String pid) throws IOException {
    String baseUrl = "https://sclub.jd.com/comment/productPageComments.action?score=0&sortType=5&page=1&pageSize=1&isShadowSku=0&productId=" + pid;
    String commentJson = HttpClientUtils.doGet(baseUrl);
    System.out.println(commentJson);

    return JSON.parseObject(commentJson);
}

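The URLs in these two functions (highlighted in red in the original post) are the important part. Judging from them, the JD.com site is designed in a fairly simple way.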
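For readers who want to experiment without Gson or fastjson, the two URLs and the price lookup can be sketched with the JDK alone. The endpoints are copied from the functions above and are not verified to still respond; the response shape (a JSON array of objects with a "p" field) is an assumption inferred from list.get(0).get("p"), and the class name, method names and sample JSON are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: build the price/comment URLs and pull "p" out of a response; names are hypothetical. */
final class JdApiSketch {

    // Matches "p":"<value>" and captures the value
    private static final Pattern PRICE_FIELD = Pattern.compile("\"p\"\\s*:\\s*\"([^\"]*)\"");

    static String priceUrl(String pid, double pduid) {
        return "https://p.3.cn/prices/mgets?pduid=" + pduid + "&skuIds=J_" + pid;
    }

    static String commentSummaryUrl(String pid) {
        return "https://sclub.jd.com/comment/productPageComments.action"
                + "?score=0&sortType=5&page=1&pageSize=1&isShadowSku=0&productId=" + pid;
    }

    /** Returns the first "p" value in the JSON text, or null if there is none. */
    static String extractPrice(String json) {
        Matcher m = PRICE_FIELD.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String pid = "4310407"; // product ID used in the post's own test main
        System.out.println(priceUrl(pid, Math.random()));
        System.out.println(commentSummaryUrl(pid));
        // Made-up sample response in the assumed shape
        String sample = "[{\"p\":\"35.00\",\"id\":\"J_4310407\"}]";
        System.out.println(extractPrice(sample));
    }
}
```

A regular expression is only a stand-in here; for anything beyond a quick check, a real JSON parser such as the ones used in the post is the safer choice.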

 

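That is all for this post. The crawler above (which mainly scrapes in-car phone holder listings) can scrape similar information for other products with small changes. Comments are welcome, and so are solutions for getting around Alibaba's anti-scraping measures!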
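One of those small changes is the search keyword. As a sketch, the search URL from the urls array can be rebuilt from any keyword by URL-encoding it as UTF-8; the fixed query parameters are copied from the URL in this post, and the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Sketch: build a JD search URL from a keyword; class and method names are hypothetical. */
final class JdSearchUrlSketch {

    static String searchUrl(String keyword) {
        try {
            // Percent-encode the keyword as UTF-8 (spaces become '+')
            String encoded = URLEncoder.encode(keyword, "UTF-8");
            return "https://search.jd.com/Search?keyword=" + encoded
                    + "&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&click=0";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes for the in-car phone holder keyword used in the post
        String keyword = "\u8f66\u8f7d\u624b\u673a\u652f\u67b6";
        System.out.println(searchUrl(keyword));
    }
}
```

For the keyword 车载手机支架 this reproduces the URL hard-coded in getProductsWithFullScenario; the table name suffixes in the products array would need a matching new entry.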

 

Source: https://www.cnblogs.com/shihuc/p/12528377.html
