Web Scraping: Name-Test Scoring, Part 2

2022-01-22 11:34:52 · Views: 198 · Source: Internet



1. Collecting the Chinese characters

import pandas as pd
import requests
from bs4 import BeautifulSoup
session=requests.session()

# http://xh.5156edu.com/pinyi.html : index page linking to every pinyin
# https://www.xingming.com/dafen/ : the name-scoring endpoint ("⺋" is used later as a test character)
url1="http://xh.5156edu.com/pinyi.html"


headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
}
r1=session.get(url1, headers=headers)
soup = BeautifulSoup(r1.content, 'lxml')

list1 = soup.select("tr > td > a.fontbox")

list2=[]  # each entry is [href, pinyin] for one pinyin link
for i in list1:
    list2.append([i.get("href"),i.text.strip()])
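The selector logic can be checked offline against a minimal, made-up HTML snippet that mirrors the structure of the pinyin index page (the real markup may differ in detail):

```python
from bs4 import BeautifulSoup

# Toy snippet imitating the index page: <a class="fontbox"> links inside table cells
html = """
<table><tr>
  <td><a class="fontbox" href="/html2/p1.html">a</a></td>
  <td><a class="fontbox" href="/html2/p2.html">ba</a></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
links = [[a.get("href"), a.text.strip()] for a in soup.select("tr > td > a.fontbox")]
# links == [['/html2/p1.html', 'a'], ['/html2/p2.html', 'ba']]
```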



def f2(url2):  # return the characters listed on one pinyin page
    #url2 = "http://xh.5156edu.com/html2/p105.html"
    r2=session.get(url2, headers=headers)
    r2.encoding = 'gb18030' 
    soup = BeautifulSoup(r2.text, 'lxml')
    list3 = soup.select("a.fontbox")
    list4 = []
    for i in list3:
        list4.append(i.text[0])
    return list4



import time
list5=[]
for i in list2:
    i2 = "http://xh.5156edu.com/"+i[0]
    print(i2)
    list5.append(f2(i2))
    time.sleep(1)

    
# write the characters out, one pipe-delimited pinyin group per line
with open("hanzi.txt", "w", encoding="utf8") as f:
    for i in list5:
        f.write("|".join(i) + "\n")
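The crawl above issues one request per pinyin page, so a single transient network error aborts the whole run. A small retry helper could wrap `session.get` (a sketch: the helper name, retry count, and timeout are my assumptions, not from the original):

```python
import time
import requests

def get_with_retry(session, url, headers, retries=3, delay=2):
    """GET a URL, retrying on network errors with a fixed delay between attempts."""
    for attempt in range(retries):
        try:
            r = session.get(url, headers=headers, timeout=10)
            r.raise_for_status()  # treat HTTP errors like network errors
            return r
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # out of retries: let the caller see the error
            time.sleep(delay)
```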
    

2. Fetching scores from the rating site

# -*- coding: utf-8 -*-
"""
Created on Sun Nov 21 22:31:06 2021

@author: Administrator
"""

import pandas as pd
import requests
from bs4 import BeautifulSoup
session=requests.session()
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
}

with open("hanzi.txt", "r", encoding="utf8") as f:
    list1 = f.readlines()
    

# flatten everything into a single list of characters (21,763 in total)
list2 = []
for i in list1:
    line = i.strip()
    if line:  # skip blank lines; note "".split("|") yields [''], not []
        list2.extend(line.split("|"))
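The flattening step can also be written with `itertools.chain`, with blank lines skipped; a quick check against made-up lines in the `hanzi.txt` format:

```python
from itertools import chain

lines = ["你|好|吗\n", "世|界\n", "\n"]  # toy sample in the hanzi.txt format
chars = list(chain.from_iterable(
    line.strip().split("|") for line in lines if line.strip()))
# chars == ['你', '好', '吗', '世', '界']
```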
        
def ff3(ming):
    """POST one candidate name to the scoring site and return the score text."""
    # example: ming = "堂"
    url3 = "https://www.xingming.com/dafen/"
    dict0 = {'xs': '李',
             'mz': f'金{ming}',
             'action': 'test'}
    r4 = session.post(url3, data=dict0, headers=headers)
    soup = BeautifulSoup(r4.content, 'lxml')
    try:
        score = soup.select("font[color='ff0000']")[0].text
    except IndexError:
        # no score element on the page; return the start of the page text instead
        score = soup.text[:15]
    return score
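When the site returns a message instead of a score, `ff3` falls back to raw page text. A small regex helper (hypothetical; the exact response format is an assumption) can pull a number out of either form:

```python
import re

def parse_score(text):
    """Extract the first numeric value from the response text, or None if absent."""
    m = re.search(r"\d+(?:\.\d+)?", text)
    return float(m.group()) if m else None

parse_score("89.5")            # -> 89.5
parse_score("名字中有生僻字")    # -> None (no digits in the error message)
```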
    
# quick check with "⺋", a component character the site cannot score,
# which exercises the IndexError fallback in ff3
ming = "⺋"
ff3(ming)


df1 = pd.DataFrame([[i, None] for i in list2])
df1.columns = ['1', '2']
df1 = df1.drop_duplicates().reset_index(drop=True).copy()


# keep only the rows that do not have a score yet
df2 = df1[df1['2'].isna()].copy()
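Filtering with `!= None` is a silent trap in pandas: the comparison is elementwise and evaluates True for every row, missing or not, so nothing gets filtered. A toy DataFrame (made-up data) shows the difference:

```python
import pandas as pd

d = pd.DataFrame({"char": ["金", "木"], "score": [None, 90.0]})
mask_eq = d["score"] != None   # noqa: E711  (elementwise: True for every row)
mask_na = d["score"].isna()    # True only where the score is missing
# mask_eq.all() is True; d[mask_na]["char"] contains only '金'
```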
import datetime,time

for i in range(df2.shape[0]):
    now_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    ming = df2.iloc[i, 0]
    if pd.notna(df2.iloc[i, 1]):  # already scored, skip
        continue
    try:
        soc = ff3(ming)
        try:
            soc = float(soc)
        except ValueError:
            pass  # keep the raw text when the site returned a message instead of a number
        df2.iloc[i, 1] = soc
    except Exception as e:
        print(now_time, "----err---", str(e), df2.iloc[i, 0])
    if i % 100 == 0:
        print(now_time, "----------", i)
    time.sleep(0.2)



df2.to_excel("soc.xlsx")


df3 = df2

Source: https://www.cnblogs.com/andylhc/p/15832646.html
