
python – Using regex to find the HTML link nearest a keyword

2019-07-04 13:44:35 Views: 139 Source: Internet

Tags: python regex negative-lookahead

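If I am looking for the keyword "sales", I want to get the nearest link, "http://www.somewebsite.com", even when the file contains several links. I want the link nearest the keyword, not the first link in the file. That means I need to search for the link that immediately precedes the keyword match.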

This does not work:

regex = (http|https)://[-A-Za-z0-9./].*(?!((http|https)://[-A-Za-z0-9./]))sales
sales

What is the best way to find the link closest to a keyword?

Answer:

Using an HTML parser instead of a regex is usually easier and more robust.

Using the third-party module lxml:

import lxml.html as LH

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

doc = LH.fromstring(content)    
for url in doc.xpath('''
    //*[contains(text(),"sales")]
    /preceding::*[starts-with(@href,"http")][1]/@href'''):
    print(url)

This yields

http://www.somewebsite.com

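I find lxml (and XPath) a convenient way to express which elements I am looking for. However, if installing a third-party module is not an option, you can also do this particular job with HTMLParser from the standard library: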

import contextlib
from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser


class MyParser(HTMLParser):
    """Remember the href of the most recent start tag that has one."""

    def __init__(self):
        super().__init__()
        self.last_link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'href' in attrs:
            self.last_link = attrs['href']

content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

# Feed only the text before the keyword, so that last_link ends up
# holding the nearest link that precedes it.
idx = content.find('sales')

with contextlib.closing(MyParser()) as parser:
    parser.feed(content[:idx])
    print(parser.last_link)
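
One limitation of slicing the raw string with find is that the keyword may also occur inside a tag, for example inside a URL. The following is a minimal sketch of a variant that looks for the keyword only in text nodes, by overriding handle_data as well. The names NearestLinkParser and nearest_links are hypothetical, chosen for illustration, and the first URL in the sample was altered to contain "sales" to show the difference.

```python
from html.parser import HTMLParser


class NearestLinkParser(HTMLParser):
    """Record the most recent href seen before each text node containing a keyword."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.last_link = None  # href of the most recent tag that had one
        self.matches = []      # one entry per text node that contains the keyword

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href')
        if href is not None:
            self.last_link = href

    def handle_data(self, data):
        # Only text content reaches this method, so a keyword inside a URL
        # or another attribute value is not counted as a match.
        if self.keyword in data:
            self.matches.append(self.last_link)


def nearest_links(html, keyword):
    """Return the preceding link (or None) for each text node containing keyword."""
    parser = NearestLinkParser(keyword)
    parser.feed(html)
    parser.close()
    return parser.matches


content = '''<html><a href="http://www.sales-not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

print(nearest_links(content, 'sales'))
```

With this sample, content.find('sales') would stop inside the first URL, whereas the sketch above only reacts to the "sales" that appears as text in the last paragraph.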

About the XPath used in the lxml solution: it has the following meaning:

 //*                              # Find all elements
   [contains(text(),"sales")]     # whose text content contains "sales"
   /preceding::*                  # search the preceding elements 
     [starts-with(@href,"http")]  # such that it has an href attribute that starts with "http"
       [1]                        # select only the nearest such element (preceding:: is a reverse axis)
         /@href                   # return the value of the href attribute
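
Returning to the regex in the question: the negative lookahead there is only tested at the single position right before "sales", so it cannot rule out links that occur in between. If a regex-only approach is still needed, one workaround is to keep "nearest" out of the pattern entirely: locate the keyword first, then take the last link match in the text before it. The sketch below assumes the URL character class from the question with a `+` quantifier added; link_before_keyword is a hypothetical helper name, and the approach shares the usual fragility of running a regex over HTML.

```python
import re

# Character class taken from the question; the '+' quantifier is an
# assumption about what was intended, so that the whole URL is consumed.
LINK_RE = re.compile(r'https?://[-A-Za-z0-9./]+')


def link_before_keyword(text, keyword):
    """Return the last URL before the first occurrence of keyword, or None."""
    idx = text.find(keyword)
    if idx == -1:
        return None
    # Search only text[0:idx]; the last match is the one closest to the keyword.
    links = LINK_RE.findall(text, 0, idx)
    return links[-1] if links else None


content = '''<html><a href="http://www.not-this-one.com"></a>
<a href="http://www.somewebsite.com"></a><p>other stuff</p><p>sales</p>
</html>
'''

print(link_before_keyword(content, 'sales'))
```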

Source: https://codeday.me/bug/20190704/1377692.html
