Python: using a regex to remove fields from a BibTeX file

2019-07-05 03:57:23  Views: 299  Source: Internet

Tags: python regex parsing bibtex


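I am trying to trim a BibTeX file exported from my reference manager, because it leaves in extra fields that end up getting mangled when I put the file into LaTeX.

A typical entry that I want to clean up is: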

@Article{Kholmurodov:2001p113,
author = {K Kholmurodov and I Puzynin and W Smith and K Yasuoka and T Ebisuzaki}, 
journal = {Computer Physics Communications},
title = {MD simulation of cluster-surface impacts for metallic phases: soft landing, droplet spreading and implantation},
abstract = {Lots of text here.  Even more text.},
affiliation = {RIKEN, Inst Phys {\&} Chem Res, Computat Sci Div, Adv Comp Ctr, Wako, Saitama 3510198, Japan},
number = {1},
pages = {1--16},
volume = {141},
year = {2001},
month = {Dec},
language = {English},
keywords = {Ethane, molecular dynamics, Clusters, Dl_Poly Code, solid surface, metal, Hydrocarbon Thin-Films, Adsorption, impact, Impact Processes, solid surface, Molecular Dynamics Simulation, Large Systems, DL_POLY, Beam Deposition, Package, Collision-Induced Desorption, Diamond Films, Vapor-Deposition, Transition-Metals, Molecular-Dynamics Simulation}, 
date-added = {2008-06-27 08:58:25 -0500},
date-modified = {2009-03-24 15:40:27 -0500},
pmid = {000172275000001},
local-url = {file://localhost/User/user/Papers/2001/Kholmurodov/Kholmurodov-MD%20simulation%20of%20cluster-surface%20impacts-2001.pdf},
uri = {papers://B08E511A-2FA9-45A0-8612-FA821DF82090/Paper/p113},
read = {Yes},
rating = {0}
}

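I want to remove fields such as month, abstract, keywords, and so on. Some of them are on a single line, others span multiple lines.

I tried this in Python, like so: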

fOpen = open(f,'r')
start_text = fOpen.read()
fOpen.close()

# regex
out_text = re.sub(r'^(month).*,\n','',start_text)
out_text = re.sub(r'^(annote)((.|\n)*?)\},\n','',out_text)
out_text = re.sub(r'^(note)((.|\n)*?)\},\n','',out_text)
out_text = re.sub(r'^(abstract)((.|\n)*?)\},\n','',out_text)

fNew = open(f,'w')
fNew.write(out_text)
fNew.close()

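I ran these regexes in TextMate first, to check them before using them in Python, and they seemed to work fine.

Any suggestions?

Thanks.

Answer:

How about this regex (to be used with the multiline and dotall flags):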

^(?:month|annote|note|abstract)\s*=\s*\{(?:(?!\},$).)*\},[\r\n]+

Explanation:

^                             # start-of-line
(?:                           # non-capturing group 1
  month|annote|note|abstract  #   one of these terms
)                             # end non-capturing group 1
\s*=\s*                       # whitespace, an equals sign, whitespace
\{                            # a literal curly brace
(?:                           # non-capturing group 2
  (?!                         #   negative look-ahead (if not followed by...)
    \},$                      #     a curly brace, a comma and the end-of-line
  )                           #   end negative look-ahead
  .                           #   ...then match next character, whatever it is
)*                            # end non-capturing group 2, repeat
\},                           # a literal curly brace and a comma
[\r\n]+                       # at least one end-of-line character

This single expression takes care of all the affected fields in one step.
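In Python, applying it could look like the sketch below. The `strip_fields` helper, the `FIELDS` tuple and the shortened sample entry are made up for illustration; the pattern itself is the one above, compiled with `re.MULTILINE` and `re.DOTALL`.

```python
import re

# Field names to drop (taken from the question); extend as needed.
FIELDS = ("month", "annote", "note", "abstract")

# MULTILINE lets ^ and $ match at every line boundary;
# DOTALL lets . cross newlines, so multi-line values are matched too.
FIELD_RE = re.compile(
    r"^(?:%s)\s*=\s*\{(?:(?!\},$).)*\},[\r\n]+" % "|".join(FIELDS),
    re.MULTILINE | re.DOTALL,
)


def strip_fields(bibtex):
    """Return the BibTeX text with the listed fields removed."""
    return FIELD_RE.sub("", bibtex)


# Shortened, hypothetical entry with one multi-line and one single-line field.
sample = (
    "@Article{Key,\n"
    "title = {A title},\n"
    "abstract = {Lots of text here.\n"
    "Even more text.},\n"
    "month = {Dec},\n"
    "year = {2001}\n"
    "}\n"
)

print(strip_fields(sample))
```

The pattern only stops at a `},` that ends a line, so it assumes the field being removed is not the last one in its entry (where there is no comma after the closing brace).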

Edit/Warning: note that this will fail as soon as something like the following occurs:

affiliation = {RIKEN, Inst Phys {\&},
Computat Sci Div, Adv Comp Ctr, Wako, Saitama 3510198, Japan},

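Regular expressions cannot handle nested structures. In this case there is no pure regex solution that is correct in all situations; the best you can get is a good approximation.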
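A quick way to see the failure in Python, assuming for the sake of the example that `affiliation` were one of the fields to remove (it is not in the pattern above):

```python
import re

# Assumption: same pattern as above, but targeting `affiliation`.
pattern = re.compile(
    r"^(?:affiliation)\s*=\s*\{(?:(?!\},$).)*\},[\r\n]+",
    re.MULTILINE | re.DOTALL,
)

# The nested {\&} group happens to be followed by a comma and a line break.
entry = (
    "affiliation = {RIKEN, Inst Phys {\\&},\n"
    "Computat Sci Div, Adv Comp Ctr, Wako, Saitama 3510198, Japan},\n"
    "number = {1},\n"
)

leftover = pattern.sub("", entry)
print(leftover)
```

The match stops at the first `},` that ends a line, so only the first half of the value is removed and the second half is left behind as a broken fragment.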

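The question is whether you are 100% sure that the case above cannot happen (which I don't think you can be), or whether you are willing to take the risk. If you are not entirely sure it won't be a problem, use or write a parser.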
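As a rough idea of what "write a parser" can mean here, the sketch below finds the start of each unwanted field with a regex and then counts brace depth to find where the value really ends. It is not a full BibTeX parser: the `remove_fields` name is made up, and it assumes values are delimited by balanced braces (not quotes) and that each field starts on its own line.

```python
import re

# Optional comma and the rest of the line after a field's closing brace.
_TAIL = re.compile(r"[ \t]*,?[ \t]*\r?\n?")


def remove_fields(bibtex, fields):
    """Drop `name = {...}` fields, matching braces by counting depth."""
    # Only the start of a field is found by regex; field names are
    # case-insensitive in BibTeX.
    start = re.compile(
        r"^[ \t]*(?:%s)[ \t]*=[ \t]*\{" % "|".join(map(re.escape, fields)),
        re.MULTILINE | re.IGNORECASE,
    )
    kept = []
    pos = 0
    while True:
        m = start.search(bibtex, pos)
        if m is None:
            break
        # Walk forward from the opening brace until the depth is back to zero.
        depth = 1
        i = m.end()
        while i < len(bibtex) and depth > 0:
            if bibtex[i] == "{":
                depth += 1
            elif bibtex[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: leave the rest of the text untouched
        kept.append(bibtex[pos:m.start()])
        pos = _TAIL.match(bibtex, i).end()
    kept.append(bibtex[pos:])
    return "".join(kept)


# Hypothetical shortened entry containing the nested case from the warning.
entry = (
    "@Article{Key,\n"
    "title = {A {Nested} title},\n"
    "affiliation = {RIKEN, Inst Phys {\\&},\n"
    "Computat Sci Div, Japan},\n"
    "abstract = {Line one.\n"
    "Line two.},\n"
    "year = {2001}\n"
    "}\n"
)

print(remove_fields(entry, ["affiliation", "abstract"]))
```

Because the end of a value is found by depth rather than by looking for `},` at the end of a line, the nested `{\&},` no longer cuts the match short. For anything beyond this kind of cleanup, an existing BibTeX parsing library is probably the safer choice.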

Source: https://codeday.me/bug/20190705/1383769.html