Python regular expression using the OR operator
I am trying to parse a large sample of text files with regular expressions (re). I want to extract from these files the part of the text that contains 'vu' and ends with a newline '\n'.
The patterns differ from one file to another, so I tried to look for combinations of regular expressions in my files using the OR operator. However, I did not find a way to automate my code so that the re.findall() function looks for a combination of regular expressions.
Here is an example of how I tried to tackle this issue, but apparently I still cannot evaluate both regular expressions and the OR operator in re.findall():
    import re

    def series2string(myserie):
        myserie2 = ' or '.join(serie for serie in myserie)
        return myserie2

    def expression(pattern, mystring):
        x = re.findall(pattern, mystring)
        if len(x) > 0:
            return 1
        else:
            return 0

    # text example
    text = "\n\n (troisième chambre)\n - vu la requête, enregistrée le 28 février 1997 sous le n° 97nc00465, présentée pour m. z... farinez, demeurant ... à dommartin-aux-bois (vosges), par me y..., avocat ;\n"

    # expressions to look for
    pattern1 = '^\s*vu.*\n'
    pattern2 = '^\s*\(\w*\s*\w*\)\s*.*?vu.*\n'
    pattern = [pattern1, pattern2]
    pattern = series2string(pattern)
    expression(pattern, text)
Note: I circumvented the problem by looking for each pattern in a for loop, but the code would run faster if I could use re.findall() only once.
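For reference, the loop workaround mentioned in the note might look something like the sketch below. The helper name `matches_any` and the shortened sample text are assumptions made for illustration; they do not come from the original post.

```python
import re

def matches_any(patterns, text):
    """Return 1 if at least one pattern matches text, else 0."""
    for p in patterns:
        # One re.findall() call per pattern: this is the repeated work
        # that a single combined pattern would avoid.
        if re.findall(p, text):
            return 1
    return 0

# Patterns from the question, written as raw strings.
patterns = [r"^\s*vu.*\n", r"^\s*\(\w*\s*\w*\)\s*.*?vu.*\n"]

# Shortened, hypothetical version of the sample text.
sample = "\n\n (troisième chambre)\n - vu la requête ;\n"

loop_result = matches_any(patterns, sample)
```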
Python regular expressions use the | operator for alternation.
    def series2string(myserie):
        myserie2 = '|'.join(serie for serie in myserie)
        myserie2 = '(' + myserie2 + ')'
        return myserie2
More information: https://docs.python.org/3/library/re.html
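As a rough sketch of how the combined pattern could be used end to end: the version below wraps each alternative in a non-capturing group `(?:...)` so that `|` cannot leak across pattern boundaries, and uses `re.search` instead of `re.findall`, since only a yes/no answer is needed. The names `combine_patterns` and `has_match`, the use of `re.MULTILINE`, and the shortened sample text are assumptions for illustration, not part of the original answer.

```python
import re

def combine_patterns(patterns):
    """Join alternatives with '|', wrapping each in a non-capturing group."""
    return "|".join(f"(?:{p})" for p in patterns)

def has_match(pattern, text):
    """Return 1 if the pattern matches anywhere in text, else 0."""
    # re.MULTILINE lets ^ match at the start of every line, not only
    # at the start of the whole string (an assumption about the intent).
    return 1 if re.search(pattern, text, flags=re.MULTILINE) else 0

# Shortened, hypothetical version of the sample text from the question.
text = "\n\n (troisième chambre)\n - vu la requête, enregistrée le 28 février 1997 ;\n"

pattern1 = r"^\s*vu.*\n"
pattern2 = r"^\s*\(\w*\s*\w*\)\s*.*?vu.*\n"
combined = combine_patterns([pattern1, pattern2])

result = has_match(combined, text)
```

By contrast, joining with `' or '` produces a pattern that looks for the literal characters " or " between the two expressions, which is why the original attempt found nothing.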
The individual patterns are messy, and I don't know what is a mistake and what is intentional. I am guessing you are looking for the word "vu" in a few different contexts.
- Always use Python raw strings for regular expressions, prefixed with r (r'pattern here'). This allows you to use \ in the pattern without Python trying to interpret it as a string escape. It is passed directly to the regex engine.
- Use \s to match white-space (spaces and line-breaks).
- Since you already have several alternative patterns, don't make ( and ) optional. This can result in catastrophic backtracking, which can make matching large strings very slow.
  \(? → \(
  \)? → \)
- {1} doesn't do anything. It repeats the previous sub-pattern once, which is the same as not specifying anything.
- \br is invalid in a non-raw string. It is interpreted as \b (the ASCII backspace character) followed by the letter r.
- You have a quote character (') at the beginning of your text-string. Either you intend ^ to match the start of each line, or the ' is a copy/paste error.

There were also some errors when combining the patterns:
    pattern = [pattern1, pattern2, pattern3, pattern4]
    pattern = series2string(pattern)
    expression(re.compile(pattern), text)
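To illustrate the raw-string and `\b` points from the list above, here is a small sketch comparing a normal string literal with a raw one. The pattern `\bvu\b` and the sample sentence are chosen purely for illustration.

```python
import re

# In a normal string literal, "\b" is the backspace character (code point 8),
# so the regex engine never sees a backslash at all.
plain = "\bvu\b"

# In a raw string the backslash is kept, and the regex engine reads \b
# as a word boundary.
raw = r"\bvu\b"

plain_len = len(plain)  # backspace, 'v', 'u', backspace
raw_len = len(raw)      # '\\', 'b', 'v', 'u', '\\', 'b'

sample = "vu la requête"
plain_hit = re.search(plain, sample)  # looks for literal backspace characters
raw_hit = re.search(raw, sample)      # looks for the whole word "vu"
```

With the raw string the word is found; with the normal string nothing matches, because the sample contains no backspace characters.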