regex - Python regular expression using the OR operator -


I am trying to parse a large sample of text files using regular expressions (re). I want to extract from these files the part of the text that contains 'vu' and ends with a newline '\n'.

The patterns differ from one file to another, so I tried to look for combinations of regular expressions in my files using the OR operator. However, I did not find a way to automate my code so that the re.findall() function looks for a combination of regular expressions.

Here is an example of how I tried to tackle this issue, but apparently I still cannot evaluate both regular expressions with the OR operator in re.findall():

import re

def series2string(myserie):
    myserie2 = ' or '.join(serie for serie in myserie)
    return myserie2

def expression(pattern, mystring):
    x = re.findall(pattern, mystring)
    if len(x) > 0:
        return 1
    else:
        return 0

# text example
text = "\n\n    (troisième chambre)\n    - vu la requête, enregistrée le 28 février 1997 sous le n° 97nc00465, présentée pour m. z... farinez, demeurant ... à dommartin-aux-bois (vosges), par me y..., avocat ;\n"

# expressions to look for
pattern1 = '^\s*vu.*\n'
pattern2 = '^\s*\(\w*\s*\w*\)\s*.*?vu.*\n'

pattern = [pattern1, pattern2]
pattern = series2string(pattern)

expression(pattern, text)

Note: I circumvented the problem by looking for each pattern in a for loop, but my code would run faster if I could use re.findall() only once.

Python regular expressions use the | operator for alternation.

def series2string(myserie):
    myserie2 = '|'.join(serie for serie in myserie)
    myserie2 = '(' + myserie2 + ')'
    return myserie2

More information: https://docs.python.org/3/library/re.html
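As a minimal sketch of the alternation above, here is one way the two patterns from the question could be combined. It assumes that ^ is meant to match at the start of every line, hence re.MULTILINE (my assumption, the question does not say so). The names combine_patterns and has_match are hypothetical, and the sample text is a shortened version of the one in the question. Each alternative is wrapped in a non-capturing group (?:...) so that re.findall() returns whole matches rather than group contents.

```python
import re

def combine_patterns(patterns):
    # Wrap every alternative in a non-capturing group so that | separates
    # whole patterns and findall() does not return group contents.
    return "|".join(f"(?:{p})" for p in patterns)

def has_match(pattern, text):
    # re.search stops at the first hit, which is enough for a yes/no answer.
    return 1 if re.search(pattern, text, flags=re.MULTILINE) else 0

# Shortened sample text (assumption: same shape as the question's example).
text = (
    "\n\n    (troisième chambre)\n"
    "    - vu la requête, enregistrée le 28 février 1997 ;\n"
)

pattern1 = r"^\s*vu.*\n"
pattern2 = r"^\s*\(\w*\s*\w*\)\s*.*?vu.*\n"

combined = combine_patterns([pattern1, pattern2])

found = has_match(combined, text)
no_match = has_match(combined, "nothing relevant here\n")
matches = re.findall(combined, text, flags=re.MULTILINE)
```

Here only pattern2 applies to the sample, because the line containing "vu" starts with "- " rather than with "vu" itself.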


The individual patterns are messy, and I don't know what is a mistake and what is intentional. I am guessing you are looking for the word "vu" in a few different contexts.

  1. Always use Python raw strings for regular expressions, prefixed with r (r'pattern here'). This allows you to use \ in a pattern without Python trying to interpret it as a string escape. It is passed directly to the regex engine.
  2. Use \s to match white-space (spaces and line-breaks).
  3. Since you already have several alternative patterns, don't make ( and ) optional. That can result in catastrophic backtracking, which can make matching large strings very slow.
    \(? becomes \(
    \)? becomes \)
  4. {1} doesn't do anything. It repeats the previous sub-pattern once, which is the same as not specifying anything.
  5. \br is invalid. In a non-raw string it is interpreted as \b (the ASCII backspace character) + the letter r.
  6. You have a quote character (') at the beginning of your text-string. Either you intend ^ to match the start of a line, or the ' is a copy/paste error.
  7. There were some errors when combining the patterns:

    pattern = [pattern1, pattern2, pattern3, pattern4]
    pattern = series2string(pattern)

    expression(re.compile(pattern), text)
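Points 1 and 5 are easy to check in the interpreter. The sketch below compares the same pattern written as a plain string and as a raw string. The sample sentence is made up for illustration and is not taken from the question.

```python
import re

# In a plain string literal, "\b" is one backspace character (0x08),
# so this pattern looks for backspace + "vu" + backspace.
plain = "\bvu\b"

# In a raw string literal the backslash is kept, so the regex engine
# receives \b and treats it as a word boundary.
raw = r"\bvu\b"

# Hypothetical sample text.
sample = "vu la requête, revue par la cour"

plain_hits = re.findall(plain, sample)
raw_hits = re.findall(raw, sample)
```

The raw pattern finds the standalone word "vu" but not the "vu" inside "revue", while the plain pattern finds nothing because the sample contains no backspace characters.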
