Python 3.x - Regex: Match a malformed date


I'm trying to grab the date (without the time) from the following OCR'd strings:

04.10.2015, in usd
04.10.20 15, in eur
04,1 0.2015, in xyz
1 1. 10.2 01 5, in xyz
0 1.11.201 5 12:30
1 1,0 3, 2 0 1 5 1 2:3 0

With the following expression I can catch the dates, but I can't skip the "12" of the hours:

([\d\s]{2,}(?:\.|,)[\d\s]{2,}(?:\.|,)[\d\s]{4,}) 
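To make the problem concrete, here is a minimal sketch of what that expression returns on the fifth sample line. The names `original`, `line` and `found` are only illustrative.

```python
import re

# The expression from the question, unchanged.
original = re.compile(r'([\d\s]{2,}(?:\.|,)[\d\s]{2,}(?:\.|,)[\d\s]{4,})')

line = "0 1.11.201 5 12:30"
found = original.search(line).group(1)

# The last part, [\d\s]{4,}, is greedy and has no upper bound on digits,
# so it runs on through the space and the "12" until it hits the ":".
print(repr(found))
```

The `{4,}` quantifier counts characters (digits or whitespace), not digits, so it cannot stop after the fourth digit by itself.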

How can I make this work? In plain English, how can I make the last part stop once it has found 4 digits in a mix of digits and spaces/tabs?

By catching the first 8 digits on each line, you get the date.

  • \D is a non-digit character
  • \d is a digit character
  • (?:...) is a non-capturing group (it groups without being captured)
  • ^\D* is used to ignore the beginning of the line up to the first digit

Starting from the first digit found, we match 8 times a digit followed by as few non-digit characters as possible.

import re

# Skip leading non-digits, then capture 8 digits with any non-digits between them.
p = re.compile(r'^\D*((?:\d\D*?){8})', re.MULTILINE)

test_str = """04.10.2015, in usd
    04.10.20 15, in eur
    04,1 0.2015, in xyz
    1 1. 10.2 01 5, in xyz
    0 1.11.201 5 12:30
    1 1,0 3, 2 0 1 5 1 2:3 0 """

print(p.findall(test_str))

You can test it here: https://regex101.com/r/eq8zj9/4
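As a reading aid, the same pattern can be spelled out piece by piece with `re.VERBOSE`. This is an illustrative rewrite, not part of the original answer; `DATE_RE`, `sample` and `matches` are made-up names.

```python
import re

# Same pattern as above, annotated piece by piece (illustrative rewrite).
DATE_RE = re.compile(r"""
    ^\D*            # skip any non-digits at the start of the line
    (               # group 1: the raw date text
      (?:           #   non-capturing group, repeated 8 times
        \d          #     one digit
        \D*?        #     as few non-digits as possible (lazy)
      ){8}
    )
""", re.MULTILINE | re.VERBOSE)

sample = "0 1.11.201 5 12:30\n1 1,0 3, 2 0 1 5 1 2:3 0"
matches = DATE_RE.findall(sample)
print(matches)  # ['0 1.11.201 5', '1 1,0 3, 2 0 1 5']
```

Because the inner `\D*?` is lazy, the capture ends right on the eighth digit, so the time that follows is never included.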

You can then filter the non-digits out of each match and parse the date:

from datetime import datetime

for s in p.findall(test_str):
    digits = re.sub(r'\D', '', s)  # keep only the 8 digits
    print(datetime.strptime(digits, '%d%m%Y'))
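The two steps above can also be wrapped in one helper. The following is a sketch under some assumptions: the function name `extract_dates` is hypothetical, the dates are assumed to be in day-month-year order with a 4-digit year, and 8-digit runs that do not form a valid date are assumed to be OCR noise and silently skipped.

```python
import re
from datetime import date, datetime

DATE_RE = re.compile(r'^\D*((?:\d\D*?){8})', re.MULTILINE)

def extract_dates(text):
    """Return date objects for each line's first 8 digits (hypothetical helper)."""
    dates = []
    for raw in DATE_RE.findall(text):
        digits = re.sub(r'\D', '', raw)  # e.g. '04,1 0.2015' -> '04102015'
        try:
            dates.append(datetime.strptime(digits, '%d%m%Y').date())
        except ValueError:
            # Assumption: an unparseable run is noise, so skip it.
            continue
    return dates

result = extract_dates("04,1 0.2015, in xyz\n9 9.99.2015 junk\n1 1,0 3, 2 0 1 5 1 2:3 0")
print(result)  # [datetime.date(2015, 10, 4), datetime.date(2015, 3, 11)]
```

Whether to skip bad lines or raise on them depends on how much you trust the OCR output; skipping is just the choice made in this sketch.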
