python 3.x - Regex: Match a malformed date
I'm trying to grab the date (without the time) from the following OCR'd strings:
04.10.2015, in usd
04.10.20 15, in eur
04,1 0.2015, in xyz
1 1. 10.2 01 5, in xyz
0 1.11.201 5 12:30
1 1,0 3, 2 0 1 5 1 2:3 0
With the following expression I can catch the dates, but I can't skip the "12" of the hours:
([\d\s]{2,}(?:\.|,)[\d\s]{2,}(?:\.|,)[\d\s]{4,})
How can I make this work? In plain English, how can I make the last part stop once it has found 4 digits in a mix of digits and spaces/tabs?
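To make the problem concrete, here is a minimal sketch that runs the expression above on one of the sample lines. The variable names are only illustrative. Because the last character class is greedy and has no upper bound, it keeps consuming digits and spaces until it reaches the colon, so the hour leaks into the capture.

```python
import re

# The expression from the question, unchanged.
original = re.compile(r'([\d\s]{2,}(?:\.|,)[\d\s]{2,}(?:\.|,)[\d\s]{4,})')

line = "0 1.11.201 5 12:30"
captured = original.search(line).group(1)

# The trailing [\d\s]{4,} swallows " 12" as well as the year.
print(repr(captured))  # '0 1.11.201 5 12'
```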
By catching the first 8 digits on the line, you get the date.
\D : a non-digit character
\d : a digit character
(?:...) : a non-capturing group (matched, but not captured)
^\D* : used to skip the beginning of the line up to the first digit
We match, 8 times, a digit followed by any non-digit characters, starting from the first digit found.
import re

# Skip leading non-digits, then capture 8 digits and whatever separates them.
p = re.compile(r'^\D*((?:\d\D*?){8})', re.MULTILINE)
test_str = """04.10.2015, in usd
04.10.20 15, in eur
04,1 0.2015, in xyz
1 1. 10.2 01 5, in xyz
0 1.11.201 5 12:30
1 1,0 3, 2 0 1 5 1 2:3 0
"""
print(re.findall(p, test_str))
I have a test for it here: https://regex101.com/r/eq8zj9/4
You can then filter the non-digits out of each date and parse it:
from datetime import datetime

for s in re.findall(p, test_str):
    # Drop every non-digit so only DDMMYYYY remains.
    digits = re.sub(r'\D', '', s)
    # %Y expects a four-digit year.
    print(datetime.strptime(digits, '%d%m%Y'))
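If you need this per line rather than over a whole block of text, the two steps can be folded into one helper. This is only a sketch: the name `parse_ocr_date` and the choice to return `None` on failure are assumptions, not part of the answer above. It assumes the date comes first on the line and is always day, month, four-digit year.

```python
import re
from datetime import datetime

# Same idea as above: skip leading non-digits, then take 8 digits
# together with any non-digit noise between them.
FIRST_EIGHT_DIGITS = re.compile(r'^\D*((?:\d\D*?){8})')


def parse_ocr_date(line):
    """Return the date at the start of `line`, or None if it cannot be read.

    Hypothetical helper: assumes DDMMYYYY order once the noise is removed.
    """
    match = FIRST_EIGHT_DIGITS.match(line)
    if match is None:
        return None
    digits = re.sub(r'\D', '', match.group(1))
    try:
        return datetime.strptime(digits, '%d%m%Y').date()
    except ValueError:
        # Eight digits were found, but they do not form a valid date.
        return None


print(parse_ocr_date("1 1,0 3, 2 0 1 5 1 2:3 0"))  # 2015-03-11
print(parse_ocr_date("no date here"))              # None
```

Returning `None` instead of raising keeps a single badly OCR'd line from stopping a batch; raise instead if a bad line should be treated as an error.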