I am trying to capture everything after and including the first non-digit character in the following text:
1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324
For example, I would want the regex to capture groups so that it matches "1", "1,486,399.87", "5", and "ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED 00814766 P O BOX 883 FAX 909 386-1288 COLTON CA 92324".
The code I have right now is:
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
import itertools
# text
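# a string that spans several lines needs triple quotes
t = """ 1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324"""
# groups: leading number, amount, second number, then the first non-space run
tt = re.search(r"(\d+)\s+(\$?[+-]?\d{1,3}(\,\d{3})*%?(\.\d+)?)\s+(\d+)\s+(\S*)", t)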
ttgroup = len(tt.groups())
print(tt[ttgroup])
It returns only ORTIZ. I suppose we need to improve the (\S*) group at the end of the pattern. Is there a way to capture the entire ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED 00814766 P O BOX 883 FAX 909 386-1288 COLTON CA 92324 in the last group? Thank you so much!
CodePudding user response:
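I'd replace the last group, which is now (\S*), with (\S.*), since you want to capture the rest of the string. (\S*) stops at the first whitespace character, which is why you only get ORTIZ. Also add the re.DOTALL flag so that . matches newline characters too, since this is a multiline string: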
tt = re.search(r"(\d+)\s+(\$?[+-]?\d{1,3}(\,\d{3})*%?(\.\d+)?)\s+(\d+)\s+(\S.*)", t, re.DOTALL)
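For reference, here is a minimal self-contained sketch of the same idea, using the sample text from the question. It differs from the line above in two ways that are my own choices, not part of the original pattern: the inner comma and decimal groups are made non-capturing with (?:...) so that only four groups come back, and the variable names (rank, amount, count, rest) are just illustrative. Since the desired output in the question shows the trailing text on one line, the last step collapses the newlines into single spaces; skip it if you want the line breaks preserved.

```python
import re

# Sample text from the question, kept as a multi-line string.
text = """ 1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324"""

# Same pattern as above, but the inner groups are non-capturing (an
# illustrative change) so match.groups() yields exactly four values.
pattern = re.compile(
    r"(\d+)\s+(\$?[+-]?\d{1,3}(?:,\d{3})*%?(?:\.\d+)?)\s+(\d+)\s+(\S.*)",
    re.DOTALL,  # let "." match newlines so the last group runs to the end
)

match = pattern.search(text)
if match is None:
    raise ValueError("text does not match the expected layout")

# Hypothetical names for the four captured fields.
rank, amount, count, rest = match.groups()

# Optional: collapse newlines and repeated spaces into single spaces.
rest_single_line = " ".join(rest.split())

print(rank, amount, count)
print(rest_single_line)
```

Checking for None before calling groups() avoids an AttributeError on input that does not follow this layout.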