Capturing the last group: everything from the first non-digit character onward

Time:02-03

I am trying to capture everything after and including the first non-digit character in the following text:

1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                                  00814766
                                                            P O BOX 883                       FAX 909 386-1288
                                                            COLTON CA  92324

For example, I want the regex to capture four groups: 1, 1,486,399.87, 5, and ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED 00814766 P O BOX 883 FAX 909 386-1288 COLTON CA 92324.

The code I have right now is:

# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
import itertools

# text
t = """    1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                                  00814766
                                                            P O BOX 883                       FAX 909 386-1288
                                                            COLTON CA  92324"""

tt = re.search(r"(\d+)\s+(\$?[+-]?\d{1,3}(\,\d{3})*%?(\.\d+)?)\s+(\d+)\s+(\S*)", t)

ttgroup = len(tt.groups())

print(tt[ttgroup])

It returns only ORTIZ. I suppose the (\S*) group at the end of the pattern needs to be improved. Is there a way to capture the entire ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED 00814766 P O BOX 883 FAX 909 386-1288 COLTON CA 92324 in the last group? Thank you so much!

CodePudding user response:

I'd replace the last group, currently (\S*), with (\S.*), since you want to capture the rest of the string. Also pass the re.DOTALL flag so that . matches newline characters too, since the string spans several lines:

tt = re.search(r"(\d+)\s+(\$?[+-]?\d{1,3}(\,\d{3})*%?(\.\d+)?)\s+(\d+)\s+(\S.*)", t, re.DOTALL)
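
Note that with re.DOTALL the last group keeps the newlines and the runs of spaces between the columns, so it will not yet look like the single-line result asked for in the question. Below is a minimal, self-contained sketch of one way to get there. It is a variation on the pattern above, not the only option: the two inner groups are made non-capturing so that groups() returns exactly four values, and the variable names line_no, amount, count and rest are made up for illustration.

```python
import re

# Sample text from the question (triple-quoted so it can span several lines).
t = """    1         1,486,399.87    5              ORTIZ ASPHALT PAVING INC              909 386-1200  SB PREF CLAIMED
                                                                                                  00814766
                                                            P O BOX 883                       FAX 909 386-1288
                                                            COLTON CA  92324"""

# Same pattern as above, except that the inner thousands and decimal groups
# are non-capturing (?:...), so only the four wanted groups are captured.
pattern = re.compile(
    r"(\d+)\s+(\$?[+-]?\d{1,3}(?:,\d{3})*%?(?:\.\d+)?)\s+(\d+)\s+(\S.*)",
    re.DOTALL,  # lets "." match newlines, so the last group runs to the end
)

m = pattern.search(t)
if m is None:
    raise ValueError("pattern did not match the text")

# Hypothetical names for the four captured columns.
line_no, amount, count, rest = m.groups()

# str.split() with no argument splits on any run of whitespace (including
# newlines), so joining with a single space collapses the column padding.
rest = " ".join(rest.split())

print(line_no, amount, count)
print(rest)
```

If you would rather keep the original capturing groups, the last group is still available as tt[len(tt.groups())], and the same " ".join(... .split()) step can be applied to it.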