How do I make regex .findall() return all matches within for-loop as intended?

Time:03-21

I am trying to write a for-loop that iterates through individual rows. It uses regex to find a specific date identified by name. It then strips the date name, and saves the date itself as a list object for placement in an appropriate empty column.

My issue is that some rows have multiple dates with the same name (e.g. 'Exit Date: xx/xx/xxxx'), and the re.findall in my for-loop only saves the first date that matches the pattern instead of all of them.

My bare-bones regex test, which only runs on a single row (37), finds all the dates and prints them correctly. However, as soon as I extend the pattern to re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', x) in the for-loop, it returns only a single date instead of all of them (when there is more than one 'Exit Date: xx/xx/xxxx').

x = exit_note.loc[37, 'Exit Note']

match = re.findall(r'(\d{2}/\d{2}/\d{4})', x)
if match:
    print(match)
else:
    print('no match')

Prints out ['03/10/2020', '03/06/2020']

The actual for-loop code is as follows:

exit_note_date = []
for index, row in exit_note.iterrows():
    x = row['Exit Note']
    edmatch = re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', x)
    if len(edmatch) > 0:
        edstring = edmatch[0].strip('Exit Date:   ')
        exit_note_date.append(edstring)
    else:
        exit_note_date.append('null')

        print(edstring)

exit_note['Exit Date note'] = pd.Series(exit_note_date)

The for-loop works, but re.findall only retrieves one single date per row before inserting it into the appropriate column.

Any ideas on how to make the for-loop enter the appropriate number of dates into the date column, when more than one date exists in the row? I am new to Python, new to Regex, and new to Pandas - but my understanding is that re.findall should be returning every and all patterns, instead of just the first one it finds.

Thanks!
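For what it's worth, a quick check suggests that re.findall is already returning every match with this pattern, and that the `edmatch[0]` indexing in the loop is what keeps only the first one. A minimal sketch, assuming a made-up note string (the real 'Exit Note' text isn't shown above):

```python
import re

# Hypothetical note text with two 'Exit Date' entries (made up for illustration).
note = "Exit Date: 03/10/2020 client left. Exit Date: 03/06/2020 re-entered."

matches = re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', note)

print(matches)     # the full list of matches
print(matches[0])  # indexing with [0] keeps only the first match
```

So the thing to change would be how the list is used after the findall call, not the findall call itself.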

CodePudding user response:

As an additional answer: you can use the ttp parser to extract the dates. Regex would also work, but as far as I understand, you won't need it if you go the ttp route.

from ttp import ttp
import json

template_date = """
Exit Date: {{date1}}
"""

template_date2 = """
Exit Date: {{date1}} {{date2}}
"""

templates = [template_date, template_date2]

with open("text_text.txt") as f:
    data_to_parse = f.read()

def parsing(data_to_parse, ttp_template):
    parser = ttp(data=data_to_parse, template=ttp_template)
    parser.parse()

    # Get the result as a JSON string
    results = parser.result(format='json')[0]

    # Convert the JSON string to Python objects
    result = json.loads(results)

    print(result)

for ttp_template in templates:
    parsing(data_to_parse, ttp_template)

Output:

[{'date1': '03/10/2020'}]

[{'date1': '03/10/2020', 'date2': '03/06/2020'}]

Contents of text_text.txt:

Exit Date: 03/10/2020

Exit Date: 03/10/2020 03/06/2020

Regards.

CodePudding user response:

I was able to come up with a for-loop that does what I want. Thanks to everyone for your assistance!

exit_note_date = []
for index, row in exit_note.iterrows():
    x = row['Exit Note']
    edmatch = re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', x)
    if len(edmatch) > 0:
        edstring = [exit_date.strip('Exit Date:    ') for exit_date in edmatch]

        exit_note_date.append(edstring)
    else:
        exit_note_date.append('null')

exit_note['Exit Date note'] = pd.Series(exit_note_date)
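
One caveat about the loop above: str.strip('Exit Date:    ') removes any of those characters from both ends of the string, not the literal prefix. It works here only because the dates start and end with digits. A possible alternative, sketched under assumptions, is to put the capture group around the date alone so that findall returns just the dates and no stripping is needed. The helper name `extract_exit_dates` and the sample notes are hypothetical, not from the original data:

```python
import re

# Capture only the date; the 'Exit Date:' label is matched but not returned.
EXIT_DATE_RE = re.compile(r'Exit Date:.*?(\d{2}/\d{2}/\d{4})')

def extract_exit_dates(note):
    """Return every date that follows an 'Exit Date:' label, or 'null' if none."""
    if not isinstance(note, str):  # e.g. NaN in an empty cell
        return 'null'
    dates = EXIT_DATE_RE.findall(note)
    return dates if dates else 'null'

# Made-up example notes
print(extract_exit_dates("Exit Date: 03/10/2020; Exit Date: 03/06/2020"))
print(extract_exit_dates("no dates here"))
```

With a DataFrame like the one above, this could presumably be applied as exit_note['Exit Date note'] = exit_note['Exit Note'].apply(extract_exit_dates), which also keeps the results aligned with the DataFrame's own index instead of relying on a fresh pd.Series.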