How do I make regex .findall() return all matches within a for-loop as intended?

I am trying to write a for-loop that iterates through individual rows. It uses regex to find a specific date identified by name. It then strips the date name, and saves the date itself as a list object for placement in an appropriate empty column.

My issue is that some rows have multiple dates of the same name (e.g. 'Exit Date: xx/xx/xxxx'), and the re.findall in my for-loop only saves the first date that matches the pattern instead of all of them.

My barebones regex test, which only runs on a single row (37), finds all dates and prints them correctly. However, as soon as I extend the regex pattern to re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', x) in the for-loop, it returns only a single date instead of all of them (when there happens to be more than one 'Exit Date: xx/xx/xxxx').

x = exit_note.loc[37, 'Exit Note']

match = re.findall(r'(\d{2}/\d{2}/\d{4})', x)
if match:
    print(match)
else:
    print('no match')

Prints out ['03/10/2020', '03/06/2020']

The actual for-loop code is as follows:

exit_note_date = []
for index, row in exit_note.iterrows():
    x = row['Exit Note']
    edmatch = re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', x)
    if len(edmatch) > 0:
        edstring = edmatch[0].strip('Exit Date:   ')
        exit_note_date.append(edstring)
    else:
        exit_note_date.append('null')

        print(edstring)

exit_note['Exit Date note'] = pd.Series(exit_note_date)

The for-loop works, but re.findall only retrieves one single date per row before inserting it into the appropriate column.

Any ideas on how to make the for-loop enter the appropriate number of dates into the date column when more than one date exists in the row? I am new to Python, new to regex, and new to pandas, but my understanding is that re.findall should return every match of the pattern instead of just the first one it finds.

Thanks!

CodePudding user response：

As an additional answer: you can use the ttp parser to get all of the data. You can also capture it with regex, but as far as I understand you won't need regex at all if you go the ttp route.

from ttp import ttp
import json

template_date = """
Exit Date: {{date1}}
"""

template_date2 = """
Exit Date: {{date1}} {{date2}}
"""

templates = [template_date, template_date2]

with open("text_text.txt") as f:
    data_to_parse = f.read()

def parsing(data_to_parse, ttp_template):
    parser = ttp(data=data_to_parse, template=ttp_template)
    parser.parse()

    # Get the result as a JSON string
    results = parser.result(format='json')[0]

    # Convert the JSON string into Python objects
    result = json.loads(results)

    print(result)

for ttp_template in templates:
    parsing(data_to_parse, ttp_template)

The output:

[{'date1': '03/10/2020'}]

[{'date1': '03/10/2020', 'date2': '03/06/2020'}]

The text_text.txt file:

Exit Date: 03/10/2020

Exit Date: 03/10/2020 03/06/2020

Regards.

CodePudding user response：

I was able to come up with a for-loop that does what I want. Thanks to everyone for your assistance!

exit_note_date = []
for index, row in exit_note.iterrows():
    x = row['Exit Note']
    edmatch = re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', x)
    if len(edmatch) > 0:
        edstring = [exit_date.strip('Exit Date:    ') for exit_date in edmatch]

        exit_note_date.append(edstring)
    else:
        exit_note_date.append('null')

exit_note['Exit Date note'] = pd.Series(exit_note_date)
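
For what it's worth, re.findall was returning every match all along; the original loop only kept edmatch[0], i.e. the first element of the returned list. A minimal sketch of a variant follows, assuming the goal is just the date strings: moving the capture group so that it wraps only the date means findall returns the dates directly and no stripping is needed. The helper name exit_dates and the sample note are made up for illustration and are not from the original data.

```python
import re

# Same pattern as in the question, but the capture group wraps only the date,
# so "Exit Date:" is matched without being returned.
EXIT_DATE = re.compile(r'Exit Date:.*?(\d{2}/\d{2}/\d{4})')

def exit_dates(note):
    """Return every exit date found in a note as a list of strings."""
    return EXIT_DATE.findall(note)

# Hypothetical sample note, for illustration only.
note = "Exit Date: 03/10/2020 reviewed. Exit Date: 03/06/2020 closed."

all_dates = exit_dates(note)
print(all_dates)     # the full list of matches
print(all_dates[0])  # what indexing with [0] keeps: only the first match

# Caveat: str.strip removes a *set of characters* from both ends, not a prefix.
# It happens to work for dates because digits and '/' are not in that set,
# but any text made up of those characters is eaten entirely.
print(repr('Exit Date: tea'.strip('Exit Date: ')))
```

In the loop above this would replace the findall/strip pair with a single call such as exit_note_date.append(exit_dates(x) or 'null'), under the assumption that 'null' should still mark rows without a match.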