I am trying to write a for-loop that iterates over individual rows. It uses a regex to find a specific date identified by its label, strips the label, and saves the date itself in a list for placement in an appropriate empty column.
My issue is that some rows have multiple dates with the same label (e.g. 'Exit Date: xx/xx/xxxx'), and the re.findall in my for-loop only saves the first date that matches the pattern instead of all of them.
My barebones regex test, which only runs on a single row (37), finds all dates and prints them correctly. However, as soon as I extend the pattern to re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', x) in the for-loop, it returns only a single date instead of all of them (when there is more than one 'Exit Date: xx/xx/xxxx').
x = exit_note.loc[37, 'Exit Note']
match = re.findall(r'(\d{2}/\d{2}/\d{4})', x)
if match:
    print(match)
else:
    print('no match')
Prints out ['03/10/2020', '03/06/2020']
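To check whether the longer pattern itself is what drops dates, here is a minimal sketch that runs both patterns side by side on a made-up note string (the sample text is an assumption; the real 'Exit Note' contents are not shown). With one 'Exit Date:' label per date, re.findall returns every match for both patterns, and only indexing the result with [0] reduces it to the first one.

```python
import re

# Hypothetical sample text, not taken from the real data.
sample = "Exit Date: 03/10/2020 follow-up. Exit Date: 03/06/2020 closed."

# Date-only pattern, as in the single-row test.
dates_only = re.findall(r'\d{2}/\d{2}/\d{4}', sample)

# Labelled pattern, as in the for-loop.
labelled = re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', sample)

print(dates_only)   # ['03/10/2020', '03/06/2020']
print(labelled)     # ['Exit Date: 03/10/2020', 'Exit Date: 03/06/2020']
print(labelled[0])  # Exit Date: 03/10/2020
```

Note that if a row held several dates after a single label (for example 'Exit Date: 03/10/2020 03/06/2020'), the labelled pattern would match only once, because each match needs its own 'Exit Date:' prefix.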
The actual for-loop code is as follows:
exit_note_date = []
for index, row in exit_note.iterrows():
    x = row['Exit Note']
    edmatch = re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', x)
    if len(edmatch) > 0:
        edstring = edmatch[0].strip('Exit Date: ')
        exit_note_date.append(edstring)
    else:
        exit_note_date.append('null')
    print(edstring)
exit_note['Exit Date note'] = pd.Series(exit_note_date)
The for-loop runs, but only a single date per row ends up in the appropriate column.
Any ideas on how to make the for-loop enter the appropriate number of dates into the date column when more than one date exists in the row? I am new to Python, new to regex, and new to pandas, but my understanding is that re.findall should return every match instead of just the first one it finds.
Thanks!
CodePudding user response:
As an additional answer, you can use the ttp parser to get all the data. You could also capture it with a regex but, as far as I understand, you won't even need one if you go the ttp route.
from ttp import ttp
import json

template_date = """
Exit Date: {{date1}}
"""
template_date2 = """
Exit Date: {{date1}} {{date2}}
"""
templates = [template_date, template_date2]

with open("text_text.txt") as f:
    data_to_parse = f.read()

def parsing(data_to_parse, ttp_template):
    parser = ttp(data=data_to_parse, template=ttp_template)
    parser.parse()
    # get the result as a JSON string
    results = parser.result(format='json')[0]
    # print(results)
    # convert the JSON string to Python objects
    result = json.loads(results)
    print(result)

for ttp_template in templates:
    parsing(data_to_parse, ttp_template)
See the output:
[{'date1': '03/10/2020'}]
[{'date1': '03/10/2020', 'date2': '03/06/2020'}]
See the text_text.txt file:
Exit Date: 03/10/2020
Exit Date: 03/10/2020 03/06/2020
Regards.
CodePudding user response:
I was able to come up with a for-loop that does what I want. Thanks to everyone for your assistance!
exit_note_date = []
for index, row in exit_note.iterrows():
    x = row['Exit Note']
    edmatch = re.findall(r'(Exit Date:.*?\d{2}/\d{2}/\d{4})', x)
    if len(edmatch) > 0:
        # strip the label from every match, not just the first one
        edstring = [exit_date.strip('Exit Date: ') for exit_date in edmatch]
        exit_note_date.append(edstring)
    else:
        exit_note_date.append('null')
exit_note['Exit Date note'] = pd.Series(exit_note_date)
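A possible variation, offered as a sketch rather than a drop-in replacement: put the capture group around the date only, so no label stripping is needed. This matters because str.strip('Exit Date: ') removes any of those characters from both ends of the string rather than the literal prefix; it gives the right result here only because the dates consist of digits and slashes. The helper name extract_exit_dates and the sample notes below are hypothetical.

```python
import re

# Capture only the date; the label and any text before the date stay outside the group.
EXIT_DATE_RE = re.compile(r'Exit Date:.*?(\d{2}/\d{2}/\d{4})')

def extract_exit_dates(note):
    """Return every date that follows an 'Exit Date:' label, or [] if there is none."""
    # Guard against non-string cells (e.g. missing values), which re cannot search.
    if not isinstance(note, str):
        return []
    return EXIT_DATE_RE.findall(note)

# Hypothetical sample notes, not taken from the real data.
notes = [
    "Exit Date: 03/10/2020 re-entered. Exit Date: 03/06/2020",
    "No exit recorded",
    None,
]
for note in notes:
    print(extract_exit_dates(note))
```

Assuming the DataFrame from the question, exit_note['Exit Date note'] = exit_note['Exit Note'].apply(extract_exit_dates) would fill the column without an explicit loop and keep the rows aligned by index, whereas pd.Series(exit_note_date) lines up correctly only when exit_note has the default 0..n-1 index.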