I’m trying to extract specific substrings from larger phrases contained in my Pandas dataframe. I have rows formatted like so:
Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of CARLA R. CONKIN of Fort Steele, British Columbia, to be Vice-Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of JUDY A. WHITE, Q.C., of Conne River, Newfoundland and Labrador, to be Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of GRETA SITTICHINLI of Inuvik, Northwest Territories, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
and I've been able to capture the capitalized names (e.g. DAVID MERRIGAN) with the regex below but I'm struggling to capture the locations, i.e. the 'of' statement following the capitalized name that ends with the second comma. I've tried just isolating the rest of the string that follows the name with the following code, but it just doesn't seem to work, I keep getting -1 as a response.
df_appointments['Name'] = df_appointments['Precis'].str.find(r'\b[A-Z] (?:\s [A-Z] )') df_appointments['Location'] = df_appointments['Precis'].str.find(r'\b[A-Z] (?:\s [A-Z] )\b\s([^\n\r]*)')
Any help showing me how to isolate the location substring with regex (after that I can figure out how to get the position, etc) would be tremendously appreciated. Thank you.
CodePudding user response:
The following pattern works for your sample set:
rgx = r'(?:\w\s) ([A-Z\s\.,] )(?:\sof\s)([A-Za-z\s] ,\s[A-Za-z\s] )'
It uses capture groups & non-capture groups to isolate only the names & locations from the strings. Rather than requiring two patterns, and having to perform two searches, you can then do the following to extract that information into two new columns:
df[['name', 'location']] = df['precis'].str.extract(rgx)
This then produces:
df
precis name location
0 Appointment of... DAVID MERRIGAN Hammonds Plains, Nova Scotia
1 Appointment of... CARLA R. CONKIN Fort Steele, British Columbia
2 Appointment of... JUDY A. WHITE, Q.C., Conne River, Newfoundland and...
3 Appointment of... GRETA SITTICHINLI Inuvik, Northwest Territories`
Depending on the exact format of all of your precis
values, you might have to tweak the pattern to suit perfectly, but hopefully it gets you going...