Using regex to capture substring within a pandas df

Time:08-26

I’m trying to extract specific substrings from larger phrases contained in my Pandas dataframe. I have rows formatted like so:

Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of CARLA R. CONKIN of Fort Steele, British Columbia, to be Vice-Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of JUDY A. WHITE, Q.C., of Conne River, Newfoundland and Labrador, to be Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of GRETA SITTICHINLI of Inuvik, Northwest Territories, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.

I've been able to capture the capitalized names (e.g. DAVID MERRIGAN) with the regex below, but I'm struggling to capture the locations, i.e. the 'of' phrase that follows the capitalized name and ends at the second comma. I've tried isolating the rest of the string that follows the name with the code below, but it doesn't work: I keep getting -1 as a response.

df_appointments['Name'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)')
df_appointments['Location'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)\b\s([^\n\r]*)')

Any help showing me how to isolate the location substring with regex (after that I can figure out how to get the position, etc) would be tremendously appreciated. Thank you.


CodePudding user response:

The following pattern works for your sample set:

rgx = r'(?:\w\s)+([A-Z\s\.,]+)(?:\sof\s)([A-Za-z\s]+,\s[A-Za-z\s]+)'

It uses capture groups & non-capture groups to isolate only the names & locations from the strings. Rather than requiring two patterns, and having to perform two searches, you can then do the following to extract that information into two new columns:

df[['name', 'location']] = df['precis'].str.extract(rgx)

This then produces:

df

   precis               name                        location
0  Appointment of...    DAVID MERRIGAN          Hammonds Plains, Nova Scotia
1  Appointment of...    CARLA R. CONKIN         Fort Steele, British Columbia
2  Appointment of...    JUDY A. WHITE, Q.C.,    Conne River, Newfoundland and...
3  Appointment of...    GRETA SITTICHINLI       Inuvik, Northwest Territories
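
For anyone who wants to reproduce this, here is a minimal self-contained sketch. It assumes a frame named `df` with a `precis` column built from two of the rows in the question (shortened after "to be ..."), and it writes the `+` quantifiers out explicitly. It also illustrates why the original attempt returned -1: `Series.str.find` looks for a literal substring, not a regex match, so the pattern text itself is never found.

```python
import pandas as pd

# Two sample rows from the question, shortened after "to be ..." for readability.
rows = [
    "Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, to be a member",
    "Appointment of JUDY A. WHITE, Q.C., of Conne River, Newfoundland and Labrador, to be Chairman",
]
df = pd.DataFrame({"precis": rows})

# Group 1: the capitalized name; group 2: "place, province".
rgx = r'(?:\w\s)+([A-Z\s\.,]+)(?:\sof\s)([A-Za-z\s]+,\s[A-Za-z\s]+)'

# str.find searches for the argument as literal text, so a regex yields -1 everywhere.
literal_hits = df["precis"].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)')

# str.extract applies the regex and returns one column per capture group.
df[["name", "location"]] = df["precis"].str.extract(rgx)
print(df[["name", "location"]])
```

Rows that do not match the pattern come back as NaN in both new columns, which is a quick way to spot precis values that need a different pattern.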

Depending on the exact format of all of your precis values, you might have to tweak the pattern to suit perfectly, but hopefully it gets you going...
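
As one example of such a tweak, the sketch below keeps the trailing comma out of the name (the "JUDY A. WHITE, Q.C.," case). The variant pattern `rgx_tweaked` is a hypothetical adjustment, not part of the pattern above: it assumes every name ends in a capital letter or a period, and it leaves an optional comma before " of " outside the capture group. The sample rows are shortened versions of the ones in the question.

```python
import pandas as pd

# Hypothetical variant: the name group must end in a capital letter or a period,
# and an optional comma before " of " sits outside the group.
rgx_tweaked = r'(?:\w\s)+([A-Z\s\.,]*[A-Z\.]),?\sof\s([A-Za-z\s]+,\s[A-Za-z\s]+)'

sample = pd.Series([
    "Appointment of JUDY A. WHITE, Q.C., of Conne River, Newfoundland and Labrador, to be Chairman",
    "Appointment of CARLA R. CONKIN of Fort Steele, British Columbia, to be Vice-Chairman",
])

# One column per capture group; rename them for readability.
extracted = sample.str.extract(rgx_tweaked)
extracted.columns = ["name", "location"]
print(extracted)
```

A simpler alternative, if changing the pattern feels risky, would be to keep the original regex and strip the leftover punctuation afterwards with `.str.rstrip(", ")` on the name column.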
