I need some help extracting a substring from a column in my data frame and then replacing that column with a substring. I was wondering if python would be better performance for stripping the string or using regular expression to substitute/replace the string with the substring.
The string looks something like this in the column:
Person
------
<Person 1234567 Tom Brady>
<Person 456789012 Mary Ann Thomas>
<Person 92145 John Smith>
What I would like is this:
Person
------
Tom Brady
Mary Ann Thomas
John Smith
What I have so far as far as regular expressions go is this:
/^([^.] [.] [^.] )[.]/g
And that just gets this part '<Person 1234567 ', not sure how to get the '>' from the end.
CodePudding user response:
You can first identify all the alphabets in keeping things simple with this code
res = re.findall(r"[^()0-9-] ", string)
res[1]
This should return you a list of strings ['Person', 'Tom Brady']
, you can then access the name of the Person with res[1]
**Remark: I have yet to try the code, in the case that it also returns spaces, you should be able to easily remove them with strip()
or it should be the the third string of the list res[3]
instead.
You can read more about re.findall()
online or through the documentation.
CodePudding user response:
Python regex has a function called search
that finds the matching pattern in a string. With the examples given, you can use regex to extract the names with:
import re
s = "<Person 1234567 John Smith>"
re.search("[A-Z][a-z] (\s[A-Z][a-z] ) ", s).group(0)
>>> 'John Smith'
The regular expression [A-Z][a-z] (\s[A-Z][a-z] )
is just matching the names (Tom Brady, Mary Ann Thomas, etc.)
I like to use Panda's apply
function to apply an operation on each row, so the final result would look like this:
import re
import pandas as pd
def extract_name(row):
row["Person"] = re.search("[A-Z][a-z] (\s[A-Z][a-z] ) ", row["Person"]).group(0)
return row
df = YOUR DATAFRAME
df2 = df.apply(extract_name, axis=1)
and df2
has the Person column with the extracted names.