Home > front end >  Extract a substring from a column and replace column data frame
Extract a substring from a column and replace column data frame

Time:08-04

I need some help extracting a substring from a column in my data frame and then replacing that column with a substring. I was wondering if python would be better performance for stripping the string or using regular expression to substitute/replace the string with the substring.

The string looks something like this in the column:

Person
------
<Person 1234567 Tom Brady>
<Person 456789012 Mary Ann Thomas>
<Person 92145 John Smith>

What I would like is this:

Person
------
Tom Brady
Mary Ann Thomas
John Smith

What I have so far as far as regular expressions go is this:

/^([^.] [.] [^.] )[.]/g

And that just gets this part '<Person 1234567 ', not sure how to get the '>' from the end.

CodePudding user response:

You can first identify all the alphabets in keeping things simple with this code

res =  re.findall(r"[^()0-9-] ", string)
res[1]

This should return you a list of strings ['Person', 'Tom Brady'], you can then access the name of the Person with res[1]

**Remark: I have yet to try the code, in the case that it also returns spaces, you should be able to easily remove them with strip() or it should be the the third string of the list res[3] instead.

You can read more about re.findall() online or through the documentation.

CodePudding user response:

Python regex has a function called search that finds the matching pattern in a string. With the examples given, you can use regex to extract the names with:

import re
s = "<Person 1234567 John Smith>"
re.search("[A-Z][a-z] (\s[A-Z][a-z] ) ", s).group(0)
>>> 'John Smith'

The regular expression [A-Z][a-z] (\s[A-Z][a-z] ) is just matching the names (Tom Brady, Mary Ann Thomas, etc.)

I like to use Panda's apply function to apply an operation on each row, so the final result would look like this:

import re
import pandas as pd

def extract_name(row):
    row["Person"] = re.search("[A-Z][a-z] (\s[A-Z][a-z] ) ", row["Person"]).group(0)
    return row

df = YOUR DATAFRAME
df2 = df.apply(extract_name, axis=1)

and df2 has the Person column with the extracted names.

  • Related