I have a pandas DataFrame column Name whose elements always contain a first name, a last name, and the word "over" or "under". For example:
Name = [Michael Johnson Over, Michael Johnson Under, John Smith Over, John Smith Under]
I'm trying to create a new column Name2 that extracts either "Over" or "Under" from Name. So for the example above:
Name2 = [Over, Under, Over, Under]
I've tried different variations of .split and findall, but can't figure out how to get a new column that contains just Over or Under. Please help!
CodePudding user response:
.str is an accessor on pd.Series that exposes string-handling methods such as .str.contains. You can set a new column with boolean indexing, where the condition is whether the row in "Name" contains the keyword "Over" or "Under".
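import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Michael Johnson Over",
            "Michael Johnson Under",
            "John Smith Over",
            "John Smith Under",
        ],
    }
)

# Start with an empty column, then fill it row by row using boolean masks.
df["Name2"] = None
# Assign through .loc; chained indexing such as df["Name2"][mask] = ...
# raises SettingWithCopyWarning and may not write back to df.
df.loc[df["Name"].str.contains("Over"), "Name2"] = "Over"
df.loc[df["Name"].str.contains("Under"), "Name2"] = "Under"
print(df)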
Output:
                    Name  Name2
0   Michael Johnson Over   Over
1  Michael Johnson Under  Under
2        John Smith Over   Over
3       John Smith Under  Under
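As a rough sketch of the same idea in a single step, Series.str.extract can pull the keyword out with a regular expression. This assumes the keyword is always the final word of Name and is capitalized as in the example; the pattern is my assumption about the data, not something stated in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Michael Johnson Over",
            "Michael Johnson Under",
            "John Smith Over",
            "John Smith Under",
        ],
    }
)

# Capture "Over" or "Under" only when it is the last word of the string
# (assumed pattern). expand=False returns a Series instead of a
# one-column DataFrame, so it can be assigned directly to a new column.
df["Name2"] = df["Name"].str.extract(r"\b(Over|Under)$", expand=False)
print(df)
```

Rows that do not match the pattern get NaN in Name2, which makes unexpected values easy to spot.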
CodePudding user response:
You can use the pandas str.rsplit method to split the string from the end, and use the n parameter to limit the number of splits to one. You can also pass expand=True to split the strings into separate columns.
df[['First_Last','Name2']] = df['Name'].str.rsplit(' ', n=1, expand=True)
Output:
                    Name       First_Last  Name2
0   Michael Johnson Over  Michael Johnson   Over
1  Michael Johnson Under  Michael Johnson  Under
2        John Smith Over       John Smith   Over
3       John Smith Under       John Smith  Under
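If the First_Last column is not needed, a shorter variant might keep only the last piece of the split. This is a minimal sketch, again assuming the keyword is always the final space-separated word in Name:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Michael Johnson Over",
            "Michael Johnson Under",
            "John Smith Over",
            "John Smith Under",
        ],
    }
)

# Split once from the right, which gives a list per row,
# then take the last element of each list with .str[-1].
df["Name2"] = df["Name"].str.rsplit(" ", n=1).str[-1]
print(df)
```

Unlike a keyword match, this copies whatever the last word happens to be, so it only gives the intended result when every row really ends in Over or Under.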