Here's a very truncated extract from a large dataframe:
name | age | city |
---|---|---|
ben | 66 | NY |
rob | 45 | LON |
james | 22 | LA |
I also have a numerous strings that each contain different words but will contain one (not more) of the values in the name column.
For example:
- "rob was born in London"
- "ben once lived in New York"
For each string I want to iterate over the "name" column to find the name that matches the name in the string and return the age of the person.
So in the first example the desired result is 45 and in the second example the desired result is 66.
I am new to Pandas and am struggling. Can anyone point me in the right direction?
CodePudding user response:
Hope this helps:
List of all strings. This can be part of another dataframe. Just select the column where this values are & convert it to list.
l = ['rob was born in London', "ben once lived in New York"]
The dataframe from your example
df = pd.DataFrame({'name': ['ben', 'rob', 'james'],
'age': [66, 45, 22],
'city': ['NY', 'LON', 'LA']})
Final dataset where string
& age
exist.
age_dat = pd.DataFrame()
The first for-loop, loops over names from your original (df
) dataset. The second for-loop loops over list of sentences
(list l
). If any name is found in l
, it gets appended in age_dat
.
for x in list(df.name):
for z in l:
if x in z:
dat = pd.DataFrame()
dat['string']=[z]
dat['age'] = [df[df['name']==x].age.tolist()[0]]
age_dat = age_dat.append(dat)
print(age_dat)
string age
0 ben once lived in New York 66
0 rob was born in London 45
CodePudding user response:
Data
s = pd.Series(['rob was born in London', "ben once lived in New York"])
df = pd.DataFrame({'name': ['ben', 'rob', 'james'],
'age': [66, 45, 22],
'city': ['NY', 'LON', 'LA']})
Solution
who = s.str.extract('(' ')|('.join(df.name) ')').bfill(axis=1)[0]
age_by_name = dict(zip(df.name, df.age))
pd.DataFrame({'text': s, 'age': who.map(age_by_name)})
text age
0 rob was born in London 45
1 ben once lived in New York 66
Explanation
Use .str.extract
to get the name in the string and then match it with the dataframe to get the age.