Fastest Way to Iterate Over Dataframe Column to Find Match in Strings-CodePudding

Here's a very truncated extract from a large dataframe:

name	age	city
ben	66	NY
rob	45	LON
james	22	LA

I also have a numerous strings that each contain different words but will contain one (not more) of the values in the name column.

For example:

"rob was born in London"
"ben once lived in New York"

For each string I want to iterate over the "name" column to find the name that matches the name in the string and return the age of the person.

So in the first example the desired result is 45 and in the second example the desired result is 66.

I am new to Pandas and am struggling. Can anyone point me in the right direction?

CodePudding user response：

Hope this helps:

List of all strings. This can be part of another dataframe. Just select the column where this values are & convert it to list.

l = ['rob was born in London', "ben once lived in New York"]

The dataframe from your example

df = pd.DataFrame({'name': ['ben', 'rob', 'james'],
                    'age': [66, 45, 22],
                    'city': ['NY', 'LON', 'LA']})

Final dataset where string & age exist.

age_dat = pd.DataFrame()

The first for-loop, loops over names from your original (df) dataset. The second for-loop loops over list of sentences (list l). If any name is found in l, it gets appended in age_dat.

for x in list(df.name):
   
    for z in l:
        if x in z:
            dat = pd.DataFrame()
            dat['string']=[z]
            dat['age'] = [df[df['name']==x].age.tolist()[0]]
       
            age_dat = age_dat.append(dat)

print(age_dat)



                          string  age
0  ben once lived in New York   66
0      rob was born in London   45

CodePudding user response：

Data

s = pd.Series(['rob was born in London', "ben once lived in New York"])
df = pd.DataFrame({'name': ['ben', 'rob', 'james'],
                    'age': [66, 45, 22],
                    'city': ['NY', 'LON', 'LA']})

Solution

who = s.str.extract('('   ')|('.join(df.name)   ')').bfill(axis=1)[0]
age_by_name = dict(zip(df.name, df.age))
pd.DataFrame({'text': s, 'age': who.map(age_by_name)})


                      text  age
0   rob was born in London  45
1   ben once lived in New York  66

Explanation

Use .str.extract to get the name in the string and then match it with the dataframe to get the age.