Extract Words As A List After A Specific Word

Time:03-11

I've got this DataFrame, which is a description of sports and their leagues:

df = pd.DataFrame({
    'text': ['The following leagues were identified: sport: basketball league: NBA sport: soccer league:EPL sport: football league: NFL']
})

df.text.iloc[0]
===============
The following leagues were identified: sport: basketball league: NBA sport: soccer league: EPL sport: football league: NFL

I need to extract all the sport names (which are coming after sport:) and put them as a list in a new column sports. I'm trying the following code:

pat = r'sport:\W+(?P<sports>(?:\w+))'
new = df.text.str.extract(pat, expand=True)
df.assign(**new)
===============
                        text                              sports
0   The following leagues were identified: sport: ...   basketball

However, it's returning only the 1st occurrence of the sport as a standalone string, whereas I need all the sports as a list.

Desired output:

               text                               sports
0   The following leagues were ...  [basketball, soccer, football]

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

CodePudding user response:

You can use Series.str.findall:

df['sports'] = df['text'].str.findall(r'sport:\W*(\w+)')

Pandas test:

import pandas as pd
df = pd.DataFrame({
    'text': ['The following leagues were identified: sport: basketball league: NBA sport: soccer league:EPL sport: football league: NFL']
})

Output:

>>> df['text'].str.findall(r'sport:\W*(\w+)')
0    [basketball, soccer, football]
Name: text, dtype: object
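
To make the difference between the two approaches concrete, here is a minimal sketch that runs str.extract and str.findall with the same pattern on the question's data. The names pattern, first_only and via_extractall are illustrative and do not appear in the original post. The str.extractall variant is an alternative added here for comparison, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'text': ['The following leagues were identified: sport: basketball league: NBA '
             'sport: soccer league: EPL sport: football league: NFL']
})

# "sport:", any run of non-word characters, then one captured word
pattern = r'sport:\W*(\w+)'

# str.extract keeps only the first match in each row
first_only = df['text'].str.extract(pattern, expand=False)

# str.findall collects every match in each row into a Python list
df['sports'] = df['text'].str.findall(pattern)

# Alternative: str.extractall yields one row per match, indexed by
# (original row, match number); grouping on the original row rebuilds the lists
via_extractall = df['text'].str.extractall(pattern)[0].groupby(level=0).agg(list)

print(first_only.iloc[0])
print(df['sports'].iloc[0])
print(via_extractall.iloc[0])
```

str.findall is the shortest route when a list per row is all that is needed. str.extractall can be handier when the matches should first be inspected or filtered as separate rows.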