How to correctly extract a substring from a string in a pandas DataFrame?

I have a list of about 300 concepts and a pandas DataFrame with two columns: Abstract and Title. Some concepts from the list appear in the Abstract as substrings. I would like to extract those concepts from the Abstract and use them as labels for each record.

I am using the .str.extract function. I tried manually entering one of the concepts from the list:

dataset["Indexes"]= dataset["Abstract"].str.extract("(land reform)")

This works as expected. However, since the list has about 300 concepts, I would like to keep it in a pandas DataFrame and use that as the reference for extracting concepts from the abstract.

I have defined the list as a DataFrame column:


tofind = landvoc["English"]

Then I modified my earlier code as follows:


dataset["Indexes"]= dataset["Abstract"].str.extract(tofind)

but I got this error:

TypeError: unhashable type: 'Series'

Any suggestions about how to solve this issue?

Thanks

CodePudding user response:

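str.extract expects a single regex pattern string, not a Series, which is why passing tofind directly fails. Join the concepts into one alternation pattern and use str.findall, which returns every match found in each abstract: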

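# Build one alternation pattern from the concepts, e.g. "(small|black)"
concepts = fr"({'|'.join(tofind)})"
# findall returns a list of matches per row; join them into one label string
df['Indexes'] = df['Abstract'].str.findall(concepts).str.join(', ')
print(df)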

# Output
              Abstract       Indexes
0  black and small cat  black, small

Setup:

import pandas as pd

tofind = ['small', 'black']
df = pd.DataFrame({'Abstract': ['black and small cat']})
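
With a real vocabulary of around 300 concepts, the plain join can misfire in a few ways: a concept containing regex characters such as "(" changes the pattern, a short concept like "land" can match inside a longer word or win over "land reform", and missing values in the column break the join. Below is a minimal sketch of one way to guard against these. The sample rows and the names terms, pattern and matches are made up for illustration; landvoc and dataset only mimic the frames from the question, and the case-insensitive matching is an assumption about what is wanted.

```python
import re

import pandas as pd

# Hypothetical stand-ins for the question's `landvoc` and `dataset` frames.
landvoc = pd.DataFrame({"English": ["land", "land reform", "tenure (customary)", None]})
dataset = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Abstract": [
        "Land reform and customary tenure in practice",
        "Notes on tenure (customary) and land use",
        "An unrelated abstract about islands",
    ],
})

# Drop missing entries, then sort longest first so "land reform" is tried before "land".
terms = sorted(landvoc["English"].dropna().astype(str), key=len, reverse=True)

# re.escape keeps characters such as "(" literal. The lookarounds behave like word
# boundaries, so "land" does not match inside "islands".
pattern = r"(?<!\w)(?:" + "|".join(re.escape(t) for t in terms) + r")(?!\w)"

# Assumption: matching should ignore case. Matches keep the abstract's own casing.
matches = dataset["Abstract"].str.findall(pattern, flags=re.IGNORECASE)
dataset["Indexes"] = matches.str.join(", ")
print(dataset[["Abstract", "Indexes"]])
```

Rows with no matching concept end up with an empty string in Indexes. If only the first concept per abstract is needed, the same pattern wrapped in a capturing group should also work with str.extract.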