I have a list of about 300 concepts and a pandas DataFrame with two columns: Abstract and Title. Some concepts from the list appear in the Abstract as substrings. I would like to extract those concepts from the Abstract and use them as labels for each record.
I am using the .str.extract function. I tried manually entering one of the concepts from the list:
dataset["Indexes"]= dataset["Abstract"].str.extract("(land reform)")
This works as expected. However, since the list has about 300 concepts, I would like to use the whole list as the reference for extracting concepts from the abstract.
I have defined the list from a dataframe column:
tofind = landvoc["English"]
Then I modified my earlier code as follows:
dataset["Indexes"]= dataset["Abstract"].str.extract(tofind)
but I got this error
TypeError: unhashable type: 'Series'
Any suggestions about how to solve this issue?
Thanks
Answer:
You can use str.findall:
concepts = fr"({'|'.join(tofind)})"
df['Indexes'] = df['Abstract'].str.findall(concepts).str.join(', ')
print(df)
# Output
              Abstract       Indexes
0  black and small cat  black, small
Setup:
import pandas as pd

tofind = ['small', 'black']
df = pd.DataFrame({'Abstract': ['black and small cat']})
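With a real vocabulary of about 300 entries, a plain '|'.join can misbehave: a concept containing regex metacharacters such as parentheses changes the pattern, a short concept can match before a longer one that contains it, and a concept can match inside a longer word. Note also that .str.extract expects a single regular-expression string rather than a Series, and it returns only the first match per row. Below is a minimal sketch of a more defensive variant, not a definitive implementation. The sample data is made up, the names dataset and tofind stand in for the ones in the question, and case-insensitive whole-word matching is an assumption about what is wanted.

```python
import re

import pandas as pd

# Hypothetical stand-ins for the question's landvoc["English"] and dataset.
tofind = pd.Series(["land reform", "land", "tenure (customary)", None])
dataset = pd.DataFrame({
    "Title": ["A", "B"],
    "Abstract": [
        "Land reform and tenure (customary) in the highlands",
        "No matching concept here",
    ],
})

# Drop missing values and duplicates, then put longer concepts first so
# that "land reform" is tried before "land".
terms = sorted(set(tofind.dropna().astype(str)), key=lambda t: (-len(t), t))

# Escape each concept so it is matched literally, and use lookarounds so a
# concept is not matched inside a longer word (e.g. "land" in "highlands").
alternatives = "|".join(re.escape(t) for t in terms)
pattern = rf"(?<!\w)(?:{alternatives})(?!\w)"

# findall returns every match per row; join turns each list into one label.
found = dataset["Abstract"].str.findall(pattern, flags=re.IGNORECASE)
dataset["Indexes"] = found.str.join(", ")

print(dataset["Indexes"].tolist())
```

The labels keep the casing found in the abstract (here "Land reform"), and rows without any concept get an empty string. If the labels should match the vocabulary exactly, lowercasing both sides before matching is one option.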