Build a vocabulary of distinct words from pandas

Time:04-13

I have a Series with an index named "Article Number" and a column named "Text". The Series looks like this:

Article ID   |   Text
  Article#1     This is a beautiful day
  Article#2     I love you
  Article#3     This is too late
  Article#4     Love you back
  Article#5     This is a lovely day

distinct_words = ['This', 'beautiful', 'day']

I'd like to create a dictionary whose keys are the distinct words and whose values are lists of the articles each word appears in. For the example above, it would be:

vocabulary = {"This": ["Article#1", "Article#3", "Article#5"], "beautiful": ["Article#1"], "day": ["Article#1", "Article#5"]}
      

What I have written is:

vocabulary ={}
for word in distinct_words:
    filt = df.str.findall(word)
    vocabulary[word] = df.loc[filt].index

However, I get this error:

TypeError: unhashable type: 'list'

Can someone help me with this problem? I have tried a nested loop, but because my original file is large it takes minutes, and the result needs to be computed in under 40 seconds. I was told that using the re module would help.

CodePudding user response:

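Series.str.findall() returns a list of matches for each row, so the result is a Series of lists rather than a boolean mask, and passing it to .loc raises the TypeError. Use Series.str.contains() instead, which returns True or False for each row depending on whether the text contains the word. Note that str.contains() matches substrings by default, so "day" would also match "today".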

vocabulary = {}
for word in distinct_words:
    # str.contains() returns a boolean mask, which can be used to select rows
    vocabulary[word] = df[df['Text'].str.contains(word)].index.tolist()
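For reference, here is a self-contained sketch of the same idea, plus one possible alternative for large inputs. The sample Series, its names, and both helper functions (build_vocabulary_contains and build_vocabulary_single_pass) are made up for illustration and are not part of the question. The first helper wraps each word in \b word boundaries so that "day" does not match "today"; the second tokenizes every article once instead of scanning the whole Series once per word, which may help when there are many distinct words. Whether it fits in the 40-second limit depends on the data, so treat it as something to measure, not a guarantee.

```python
import re
from collections import defaultdict

import pandas as pd

# Sample data mirroring the question; the Series name and index name are assumptions.
texts = pd.Series(
    {
        "Article#1": "This is a beautiful day",
        "Article#2": "I love you",
        "Article#3": "This is too late",
        "Article#4": "Love you back",
        "Article#5": "This is a lovely day",
    },
    name="Text",
)
texts.index.name = "Article ID"
distinct_words = ["This", "beautiful", "day"]


def build_vocabulary_contains(series, words):
    """One str.contains() call per word, matching whole words only."""
    vocabulary = {}
    for word in words:
        # re.escape guards against regex metacharacters inside the word
        pattern = r"\b" + re.escape(word) + r"\b"
        mask = series.str.contains(pattern, regex=True, na=False)
        vocabulary[word] = series[mask].index.tolist()
    return vocabulary


def build_vocabulary_single_pass(series, words):
    """Tokenize each article once and record which wanted words it contains."""
    wanted = set(words)
    found = defaultdict(list)
    for article_id, text in series.items():
        # set() so an article is listed once per word even if the word repeats
        for token in set(re.findall(r"\w+", text)) & wanted:
            found[token].append(article_id)
    # Plain dict with every requested word present, in the order given
    return {word: found.get(word, []) for word in words}


print(build_vocabulary_contains(texts, distinct_words))
print(build_vocabulary_single_pass(texts, distinct_words))
```

Both helpers are case-sensitive and return the same dictionary for the sample data. The single-pass version assumes words are runs of \w characters, so hyphenated or apostrophized words would need a different tokenizer.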