Searching Pandas column for words in list and adding found words into new column

Time:05-12

I have a list of words and a DataFrame:

import pandas as pd

data = {'test': ['dog is happy', 'dog is hap', 'doggy is hap']}
df = pd.DataFrame(data)

word_list = ['dog', 'hap', 'happy']

df
           test
0  dog is happy
1    dog is hap
2  doggy is hap

I'd like to add a column, call it 'words', that holds every word from the list that appears in the row as a whole word. My desired output is:

df

           test      words
0  dog is happy  dog happy
1  dog is hap    dog hap
2  doggy is hap  hap

Some solutions I've found on SO return 'hap' for the first row because 'happy' begins with 'hap' (the same problem occurs with 'dog' and 'doggy' in the third row). Other examples return True/False in the words column, but I want the actual words in that column. Thanks, and I'm glad to clarify any points of confusion.
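
To make the problem concrete, here is a small sketch (not taken from any of the posts mentioned) that contrasts a plain substring test with a whole-word test on a single string. The sample sentence and the variable names are made up for illustration; `\b` is the standard regex word-boundary anchor.

```python
import re

text = "doggy is happy"
words = ["dog", "hap", "happy"]

# Plain substring test: 'dog' and 'hap' are found inside the longer words.
substring_hits = [w for w in words if w in text]

# Whole-word test: \b requires a word boundary on both sides of the match.
whole_word_hits = [
    w for w in words
    if re.search(r"\b" + re.escape(w) + r"\b", text)
]

print(substring_hits)   # all three words match as substrings
print(whole_word_hits)  # only 'happy' is present as a whole word
```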

CodePudding user response:

This is pretty straightforward using set.intersection:

>>> words = {'dog', 'hap', 'happy'}
>>> df["matches"] = df["test"].str.split().apply(set(words).intersection)
>>> df
           test       matches
0  dog is happy  {happy, dog}
1    dog is hap    {dog, hap}
2  doggy is hap         {hap}

Of course, if you want your matches in a specific order or as a single whitespace-separated string, this won't do, but you probably don't want those things...
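
If the order and the space-separated string do matter, one possible variation is sketched below. It keeps the set for fast membership checks but walks the tokens in sentence order and drops repeats with `dict.fromkeys`. The helper name `ordered_matches` is hypothetical, and the sketch assumes words are separated by whitespace only (no punctuation attached to tokens).

```python
import pandas as pd

df = pd.DataFrame({"test": ["dog is happy", "dog is hap", "doggy is hap"]})
words = {"dog", "hap", "happy"}

def ordered_matches(sentence):
    # Exact token comparison means only whole words can match.
    hits = (token for token in sentence.split() if token in words)
    # dict.fromkeys keeps the first occurrence of each word, in order.
    return " ".join(dict.fromkeys(hits))

df["words"] = df["test"].apply(ordered_matches)
print(df)
```

A repeated word such as in "dog dog is happy" would be reported once; drop the `dict.fromkeys` call if repeats should be kept.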

CodePudding user response:

Here is a solution using str.findall(), where l is the list of words from the question:

l = ['dog', 'hap', 'happy']
df.assign(words = df['test'].str.findall('|'.join([r'\b{}\b'.format(i) for i in l])).str.join(' '))

Output:

           test      words
0  dog is happy  dog happy
1    dog is hap    dog hap
2  doggy is hap        hap
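
A slightly more defensive variant of the same idea, as a sketch: it escapes each word with `re.escape` in case the list ever contains regex metacharacters, and wraps the alternation in a single non-capturing group between two `\b` anchors. The names `word_list`, `pattern`, and `result` are illustrative. This assumes every word starts and ends with a word character; `\b` would not behave as intended for entries like 'c++'.

```python
import re
import pandas as pd

df = pd.DataFrame({"test": ["dog is happy", "dog is hap", "doggy is hap"]})
word_list = ["dog", "hap", "happy"]

# Builds \b(?:dog|hap|happy)\b; the group is non-capturing so that
# findall returns the full matched word rather than a sub-group.
pattern = r"\b(?:" + "|".join(re.escape(w) for w in word_list) + r")\b"

# assign returns a new DataFrame and leaves df unchanged.
result = df.assign(words=df["test"].str.findall(pattern).str.join(" "))
print(result)
```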

CodePudding user response:

I hope this is what you want.

filter_words = ['dog', 'happy', 'hap']

def add_words(x):
    # Keep only the tokens that exactly match a word in filter_words
    return ' '.join([
        token
        for token in x.split(' ')
        if token in filter_words
    ])

df['words'] = df['test'].apply(add_words)