I have a list of words and a DataFrame:
import pandas as pd
data = {'test': ['dog is happy', 'dog is hap', 'doggy is hap']}
df = pd.DataFrame(data)
words = ['dog', 'hap', 'happy']
df
           test
0  dog is happy
1    dog is hap
2  doggy is hap
I'd like to add a column, call it 'words', that checks each row for whole-word matches against the list. Every word from the list that appears as a whole word in the row should be added to the words column. My desired output is:
df
           test      words
0  dog is happy  dog happy
1    dog is hap    dog hap
2  doggy is hap        hap
Some posts I've found on SO return 'hap' for the first row because 'happy' begins with 'hap' (the same problem occurs with 'dog' and 'doggy' in the third row). Other examples return True/False in the words column, but I want the actual words in that column. Thanks, and I'm glad to clarify any points of confusion.
CodePudding user response:
This is pretty straightforward using set.intersection:
>>> words = {'dog', 'hap', 'happy'}
>>> df["matches"] = df["test"].str.split().apply(set(words).intersection)
>>> df
           test       matches
0  dog is happy  {happy, dog}
1    dog is hap    {dog, hap}
2  doggy is hap         {hap}
Of course, if you want your matches in a specific order or as a single whitespace-separated string, this won't do, but you probably don't want those things...
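If the order and a plain string do matter (the question's expected output suggests they might), a rough sketch along these lines could work. It is not part of the answer above: match_in_order is a made-up helper name, and the choice to drop repeated words is an assumption.

```python
# Hypothetical helper (not from the answer above): keeps matches in the
# order they appear in the text and returns them as one string.
words = {"dog", "hap", "happy"}

def match_in_order(text, vocabulary=words):
    found = []
    for token in text.split():
        # Whole-token comparison, so "doggy" never matches "dog".
        # Assumption: a word that occurs twice is reported only once.
        if token in vocabulary and token not in found:
            found.append(token)
    return " ".join(found)

print(match_in_order("dog is happy"))
print(match_in_order("doggy is hap"))
```

With the DataFrame from the question, this would presumably be applied as df["words"] = df["test"].apply(match_in_order).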
CodePudding user response:
Here is a solution using str.findall(), with the word list from the question stored in words:
words = ['dog', 'hap', 'happy']
df.assign(words=df['test'].str.findall('|'.join([r'\b{}\b'.format(i) for i in words])).str.join(' '))
Output:
           test      words
0  dog is happy  dog happy
1    dog is hap    dog hap
2  doggy is hap        hap
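To see why the word boundaries prevent 'hap' from matching inside 'happy', the same idea can be tried on plain strings with the standard re module. This is only a sketch: find_words is a made-up name, and wrapping each word in re.escape is an added precaution, assuming the list might one day contain regex metacharacters.

```python
import re

words = ["dog", "hap", "happy"]

# One non-capturing group between a single pair of \b anchors; this is
# equivalent to joining r"\bword\b" for each word. re.escape is a
# precaution (an assumption, not something the answer above does).
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")

def find_words(text):
    # findall returns whole matches because the group is non-capturing.
    return " ".join(pattern.findall(text))

print(find_words("dog is happy"))
print(find_words("doggy is hap"))
```

Because \b looks at word boundaries rather than spaces, this variant also picks up words that sit next to punctuation, as in 'dog, is happy.'.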
CodePudding user response:
I hope this is what you want.
filter_words = ['dog', 'happy', 'hap']
def add_words(x):
    return ' '.join([
        token
        for token in x.split(' ')
        if token in filter_words
    ])
df['words'] = df['test'].apply(add_words)