I'm having some trouble fixing the following problem: I have a dataframe with tokenised text on every row that looks (something) like the following
index feelings
1 [happy, happy, sad]
2 [neutral, sad, mad]
3 [neutral, neutral, happy]
and lists of words lst1=[happy, fantastic]
, lst2=[mad, sad]
, lst3=[neutral]
and I want to check for every row in my dataframe how many occurrences of the words in the list there are. So the output would look something like this:
index feelings occlst1 occlst2 occlst3
1 [happy, happy, sad] 2 1 0
2 [neutral, sad, mad] 0 2 1
3 [neutral, neutral, happy] 1 0 2
So, I want to make a new column and compare the dataframe cells to the list.
Thanks in advance!
CodePudding user response:
Use collections.Counter
Setup:
import pandas as pd
from collections import Counter # Load 'Counter'
df = pd.DataFrame({'feelings': [['happy', 'happy', 'sad'],
['neutral', 'sad', 'mad'],
['neutral', 'neutral', 'happy']]})
lst1 = ['happy', 'fantastic']
lst2 = ['mad', 'sad']
lst3 = ['neutral']
# Create an intermediate dict
occ = {'occlst1': lst1, 'occlst2': lst2, 'occlst3': lst3}
Update: as suggested by @mozway
def count_occ(sr):
return {col: sum([v for k, v in Counter(sr).items() if k in lst])
for col, lst in occ.items()}
df = pd.concat([df, df['feelings'].apply(count_occ).apply(pd.Series)], axis=1)
Note: I didn't use any other columns except feelings
for readability. However the concat
function restore all columns from df
.
Output:
>>> df
feelings occlst1 occlst2 occlst3
0 [happy, happy, sad] 2 1 0
1 [neutral, sad, mad] 0 2 1
2 [neutral, neutral, happy] 1 0 2
CodePudding user response:
You could built a reference Series, to match feelings with the list id. Then explode
merge
pivot_table
:
ref = pd.Series({e: 'occlist_%s' % (i 1) for i,l in enumerate([lst1, lst2, lst3]) for e in l}, name='cols')
## ref:
# happy occlst1
# fantastic occlst1
# mad occlst2
# sad occlst2
# neutral occlst3
# Name: cols, dtype: object
df.merge((df.explode('feelings') # lists to single rows
# create a new column with list id
.merge(ref, left_on='feelings', right_index=True)
# reshape back to 1 row per original index
.pivot_table(index='index', columns='cols', values='feelings', aggfunc='count', fill_value=0)
),
left_on='index', right_index=True # merge with original df
)
NB. I considered here that index
is a column, if is is an index, you need to add a df.reset_index()
step
output:
index feelings occlist_1 occlist_2 occlist_3
0 1 [happy, happy, sad] 2 1 0
1 2 [neutral, sad, mad] 0 2 1
2 3 [neutral, neutral, happy] 1 0 2
input:
df = pd.DataFrame({'index': [1, 2, 3],
'feelings': [['happy', 'happy', 'sad'],
['neutral', 'sad', 'mad'],
['neutral', 'neutral', 'happy']
]})
lst1=['happy', 'fantastic']
lst2=['mad', 'sad']
lst3=['neutral']