I am trying to count how many rows of a dataframe column contain each element of a list.
For example:
xlst = ['pak', 'vector', 'word', 'po']
df:
col A   col B  col C
pk-121  abc    pak is going great
pk-112  xyz    word is word my friend
pk-132  agh    vector needs working
pk-321  jkl    pak is winning
pk-333  yul    vector now
Desired df:
word count
pak 2
word 1
vector 2
CodePudding user response:
You can use a regex to match the words, then drop_duplicates (so that a word repeated within one row is counted once) and value_counts:
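import re
# extractall yields one row per match, indexed by (original row, match number).
out = (df['col C']
       .str.extractall(f"(?P<word>{'|'.join(map(re.escape, xlst))})")
       .droplevel('match').reset_index()
       # Keep one (row, word) pair so repeats within a row count once.
       .drop_duplicates()['word']
       .value_counts().reset_index(name='count')
      )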
Output:
index count
0 pak 2
1 vector 2
2 word 1
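For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample data as I read it from the question (the exact column layout is an assumption) and uses a different technique from the extractall chain: one str.contains test per word, with \b word boundaries so that a word is not matched inside a longer one.

```python
import re
import pandas as pd

# Sample data reconstructed from the question (assumed layout).
df = pd.DataFrame({
    'col A': ['pk-121', 'pk-112', 'pk-132', 'pk-321', 'pk-333'],
    'col B': ['abc', 'xyz', 'agh', 'jkl', 'yul'],
    'col C': ['pak is going great', 'word is word my friend',
              'vector needs working', 'pak is winning', 'vector now'],
})
xlst = ['pak', 'vector', 'word', 'po']

# For each word, count the rows whose text contains it as a whole word.
counts = {
    w: int(df['col C'].str.contains(rf"\b{re.escape(w)}\b", regex=True).sum())
    for w in xlst
}

# Build the two-column result and drop words that never occur.
out = pd.DataFrame({'word': list(counts), 'count': list(counts.values())})
out = out[out['count'] > 0].reset_index(drop=True)
print(out)
```

This scans the column once per word, so it is probably only a good fit when xlst is short.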
Alternative using str.get_dummies:
out = df['col C'].str.get_dummies(sep=' ').reindex(columns=xlst).sum()
Output:
pak 2.0
vector 2.0
word 1.0
po 0.0
dtype: float64
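The counts come out as floats because reindex adds the missing 'po' column filled with NaN. A possible variant, sketched here on the sample text from the question, passes fill_value=0 and casts to int, then reshapes the result into word/count columns. The names col, counts and out are only for illustration.

```python
import pandas as pd

# Text column from the question's sample data.
col = pd.Series(['pak is going great', 'word is word my friend',
                 'vector needs working', 'pak is winning', 'vector now'],
                name='col C')
xlst = ['pak', 'vector', 'word', 'po']

# One indicator column per space-separated token; words from xlst that
# never occur are added as all-zero columns instead of NaN.
counts = (col.str.get_dummies(sep=' ')
             .reindex(columns=xlst, fill_value=0)
             .sum()
             .astype(int))

# Turn the Series into a two-column frame: word, count.
out = counts.rename_axis('word').reset_index(name='count')
print(out)
```

Note that get_dummies splits on the separator only, so this matches whole space-delimited tokens; punctuation attached to a word would make it a different token.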