I have a pandas Series containing lists of string tokens. I want to find the common elements among all the lists along with their counts (the result should not be just the unique elements; it should include every element with its count across the Series). What I am currently doing is building a dictionary from the Series and counting term frequencies:
ham_tokens = {}
for l in df_ham.tokens:
    for t in l:
        if ham_tokens.get(t):
            # token already seen: increment its count
            ham_tokens[t] += 1
        else:
            ham_tokens[t] = 1
Here is a snapshot of my data:
0 [we, have, difficulties, delivering, your, EMOTION, no, due, to, unpaid, shipping, freight, htpps, cuidaragora, php]
1 [costcoreward, your, EMOTION, cash, back, has, been, remunerated, sorry, for, the, delay, click]
2 [your, civil, verdict, has, been, finaiized, get, your, payment, by, URLBRAND, juristalawll, bch]
3 [need, quick, cash, get, up, to, cash, loan, in, minutes, no, credit, needed, same, day, funding, apply, now, reply, stop, to, remove]
4 [authmsg, BRAND, verification, is, dont, share, to, anyone, else, EMOTION, id, account, cannot, access, rightnow, bit, ly]
What I need is a pandas method, or any other efficient (loop-less) approach, that can handle this problem.
CodePudding user response:
As @Mustafa Aydın suggests, you can use .explode() to create a pandas Series containing all the words. Then, using .value_counts(), you can count the number of occurrences. Finally, you can make a dictionary from this using dict():
dict(df_series.explode().value_counts())
For example:
>>> df_series
0 [a, b, c]
1 [a, c, d]
2 [q, c, b, c]
Name: 0, dtype: object
>>> df_series.explode()
0 a
0 b
0 c
1 a
1 c
1 d
2 q
2 c
2 b
2 c
Name: 0, dtype: object
>>> df_series.explode().value_counts()
c 4
a 2
b 2
d 1
q 1
Name: 0, dtype: int64
>>> dict(df_series.explode().value_counts())
{'c': 4, 'a': 2, 'b': 2, 'd': 1, 'q': 1}
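The same idea as a self-contained sketch you can paste and run. The sample data and the names `tokens` and `counts` are made up for illustration, and `.to_dict()` is used here in place of `dict()`; either should work. The empty list is included only to show that it does not add an entry to the result.

```python
import pandas as pd

# Hypothetical sample data standing in for the real token column.
tokens = pd.Series([
    ["a", "b", "c"],
    ["a", "c", "d"],
    [],                    # an empty list becomes NaN after explode()
    ["q", "c", "b", "c"],
])

# One row per token, then count how often each token occurs.
# value_counts() drops NaN by default, so the empty list adds nothing.
counts = tokens.explode().value_counts().to_dict()

print(counts)
```

If you need the counts as a Series for further filtering or sorting, simply leave off the final `.to_dict()`.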
CodePudding user response:
You can chain your token lists into one long iterable and then use a Counter on it.
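from collections import Counter
from itertools import chain
# "text" is the column holding the token lists; adjust to your column name
texts = chain.from_iterable(df["text"])
# count every token across all lists
count = Counter(texts)
print(count.items())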
Example output:
dict_items([('we', 2), ('i', 2), ('cash', 2), ('get', 2), ('been', 1)])
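A runnable sketch of the Counter approach, assuming a Series of token lists like the one in the question. The sample data and the names `tokens` and `token_counts` are illustrative only.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical sample data: each element of the Series is a list of tokens.
tokens = pd.Series([
    ["need", "quick", "cash", "get", "up", "to", "cash"],
    ["your", "cash", "back", "has", "been"],
    ["get", "your", "payment"],
])

# Iterating a Series yields its values (the lists), so chain.from_iterable
# flattens them lazily into one stream of tokens for Counter to tally.
token_counts = Counter(chain.from_iterable(tokens))

# most_common(n) returns the n most frequent tokens with their counts.
print(token_counts.most_common(1))
```

A Counter is a dict subclass, so `token_counts["cash"]` gives the count directly, and a token that never appears returns 0 instead of raising a KeyError.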