Find common elements in series of lists

Time:09-15

I have a pandas Series containing lists of string tokens. I want to find the common elements across all the lists along with their counts (not just the unique ones; bring every element with its count across the series). What I am currently doing is building a dictionary from the pandas Series and counting term frequencies:

ham_tokens = {}
for l in df_ham.tokens:
    for t in l:
        if ham_tokens.get(t):
            ham_tokens[t] += 1
        else:
            ham_tokens[t] = 1

Here is a snapshot of my data:

0  [we, have, difficulties, delivering, your, EMOTION, no, due, to, unpaid, shipping, freight, htpps, cuidaragora, php]
1  [costcoreward, your, EMOTION, cash, back, has, been, remunerated, sorry, for, the, delay, click]
2  [your, civil, verdict, has, been, finaiized, get, your, payment, by, URLBRAND, juristalawll, bch]
3  [need, quick, cash, get, up, to, cash, loan, in, minutes, no, credit, needed, same, day, funding, apply, now, reply, stop, to, remove]
4  [authmsg, BRAND, verification, is, dont, share, to, anyone, else, EMOTION, id, account, cannot, access, rightnow, bit, ly]

What I need is a pandas method, or any other efficient (loop-free) approach, that can handle this problem.

CodePudding user response:

As @Mustafa Aydın suggests, you can use .explode() to create a pandas Series containing all the words, then count the number of occurrences with .value_counts(). Finally, dict() turns the result into a dictionary:

dict(df_series.explode().value_counts())

For example:

>>> df_series
0       [a, b, c]
1       [a, c, d]
2    [q, c, b, c]
Name: 0, dtype: object

>>> df_series.explode()
0    a
0    b
0    c
1    a
1    c
1    d
2    q
2    c
2    b
2    c
Name: 0, dtype: object

>>> df_series.explode().value_counts()
c    4
a    2
b    2
d    1
q    1
Name: 0, dtype: int64

>>> dict(df_series.explode().value_counts())
{'c': 4, 'a': 2, 'b': 2, 'd': 1, 'q': 1}
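Applied to the question's data, this is a one-liner. A minimal, self-contained sketch, assuming the token lists live in a column named tokens of a DataFrame df_ham (the sample rows below are hypothetical stand-ins for the question's data):

```python
import pandas as pd

# Hypothetical frame mimicking the question's df_ham
df_ham = pd.DataFrame({
    "tokens": [
        ["we", "have", "cash"],
        ["get", "cash", "now"],
    ]
})

# explode() flattens the lists into one long Series,
# value_counts() tallies each token across all rows
counts = dict(df_ham["tokens"].explode().value_counts())
```

Note that value_counts() skips NaN by default, so rows with empty lists (which explode() turns into NaN) are simply ignored.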

CodePudding user response:

You can chain your lists of tokens into one big iterable and then use a Counter on it.

from collections import Counter
from itertools import chain

texts = chain.from_iterable(df["text"])
count = Counter(texts)

print(count.items())

Example output:

dict_items([('we', 2), ('i', 2), ('cash', 2), ('get', 2), ('been', 1)])
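A Counter also gives you the tokens ranked by frequency via most_common(). A short sketch, using hypothetical token lists in place of df["text"]:

```python
from collections import Counter
from itertools import chain

# Hypothetical token lists standing in for df["text"]
rows = [["we", "have", "cash"], ["get", "cash", "now"]]

count = Counter(chain.from_iterable(rows))

# most_common(n) returns (token, count) pairs sorted by descending count
top = count.most_common(2)
```

This avoids a separate sort step when you only want the most frequent tokens.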