Count values in a dataframe column of lists

Time:07-26

I have a column in a dataframe that looks something like this:

0                                                   NaN
1                                              ["arts"]
2                                       ["sports", "tech"]
3     ["arts", "finance", "health", "sports", "science"...
4                            ["finance", "sports", "tech"]
5                    ["arts", "finance", "sports", "tech"]
6     ["arts", "finance", "health", "sports", "science"...
7                            ["arts", "sports", "science"]

I would love to know how many times "arts" occurred across all these lists. However, when I tried column.explode().value_counts(sort=True), I just got a distribution of the whole lists rather than of the individual values, which is not what I want.

["tech"]                                                   5
["arts", "finance", "sports", "tech"]                         2
["arts", "sports"]                                            2
["finance", "sports"]                                         1
["arts"]                                                   1

I also tried a counter, collections.Counter(itertools.chain.from_iterable(v.split(',') for v in column)), but I get the following error: 'float' object has no attribute 'split'

Any pointers?

CodePudding user response:

If column is not too long, a simple nested loop should work just fine:

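count = 0
for str_list in column:
    # Skip missing rows: NaN is a float and cannot be iterated
    if not isinstance(str_list, list):
        continue
    for name in str_list:
        if name == "arts":
            count += 1

print(count)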
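
The quoted output in the question suggests the cells may actually be JSON-style strings rather than Python lists, which would also explain why explode() left them untouched. As a rough sketch under that assumption, the Counter attempt from the question could be adapted as follows. The helper parse_cell and the sample values are hypothetical and only for illustration:

```python
import json
from collections import Counter
from itertools import chain

def parse_cell(cell):
    """Return the cell as a list of tags; missing or malformed cells give []."""
    if isinstance(cell, list):
        return cell
    if isinstance(cell, str):
        try:
            parsed = json.loads(cell)
        except json.JSONDecodeError:
            return []
        return parsed if isinstance(parsed, list) else []
    return []  # NaN (a float) or any other type

# Hypothetical sample mirroring the shape of the question's column
column = [
    float("nan"),
    '["arts"]',
    '["sports", "tech"]',
    '["arts", "finance", "sports", "tech"]',
]

# Flatten all parsed lists into one stream of tags and count them
counts = Counter(chain.from_iterable(parse_cell(c) for c in column))
print(counts["arts"])
```

Iterating over a pandas Series yields its values, so the same expression should work if column is a Series instead of a plain list.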

CodePudding user response:

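You can create a mask of rows that contain arts by doing something like this (the isinstance check skips NaN rows, which are floats and would otherwise raise a TypeError):

mask = df['industry'].apply(lambda x: isinstance(x, (list, str)) and 'arts' in x)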

Then restrict your dataframe to the rows selected by the mask:

df = df[mask]

From there, len(df) gives the number of rows that contain arts. If you only need the count, mask.sum() returns the same number without overwriting df.
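
If you want counts for every value rather than just one, a possible alternative is to turn each cell into a real list first and then use explode() with value_counts(). This is only a sketch: it assumes the cells are JSON-style strings, reuses the 'industry' column name from the mask example, and runs on made-up sample data:

```python
import json
import pandas as pd

# Hypothetical sample data; NaN stands in for the missing row in the question
df = pd.DataFrame({
    "industry": [
        float("nan"),
        '["arts"]',
        '["sports", "tech"]',
        '["arts", "sports"]',
    ]
})

# Parse JSON-style strings into real lists; leave missing values untouched
parsed = df["industry"].apply(lambda x: json.loads(x) if isinstance(x, str) else x)

# explode() gives one value per row; value_counts() drops NaN by default
tag_counts = parsed.explode().value_counts()
print(tag_counts.get("arts", 0))
```

If the column already holds Python lists, the parsing step can be dropped and explode() applied directly.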
