Count frequencies (unique rows) from a pandas list type column-CodePudding

I have a dataframe (df) like this:

id  col
1  [A, B, C, C]
2  [B, C, D]
3  [C, D, E]

And, I have list like this:

l = ["A", "C", "F"]

For each element in l, I want to count the unique rows they appear in df.

'A': 1, 'C': 3, 'F': 0

But I'm not getting the part where I can check if the value exists in the list-column of the dataframe.

d = {}
for i in l:
    df_tmp = df[i.isin(df['col'])]['id'] ## wrong, showing error, isin is not a string attribute
    d[i] = len(df_tmp)

Anyway I can fix this? Or is there a more cleaner/efficient way?

N.B. There is a similar question Frequency counts for a pandas column of lists, but it is different as I have an external list to check the frequency.

CodePudding user response：

Here we are using apply method that applies given function to each element of the column (in our case it is the function that tests whether an element belongs to the list or not), then we sum True values, i.e. rows in which we found requested values and eventually save it to the dictionary. And we do it for all requested letters. I have not tested performance of this solution.

    import pandas as pd
    
    df = pd.DataFrame([
        {'id': 1, 'col': ['A', 'B', 'C', 'C']},
        {'id': 2, 'col': ['B', 'C', 'D']},
        {'id': 3, 'col': ['C', 'D', 'E']}])
    
    letters = ["A", "C", "D", "F"]
    res = {v: df['col'].apply(lambda x: v in x).sum()
           for v in letters}

    # output
    # {'A': 1, 'C': 3, 'D': 2, 'F': 0}

CodePudding user response：

You can just check the membership in the list for each value in ['A', 'C', 'F'] and compute sum() like:

vals = ['A', 'C', 'F']
{val: df['col'].apply(lambda x: val in x).sum() for val in vals}

output:

{'A': 1, 'C': 3, 'F': 0}

CodePudding user response：

You can explode col column and keep rows where value in l list then use value_counts() to count value in Series.

l = ["A", "C", "D", "F"]

col = df['col'].apply(set).explode(ignore_index=True)
out = col[col.isin(l)].value_counts().reindex(l, fill_value=0).to_dict()

# or without define `col`

out = (df['col'].apply(set).explode(ignore_index=True)
       [lambda d: d.isin(l)]
       .value_counts().reindex(l, fill_value=0).to_dict())

print(out)

{'A': 1, 'C': 3, 'D': 2, 'F': 0}