Home > database >  Get the frequency of individual items in a list of each row of a column in a dataframe
Get the frequency of individual items in a list of each row of a column in a dataframe

Time:05-26

Problem Statement

I have a pandas dataframe in which one of the column's values is of type list. I need to get the frequency of each item on that particular list.

For example:

import pandas as pd
data = [
    {
        "name": "fruits",
        "values": ["apple", "banana", "cherry", "apple", "mango", "banana", "apple"]
    },
    {
        "name": "cars",
        "values": ["Audi", "Ferrari", "Ferrari", "Audi", "honda", "Audi"]
    },
    {
        "name": "animals",
        "values": ["dogs", "cats", "tiger", "tiger", "cats", "cats", "camel"]
    }
]
df = pd.DataFrame(data)

Here if we print df, we will see the following dataframe.

    name    values
0   fruits  [apple, banana, cherry, apple, mango, banana, ...
1   cars    [Audi, Ferrari, Ferrari, Audi, honda, Audi]
2   animals [dogs, cats, tiger, tiger, cats, cats, camel]

Now, I would like to count the frequency of occurrence of each item in each row of values columns.

What I did

The solution I did is not efficient, but it was what I came up with so far. So trying to get the best possible way to solve it.

I took the help of python loops to count the frequency and again convert it back to the dataframe.

frequency_list = []
for idx, row in df.iterrows():
    frequency = [{ "name": row["name"], "value": x, "frequency": row["values"].count(x)} for x in list(set(row["values"]))]
    # sorting and getting top 5 frequency is optional
    frequency_list.append(sorted(frequency, key=lambda x: x["frequency"], reverse=True)[:5])

On printing frequency_list we will get.

[[{'name': 'fruits', 'value': 'apple', 'frequency': 3},
  {'name': 'fruits', 'value': 'banana', 'frequency': 2},
  {'name': 'fruits', 'value': 'cherry', 'frequency': 1},
  {'name': 'fruits', 'value': 'mango', 'frequency': 1}],
 [{'name': 'cars', 'value': 'Audi', 'frequency': 3},
  {'name': 'cars', 'value': 'Ferrari', 'frequency': 2},
  {'name': 'cars', 'value': 'honda', 'frequency': 1}],
 [{'name': 'animals', 'value': 'cats', 'frequency': 3},
  {'name': 'animals', 'value': 'tiger', 'frequency': 2},
  {'name': 'animals', 'value': 'camel', 'frequency': 1},
  {'name': 'animals', 'value': 'dogs', 'frequency': 1}]]

Now I started to create a dataframe for each item in frequency_list and concatenate them.

frequency_df = pd.DataFrame()
for each_frequency in frequency_list:
    temp_df = pd.DataFrame(each_frequency)
    if frequency_df.empty:
        frequency_df = temp_df
    else:
        frequency_df = pd.concat((frequency_df, temp_df), axis=0, ignore_index=True)

The data that frequency_df is holding looks like this:

    name    value   frequency
0   fruits  apple   3
1   fruits  banana  2
2   fruits  cherry  1
3   fruits  mango   1
4   cars    Audi    3
5   cars    Ferrari 2
6   cars    honda   1
7   animals cats    3
8   animals tiger   2
9   animals camel   1
10  animals dogs    1

EXPECTED OUTPUT

                frequency
name    value   
animals camel   1
        cats    3
        dogs    1
        tiger   2
cars    Audi    3
        Ferrari 2
        honda   1
fruits  apple   3
        banana  2
        cherry  1
        mango   1

CodePudding user response:

Try:

print(
    df.explode("values")
    .groupby(["name", "values"])
    .size()
    .to_frame(name="frequency")
)

Prints:

                 frequency
name    values            
animals camel            1
        cats             3
        dogs             1
        tiger            2
cars    Audi             3
        Ferrari          2
        honda            1
fruits  apple            3
        banana           2
        cherry           1
        mango            1

CodePudding user response:

IIUC, you can try

out = (df.groupby('name')
       .apply(lambda g: g[['values']].explode('values').value_counts())
       .to_frame('frequency'))
print(out)

                 frequency
name    values
animals cats             3
        tiger            2
        camel            1
        dogs             1
cars    Audi             3
        Ferrari          2
        honda            1
fruits  apple            3
        banana           2
        cherry           1
        mango            1

CodePudding user response:

Let us explode on values then do value_counts

df.explode('values').value_counts().sort_index()

name     values 
animals  camel      1
         cats       3
         dogs       1
         tiger      2
cars     Audi       3
         Ferrari    2
         honda      1
fruits   apple      3
         banana     2
         cherry     1
         mango      1
dtype: int64
  • Related