Problem Statement
I have a pandas dataframe in which one of the column's values is of type list. I need to get the frequency of each item on that particular list.
For example:
import pandas as pd
data = [
{
"name": "fruits",
"values": ["apple", "banana", "cherry", "apple", "mango", "banana", "apple"]
},
{
"name": "cars",
"values": ["Audi", "Ferrari", "Ferrari", "Audi", "honda", "Audi"]
},
{
"name": "animals",
"values": ["dogs", "cats", "tiger", "tiger", "cats", "cats", "camel"]
}
]
df = pd.DataFrame(data)
Here if we print df, we will see the following dataframe.
name values
0 fruits [apple, banana, cherry, apple, mango, banana, ...
1 cars [Audi, Ferrari, Ferrari, Audi, honda, Audi]
2 animals [dogs, cats, tiger, tiger, cats, cats, camel]
Now, I would like to count the frequency of occurrence of each item in each row of values columns.
What I did
The solution I did is not efficient, but it was what I came up with so far. So trying to get the best possible way to solve it.
I took the help of python loops to count the frequency and again convert it back to the dataframe.
frequency_list = []
for idx, row in df.iterrows():
frequency = [{ "name": row["name"], "value": x, "frequency": row["values"].count(x)} for x in list(set(row["values"]))]
# sorting and getting top 5 frequency is optional
frequency_list.append(sorted(frequency, key=lambda x: x["frequency"], reverse=True)[:5])
On printing frequency_list we will get.
[[{'name': 'fruits', 'value': 'apple', 'frequency': 3},
{'name': 'fruits', 'value': 'banana', 'frequency': 2},
{'name': 'fruits', 'value': 'cherry', 'frequency': 1},
{'name': 'fruits', 'value': 'mango', 'frequency': 1}],
[{'name': 'cars', 'value': 'Audi', 'frequency': 3},
{'name': 'cars', 'value': 'Ferrari', 'frequency': 2},
{'name': 'cars', 'value': 'honda', 'frequency': 1}],
[{'name': 'animals', 'value': 'cats', 'frequency': 3},
{'name': 'animals', 'value': 'tiger', 'frequency': 2},
{'name': 'animals', 'value': 'camel', 'frequency': 1},
{'name': 'animals', 'value': 'dogs', 'frequency': 1}]]
Now I started to create a dataframe for each item in frequency_list and concatenate them.
frequency_df = pd.DataFrame()
for each_frequency in frequency_list:
temp_df = pd.DataFrame(each_frequency)
if frequency_df.empty:
frequency_df = temp_df
else:
frequency_df = pd.concat((frequency_df, temp_df), axis=0, ignore_index=True)
The data that frequency_df is holding looks like this:
name value frequency
0 fruits apple 3
1 fruits banana 2
2 fruits cherry 1
3 fruits mango 1
4 cars Audi 3
5 cars Ferrari 2
6 cars honda 1
7 animals cats 3
8 animals tiger 2
9 animals camel 1
10 animals dogs 1
EXPECTED OUTPUT
frequency
name value
animals camel 1
cats 3
dogs 1
tiger 2
cars Audi 3
Ferrari 2
honda 1
fruits apple 3
banana 2
cherry 1
mango 1
CodePudding user response:
Try:
print(
df.explode("values")
.groupby(["name", "values"])
.size()
.to_frame(name="frequency")
)
Prints:
frequency
name values
animals camel 1
cats 3
dogs 1
tiger 2
cars Audi 3
Ferrari 2
honda 1
fruits apple 3
banana 2
cherry 1
mango 1
CodePudding user response:
IIUC, you can try
out = (df.groupby('name')
.apply(lambda g: g[['values']].explode('values').value_counts())
.to_frame('frequency'))
print(out)
frequency
name values
animals cats 3
tiger 2
camel 1
dogs 1
cars Audi 3
Ferrari 2
honda 1
fruits apple 3
banana 2
cherry 1
mango 1
CodePudding user response:
Let us explode
on values
then do value_counts
df.explode('values').value_counts().sort_index()
name values
animals camel 1
cats 3
dogs 1
tiger 2
cars Audi 3
Ferrari 2
honda 1
fruits apple 3
banana 2
cherry 1
mango 1
dtype: int64