pandas aggregate list with % for each group


I have a dataframe like the one below:

name  id
A      1
A      2
B      5
B      5
B      4
C      6

I would like to do the following:

a) Group by name

b) create a list of unique id values for each name

c) populate the list with the % of occurrence of each id within each group. For example, id=5 appears 2 times out of 3 records for name = B, so its percentage is 66.66%

d) Sort the resulting list items based on %

I tried the following:

df.groupby('name')['id'].agg(list) #get list of ids for each name
df.groupby('name')['id'].agg(set)  #get unique ids for each name
df.groupby('name')['id'].nunique() #get count of unique ids for each name
df.groupby('name')['id'].count()   #get count of rows for each name

But none of these gives the % of id values within each group.

I have a big dataset of 5 million rows, so an efficient solution would be helpful.

I expect my output to be as below, with the list items sorted by %.

A [1 - 50%, 2 - 50%]
B [5 - 66.66%, 4 - 33.33%]
C [6 - 100%]

CodePudding user response:

Use a lambda function with Series.value_counts:

out = df.groupby('name')['id'].apply(lambda x: x.value_counts(normalize=True).tolist())
print (out)
name
A                                  [0.5, 0.5]
B    [0.6666666666666666, 0.3333333333333333]
C                                       [1.0]
Name: id, dtype: object

d = {i: x.value_counts(normalize=True).to_dict() for i, x in df.groupby('name')['id']}
print (d)
{'A': {1: 0.5, 2: 0.5}, 'B': {5: 0.6666666666666666, 4: 0.3333333333333333}, 'C': {6: 1.0}}

s = pd.Series(d)
print (s)
A                                  {1: 0.5, 2: 0.5}
B    {5: 0.6666666666666666, 4: 0.3333333333333333}
C                                          {6: 1.0}
dtype: object

s1 = df.groupby('name')['id'].apply(lambda x: x.value_counts(normalize=True).to_dict()).dropna()
print (s1)
name   
A     1    0.500000
      2    0.500000
B     5    0.666667
      4    0.333333
C     6    1.000000
Name: id, dtype: float64
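None of the snippets above produce the asker's exact "id - %" strings. A possible sketch of that last step (rebuilding the sample frame here for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "A", "B", "B", "B", "C"],
                   "id": [1, 2, 5, 5, 4, 6]})

# value_counts sorts by frequency descending, which already satisfies
# the "sort list items by %" requirement; items() yields (id, fraction).
out = df.groupby("name")["id"].apply(
    lambda x: [f"{i} - {p:.2%}" for i, p in x.value_counts(normalize=True).items()]
)
print(out)
# name = B gives ['5 - 66.67%', '4 - 33.33%']
```

The `:.2%` format specifier multiplies by 100 and appends `%`, so no manual rounding helper is needed.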

CodePudding user response:

Let us do two groupbys: the first calculates the normalized counts of id per name, and the second aggregates the normalized counts into a list.

df.groupby('name')['id'].value_counts(normalize=True).groupby(level=0).agg(list)

name
A                                  [0.5, 0.5]
B    [0.6666666666666666, 0.3333333333333333]
C                                       [1.0]
Name: id, dtype: object
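This drops the id labels and keeps only the fractions. If the ids should survive the aggregation, a sketch of the same chain collecting (id, fraction) pairs instead (sample data rebuilt for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "A", "B", "B", "B", "C"],
                   "id": [1, 2, 5, 5, 4, 6]})

# The MultiIndex levels are (name, id); dropping level 0 inside the
# second pass leaves each group indexed by id, so items() pairs id with %.
pct = df.groupby("name")["id"].value_counts(normalize=True)
out = pct.groupby(level=0).apply(lambda s: list(s.droplevel(0).items()))
print(out)
# name = B gives [(5, 0.666...), (4, 0.333...)]
```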

CodePudding user response:

I was able to leverage Counter to come up with this method:

from collections import Counter

id_all = df.groupby("name")["id"].agg(list)

def roundToTwo(value, total):
    # convert a raw count into a percentage rounded to 2 decimal places
    return round(value / total * 100, 2)


def get_percentage(item: list) -> dict:
    # map each id in the list to its % share of occurrences
    c = Counter(item)
    total = sum(c.values())
    return {key: roundToTwo(value, total) for key, value in c.items()}


id_percentage = id_all.apply(get_percentage)

print(id_percentage.head())

Output:

name
A      {1: 50.0, 2: 50.0}
B    {5: 66.67, 4: 33.33}
C              {6: 100.0}
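Counter runs a Python-level loop over every group, which can get slow at the stated 5-million-row scale. A sketch of a fully vectorized alternative, using synthetic data (numpy assumed available) and a single value_counts pass that is then split per name:

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the large dataframe
rng = np.random.default_rng(0)
big = pd.DataFrame({"name": rng.integers(0, 1000, 500_000),
                    "id": rng.integers(0, 50, 500_000)})

# One vectorized pass over the whole frame instead of a Counter per group.
pct = (big.groupby("name")["id"]
          .value_counts(normalize=True)
          .mul(100)
          .round(2))

# Split the (name, id) MultiIndex into one {id: %} dict per name.
d = {name: grp.droplevel(0).to_dict() for name, grp in pct.groupby(level=0)}
```

The per-group dicts match the Counter output format, but the counting itself stays inside pandas.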