I have a dataframe like the one below
name id
A    1
A    2
B    5
B    5
B    4
C    6
I'd like to do the following:
a) Group by name
b) Create a list of unique id values for each name
c) Populate the list with the % of occurrence of each id within its group. For example, id=5 appears 2 times out of 3 records for name = B, so its percentage will be 66.66%
d) Sort the resulting list items based on %
I tried the below:
df.groupby('name')['id'].agg(list)   # get list of ids for each name
df.groupby('name')['id'].agg(set)    # get unique ids for each name
df.groupby('name')['id'].nunique()   # get number of unique ids for each name
df.groupby('name')['id'].count()     # get number of rows for each name
But none of these give the % of id values for each group.
I have a large dataset of 5 million rows, so an efficient solution would be helpful.
I expect my output to look like the below, with list items sorted based on %:
A [1 - 50%,2 - 50%]
B [5 - 66.66%,4 - 33%]
C [6 - 100%]
CodePudding user response:
Use a lambda function with Series.value_counts:
out = df.groupby('name')['id'].apply(lambda x: x.value_counts(normalize=True).tolist())
print (out)
name
A [0.5, 0.5]
B [0.6666666666666666, 0.3333333333333333]
C [1.0]
Name: id, dtype: object
d = {i: x.value_counts(normalize=True).to_dict() for i, x in df.groupby('name')['id']}
print (d)
{'A': {1: 0.5, 2: 0.5}, 'B': {5: 0.6666666666666666, 4: 0.3333333333333333}, 'C': {6: 1.0}}
s = pd.Series(d)
print (s)
A {1: 0.5, 2: 0.5}
B {5: 0.6666666666666666, 4: 0.3333333333333333}
C {6: 1.0}
dtype: object
s1 = df.groupby('name')['id'].apply(lambda x: x.value_counts(normalize=True).to_dict()).dropna()
print (s1)
name
A 1 0.500000
2 0.500000
B 5 0.666667
4 0.333333
C 6 1.000000
Name: id, dtype: float64
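If the exact format from the question is needed (each id paired with its percentage, sorted descending), the same value_counts idea can be extended. A minimal sketch; the string formatting is my own choice, not from the question:
out = df.groupby('name')['id'].apply(
    lambda x: [f'{i} - {p * 100:.2f}%'
               for i, p in x.value_counts(normalize=True).items()])
print (out)
value_counts already sorts by frequency in descending order, so for the sample data this gives B -> [5 - 66.67%, 4 - 33.33%] with no extra sort step.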
CodePudding user response:
Let us do two groupbys: the first to calculate the normalized counts of id per name, and the second to aggregate the normalized counts into a list.
df.groupby('name')['id'].value_counts(normalize=True).groupby(level=0).agg(list)
name
A [0.5, 0.5]
B [0.6666666666666666, 0.3333333333333333]
C [1.0]
Name: id, dtype: object
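If the id labels should stay attached to their percentages in the question's "id - %" format, the same double groupby can format each group from its MultiIndex. A rough sketch, assuming the id index level name survives value_counts:
pct = df.groupby('name')['id'].value_counts(normalize=True).mul(100).round(2)
out = pct.groupby(level=0).apply(
    lambda s: [f'{i} - {v}%' for i, v in zip(s.index.get_level_values('id'), s)])
Because value_counts sorts within each name by frequency, the resulting lists are already ordered by percentage.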
CodePudding user response:
I was able to leverage Counter to come up with this method:
from collections import Counter

id_all = df.groupby("name")["id"].agg(list)

def roundToTwo(value, total):
    # convert a count to a percentage of the total, rounded to 2 decimal places
    value = value / total
    value *= 100
    return round(value, 2)

def get_percentage(item: list) -> dict:
    c = Counter(item)
    total = sum(c.values())
    percent = {key: roundToTwo(value, total) for key, value in c.items()}
    return percent

id_percentage = id_all.apply(lambda item: get_percentage(item))
print(id_percentage.head())
print(id_percentage.head())
Output:
name
A {1: 50.0, 2: 50.0}
B {5: 66.67, 4: 33.33}
C {6: 100.0}
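Counter (and the dict built from it) keeps insertion order rather than sorting by count, so a small follow-up step can sort each dict by percentage and produce the list format the question asks for. A sketch reusing id_percentage from the code above:
formatted = id_percentage.apply(
    lambda d: [f'{k} - {v}%'
               for k, v in sorted(d.items(), key=lambda kv: kv[1], reverse=True)])
print(formatted.head())
For name B this gives [5 - 66.67%, 4 - 33.33%].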