Say I have a dataframe
import pandas as pd
df = pd.DataFrame({"id":[1,1,1,1,2,2,2],
"car":["Volvo","Audi","Volvo","Volvo","VW","Audi","Volvo"]})
and I want to get the count-ratio for each car
within each id
aswell as the count i.e the resulting dataframe would be
ratio count
car
id
1 Audi 0.25 1
Volvo 0.75 3
2 Audi 0.33 1
Volvo 0.33 1
VW 0.33 1
I have tried
def get_tag_ratio(x):
"""
Calculate the ratio for each car
"""
ratio = x.value_counts(normalize=True)
return ratio
car_info = df.groupby(["id"])["car"].agg(
ratio=get_tag_ratio, count="count")
but the result is
ratio count
id
1 [0.75,0.25] 4
2 [0.33,0.33,0.33] 3
I have also tried
car_info = df.groupby(["id","car"])["car"].agg(
ratio=get_tag_ratio, count="count")
which gives
ratio count
id car
1 Audi 1.0 1
Volvo 1.0 3
2 Audi 1.0 1
Volvo 1.0 1
VW 1.0 1
which is almost there - unless the value_counts
now is applied within each car
group and not each id
group.
Just parsing "value_counts"
doens't work (also, I don't know how to get "normalize=True" as an argument to value_counts
here)
df.groupby("id")["car"].agg(["value_counts","count"]) #`ValueError: Must pass non-zero number of levels/codes`
Is there a way to just to it in the .agg
call, or do I have to join another dataframe with the value_counts? I have quite a lot of id
and car
so the groupby
operations a rather timeconsuming, thus I prefer only having to do one "call" of groupby.agg
CodePudding user response:
Use Series.value_counts
without parameter and with parameter normalize
with DataFrame
cosntructor:
def get_tag_ratio(x):
"""
Calculate the ratio for each car
"""
ratio = pd.DataFrame({'ratio': x.value_counts(normalize=True),
'count': x.value_counts()})
return ratio
car_info = df.groupby(["id"])["car"].apply(get_tag_ratio)
print (car_info)
ratio count
id
1 Volvo 0.750000 3
Audi 0.250000 1
2 Audi 0.333333 1
VW 0.333333 1
Volvo 0.333333 1
Or join 2 MultiIndex Series
:
g = df.groupby("id")["car"]
car_info = pd.concat([g.value_counts(normalize=True),
g.value_counts()], axis=1, keys=['ratio','count'])
print (car_info)
ratio count
id car
1 Volvo 0.750000 3
Audi 0.250000 1
2 Audi 0.333333 1
VW 0.333333 1
Volvo 0.333333 1
CodePudding user response:
One option with a double groupby.value_counts
:
g = df.groupby('id')['car']
out = (g
.value_counts().to_frame(name='count')
.assign(ratio=g.value_counts(normalize=True))
)
Alternative, using a manual calculation:
s = df.groupby(['id', 'car']).size()
out = pd.concat({'count': s,
'ratio': s.div(s.groupby(level='id').transform('sum'))},
axis=1)
output:
count ratio
id car
1 Volvo 3 0.750000
Audi 1 0.250000
2 Audi 1 0.333333
VW 1 0.333333
Volvo 1 0.333333