Suppose my dataframe looks as follows:
Build year | Brand |
---|---|
2010 | Mercedes |
2010 | Mercedes |
2010 | BMW |
2010 | Kia |
2011 | Toyota |
2011 | Mercedes |
2011 | Mercedes |
2012 | Tesla |
I want to find all unique combinations of build year and brand, then count them and calculate the percentage for each brand per year. Currently I have this:
df.groupby(["Build year", "Brand"]).count()
Is there an easy way to convert this to percentage per year? The desired output is:
Build year | Brand | Count | Percentage of annual count |
---|---|---|---|
2010 | Mercedes | 2 | 0.5 |
2010 | BMW | 1 | 0.25 |
2010 | Kia | 1 | 0.25 |
2011 | Toyota | 1 | 0.33 |
2011 | Mercedes | 2 | 0.67 |
2012 | Tesla | 1 | 1 |
CodePudding user response:
Instead of grouping by both Build year and Brand, you only need to group by Build year and then use .value_counts:
import pandas as pd
df = pd.read_clipboard() # Your df here
groups = df.groupby("Build year")
count = groups.value_counts()
percentage = groups.value_counts(normalize=True)
out = pd.concat([count, percentage], axis=1, keys=["Count", "Percentage of annual count"])
Output:
                     Count  Percentage of annual count
Build year Brand
2010       Mercedes      2                    0.500000
           BMW           1                    0.250000
           Kia           1                    0.250000
2011       Mercedes      2                    0.666667
           Toyota        1                    0.333333
2012       Tesla         1                    1.000000
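For a reproducible variant that does not depend on the clipboard, here is a minimal sketch that builds the question's data by hand and flattens the result into the four-column layout asked for. It assumes a reasonably recent pandas (1.4 or later, where `DataFrameGroupBy.value_counts` is available); the name `flat` is only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Build year": [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2012],
    "Brand": ["Mercedes", "Mercedes", "BMW", "Kia",
              "Toyota", "Mercedes", "Mercedes", "Tesla"],
})

groups = df.groupby("Build year")
count = groups.value_counts()                     # rows per (year, brand)
percentage = groups.value_counts(normalize=True)  # share within each year

out = pd.concat([count, percentage], axis=1,
                keys=["Count", "Percentage of annual count"])

# Turn the (Build year, Brand) index back into ordinary columns
flat = out.reset_index()
print(flat)
```

The `reset_index()` call is optional; it only matters if you want Build year and Brand as regular columns rather than index levels.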
CodePudding user response:
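You can calculate the percentage of annual count by dividing each count by its year's total, which a groupby transform provides, see below:
# sort=False keeps the groups in order of first appearance
grouped_df = df.groupby(["Build year", "Brand"], sort=False)
counts = grouped_df.size().reset_index(name='Count')
# Total number of cars per build year, repeated on every row of that year
yearly_total = counts.groupby('Build year')['Count'].transform('sum')
counts['Percentage of annual count'] = (100 * counts['Count'] / yearly_total).round(2)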
Example of full code
import pandas as pd
data = {'Build year': [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2012],
'Brand': ['Mercedes', 'Mercedes', 'BMW', 'Kia', 'Toyota', 'Mercedes', 'Mercedes', 'Tesla']}
df = pd.DataFrame(data)
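# sort=False keeps the groups in order of first appearance
grouped_df = df.groupby(["Build year", "Brand"], sort=False)
counts = grouped_df.size().reset_index(name='Count')
# Total number of cars per build year, repeated on every row of that year
yearly_total = counts.groupby('Build year')['Count'].transform('sum')
counts['Percentage of annual count'] = (100 * counts['Count'] / yearly_total).round(2)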
print(counts)
Output
Build year Brand Count Percentage of annual count
0 2010 Mercedes 2 50.00
1 2010 BMW 1 25.00
2 2010 Kia 1 25.00
3 2011 Toyota 1 33.33
4 2011 Mercedes 2 66.67
5 2012 Tesla 1 100.00
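The step that matters for a per-year percentage is the denominator: each row needs the total of its own year. As a small illustration (a sketch only, starting from a hand-written `counts` table rather than from the original dataframe), `transform("sum")` returns one value per row instead of one value per group:

```python
import pandas as pd

# Counts per (year, brand), written out by hand for illustration
counts = pd.DataFrame({
    "Build year": [2010, 2010, 2010, 2011, 2011, 2012],
    "Brand": ["Mercedes", "BMW", "Kia", "Toyota", "Mercedes", "Tesla"],
    "Count": [2, 1, 1, 1, 2, 1],
})

# One total per row: the sum of Count over that row's build year
yearly_total = counts.groupby("Build year")["Count"].transform("sum")
print(yearly_total.tolist())  # [4, 4, 4, 3, 3, 1]

# Dividing row by row gives the share of each brand within its year
counts["Percentage of annual count"] = (100 * counts["Count"] / yearly_total).round(2)
print(counts)
```

Because the result of `transform` has the same index as `counts`, the division lines up row by row without any merge.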
Hope it helps
CodePudding user response:
A method-chaining approach (assuming column names without spaces) could look like this:
(
    df.groupby(["build_year", "brand"])
    .agg(count=("build_year", "count"))
    .assign(
        Percentage_of_annual_count=lambda x: x["count"]
        / x.groupby("build_year")["count"].transform("sum")
    )
)
Full solution:
import pandas as pd

df = pd.DataFrame({
    "build_year": [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2012],
    "brand": ["Mercedes", "Mercedes", "BMW", "Kia", "Toyota", "Mercedes", "Mercedes", "Tesla"],
})
(
    df.groupby(["build_year", "brand"])
    # Number of rows per (year, brand) combination
    .agg(count=("build_year", "count"))
    # Divide each count by the total count of its build year
    .assign(
        Percentage_of_annual_count=lambda x: x["count"]
        / x.groupby("build_year")["count"].transform("sum")
    )
)
Output:
                     count  Percentage_of_annual_count
build_year brand
2010       BMW           1                    0.250000
           Kia           1                    0.250000
           Mercedes      2                    0.500000
2011       Mercedes      2                    0.666667
           Toyota        1                    0.333333
2012       Tesla         1                    1.000000
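A quick way to convince yourself the result is right is to check that the shares within each year add up to 1. The following is only a sketch of such a check; the names `result` and `totals` are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "build_year": [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2012],
    "brand": ["Mercedes", "Mercedes", "BMW", "Kia",
              "Toyota", "Mercedes", "Mercedes", "Tesla"],
})

result = (
    df.groupby(["build_year", "brand"])
    .agg(count=("build_year", "count"))
    .assign(
        Percentage_of_annual_count=lambda x: x["count"]
        / x.groupby("build_year")["count"].transform("sum")
    )
)

# Sum the shares per year; every year should come out as 1 (up to float error)
totals = result.groupby("build_year")["Percentage_of_annual_count"].sum()
assert ((totals - 1).abs() < 1e-9).all()
print(totals)
```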