Home > Mobile >  How to find percentages of different classes while grouping by year using Groupby function in Pandas
How to find percentages of different classes while grouping by year using Groupby function in Pandas

Time:10-08

So I have a used Audi car database from Kaggle.

Here is what my code looks like to import the dataset:

### headers ###
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### path name ###
data_file_pathname = "etc.."

### import ###
my_data_frame = pd.read_csv(data_file_pathname, sep=",", header=0)

### preview of dataset ###
print(my_data_frame.head())

  model  year  price transmission  mileage fuelType  tax   mpg  engineSize
0    A1  2017  12500       Manual    15735   Petrol  150  55.4         1.4
1    A6  2016  16500    Automatic    36203   Diesel   20  64.2         2.0
2    A1  2016  11000       Manual    29946   Petrol   30  55.4         1.4
3    A4  2017  16800    Automatic    25952   Diesel  145  67.3         2.0
4    A3  2019  17300       Manual     1998   Petrol  145  49.6         1.0

The years range from 1997 to 2021 and I have 3 fuelTypes (Petrol, Diesel and Hybrid). What I want is dataset with fuelType percentages per year.

Example:

 year  fuel    percentage    
 2003  Petrol  20.00
       Diesel  65.00
       Hybrid  15.00

So far I've managed to get the number of cars, per fuelType, per year and I'm not even sure of how I did it so... Here is the code for that:

Fuel_Year = my_data_frame.groupby(['year', 'fuelType'])
df = pd.concat([Fuel_Year.fuelType.count()], axis=1, keys="Counts")
print(df)
                  C
year fuelType      
1997 Petrol       1
1998 Petrol       1
2002 Petrol       2
2003 Diesel       4
     Petrol       2
2004 Diesel       1
     Petrol       4
2005 Diesel       2
     Petrol       5
2006 Diesel       3
     Petrol       6
2007 Diesel       7
     Petrol       9

I've tried different things and landed on the transform('sum') function on many web pages I checked, but this hasn't been a success for me. My ultimate goal with this is to create some sort of stacked histogram plot with the years on the x-axis and percentages on the y-axis to show evolution of fuelType per year.

Could anybody please help me find the proper code to generate a column for fuelType percentages per year?

Thank you very much! :)

CodePudding user response:

df.groupby(['year'])['fuelType'].value_counts(normalize=True) * 100
    
year  fuelType
2016  Diesel       50.0
      Petrol       50.0
2017  Diesel       50.0
      Petrol       50.0
2019  Petrol      100.0
Name: fuelType, dtype: float64

CodePudding user response:

You could leverage the value_counts() function of pandas dataframe for this purpose. It will be handy for you in many situations.

round((df.groupby(['year'])['fuelType'].value_counts()/df.groupby('year')['fuelType'].count()) * 100,2)

year  fuelType
1997  Petrol      100.00
1998  Petrol      100.00
2002  Petrol      100.00
2003  Diesel       66.67
      Petrol       33.33
2004  Petrol       80.00
      Diesel       20.00
2005  Petrol       71.43
      Diesel       28.57
2006  Petrol       66.67
      Diesel       33.33

Please let me know in case of any queries. Cheers!

  • Related