So I have a used Audi car database from Kaggle.
Here is what my code looks like to import the dataset:
### headers ###
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
### path name ###
data_file_pathname = "etc.."
### import ###
my_data_frame = pd.read_csv(data_file_pathname, sep=",", header=0)
### preview of dataset ###
print(my_data_frame.head())
model year price transmission mileage fuelType tax mpg engineSize
0 A1 2017 12500 Manual 15735 Petrol 150 55.4 1.4
1 A6 2016 16500 Automatic 36203 Diesel 20 64.2 2.0
2 A1 2016 11000 Manual 29946 Petrol 30 55.4 1.4
3 A4 2017 16800 Automatic 25952 Diesel 145 67.3 2.0
4 A3 2019 17300 Manual 1998 Petrol 145 49.6 1.0
The years range from 1997 to 2021 and I have 3 fuelTypes (Petrol, Diesel and Hybrid). What I want is dataset with fuelType percentages per year.
Example:
year fuel percentage
2003 Petrol 20.00
Diesel 65.00
Hybrid 15.00
So far I've managed to get the number of cars, per fuelType, per year and I'm not even sure of how I did it so... Here is the code for that:
Fuel_Year = my_data_frame.groupby(['year', 'fuelType'])
df = pd.concat([Fuel_Year.fuelType.count()], axis=1, keys="Counts")
print(df)
C
year fuelType
1997 Petrol 1
1998 Petrol 1
2002 Petrol 2
2003 Diesel 4
Petrol 2
2004 Diesel 1
Petrol 4
2005 Diesel 2
Petrol 5
2006 Diesel 3
Petrol 6
2007 Diesel 7
Petrol 9
I've tried different things and landed on the transform('sum') function on many web pages I checked, but this hasn't been a success for me. My ultimate goal with this is to create some sort of stacked histogram plot with the years on the x-axis and percentages on the y-axis to show evolution of fuelType per year.
Could anybody please help me find the proper code to generate a column for fuelType percentages per year?
Thank you very much! :)
CodePudding user response:
df.groupby(['year'])['fuelType'].value_counts(normalize=True) * 100
year fuelType
2016 Diesel 50.0
Petrol 50.0
2017 Diesel 50.0
Petrol 50.0
2019 Petrol 100.0
Name: fuelType, dtype: float64
CodePudding user response:
You could leverage the value_counts() function of pandas dataframe for this purpose. It will be handy for you in many situations.
round((df.groupby(['year'])['fuelType'].value_counts()/df.groupby('year')['fuelType'].count()) * 100,2)
year fuelType
1997 Petrol 100.00
1998 Petrol 100.00
2002 Petrol 100.00
2003 Diesel 66.67
Petrol 33.33
2004 Petrol 80.00
Diesel 20.00
2005 Petrol 71.43
Diesel 28.57
2006 Petrol 66.67
Diesel 33.33
Please let me know in case of any queries. Cheers!