I am trying to make a summary statistic table from a dataframe and am having a really tough time trying to wrap my head around how to get the columns grouped correctly.
I have the following dataframe:
Year Value_1 Value 2
------------------------------------
0 2012 43 34
1 2012 45 24
2 2012 35 44
3 2012 32 44
4 2012 35 31
5 2012 65 51
...
76 2015 34 22
77 2015 74 34
78 2015 63 76
79 2015 54 45
80 2015 25 76
...
I want to create a table that shows the mean, median, min, and max, for each Value
and for each Year. And so I want the output table to first be grouped by the Value
and then by the Year
, showing the mean, median, min, and max for each Value
category for each year.
I tried simply doing:
df.groupby('Year').describe()
and this groupy by year as I want, but it is not grouped by the Value
categories, and so repeats the mean, median, min, and max columns twice. I don't want the "mean", "median", "min", and "max" columns to be repeated. How can this be worked out in python? I am having trouble conceptualizing how this grouping should be organized.
I am looking to produce this output dataframe:
Value Year Mean Median Min Max
---------------------------------------------------
Value_1 2012 XX XX XX XX
2013 XX XX XX XX
2014 XX XX XX XX
2015 XX XX XX XX
Value_2 2012 XX XX XX XX
2013 XX XX XX XX
2014 XX XX XX XX
2015 XX XX XX XX
CodePudding user response:
IIUC you want it to look like this:
EDIT: Have another (shorter) attempt:
(1)
df = pd.DataFrame(
{'Year': [2012, 2012, 2012, 2012, 2013, 2013, 2015, 2015, 2015, 2016, 2016],
'Value_1': [43, 45, 35, 32, 35, 65, 34, 74, 63, 54, 25],
'Value_2': [34, 24, 44, 44, 31, 51, 22, 34, 76, 45, 76]})
res = (df
.melt('Year')
.groupby(['variable', 'Year'])
.describe()
.droplevel(0,axis=1))
print(res)
count mean std min 25% 50% 75% max
variable Year
Value_1 2012 4.0 38.75 6.238322 32.0 34.25 39.0 43.50 45.0
2013 2.0 50.00 21.213203 35.0 42.50 50.0 57.50 65.0
2015 3.0 57.00 20.663978 34.0 48.50 63.0 68.50 74.0
2016 2.0 39.50 20.506097 25.0 32.25 39.5 46.75 54.0
Value_2 2012 4.0 36.50 9.574271 24.0 31.50 39.0 44.00 44.0
2013 2.0 41.00 14.142136 31.0 36.00 41.0 46.00 51.0
2015 3.0 44.00 28.354894 22.0 28.00 34.0 55.00 76.0
2016 2.0 60.50 21.920310 45.0 52.75 60.5 68.25 76.0
(2)
df = pd.DataFrame(
{'Year': [2012, 2012, 2012, 2012, 2013, 2013, 2015, 2015, 2015, 2016, 2016],
'Value_1': [43, 45, 35, 32, 35, 65, 34, 74, 63, 54, 25],
'Value_2': [34, 24, 44, 44, 31, 51, 22, 34, 76, 45, 76]})
res = (df
.groupby('Year')
.describe()
.unstack()
.unstack(level=1)
.reset_index()
.rename(columns={'level_0' : 'value'})
.set_index(['value', 'Year']))
print(res)
count mean std min 25% 50% 75% max
value Year
Value_1 2012 4.0 38.75 6.238322 32.0 34.25 39.0 43.50 45.0
2013 2.0 50.00 21.213203 35.0 42.50 50.0 57.50 65.0
2015 3.0 57.00 20.663978 34.0 48.50 63.0 68.50 74.0
2016 2.0 39.50 20.506097 25.0 32.25 39.5 46.75 54.0
Value_2 2012 4.0 36.50 9.574271 24.0 31.50 39.0 44.00 44.0
2013 2.0 41.00 14.142136 31.0 36.00 41.0 46.00 51.0
2015 3.0 44.00 28.354894 22.0 28.00 34.0 55.00 76.0
2016 2.0 60.50 21.920310 45.0 52.75 60.5 68.25 76.0