Home > front end >  How to group pandas dataframe by column headers and run summary statitics? [python]
How to group pandas dataframe by column headers and run summary statitics? [python]

Time:05-26

I am trying to make a summary statistic table from a dataframe and am having a really tough time trying to wrap my head around how to get the columns grouped correctly.

I have the following dataframe:

       Year     Value_1     Value 2
------------------------------------
0      2012          43          34
1      2012          45          24
2      2012          35          44
3      2012          32          44
4      2012          35          31
5      2012          65          51
...
76     2015          34          22
77     2015          74          34 
78     2015          63          76 
79     2015          54          45 
80     2015          25          76
...

I want to create a table that shows the mean, median, min, and max, for each Value and for each Year. And so I want the output table to first be grouped by the Value and then by the Year, showing the mean, median, min, and max for each Value category for each year.

I tried simply doing:

df.groupby('Year').describe()

and this groupy by year as I want, but it is not grouped by the Value categories, and so repeats the mean, median, min, and max columns twice. I don't want the "mean", "median", "min", and "max" columns to be repeated. How can this be worked out in python? I am having trouble conceptualizing how this grouping should be organized.

I am looking to produce this output dataframe:

  Value      Year     Mean    Median    Min    Max
---------------------------------------------------
Value_1      2012       XX        XX     XX     XX
             2013       XX        XX     XX     XX 
             2014       XX        XX     XX     XX
             2015       XX        XX     XX     XX
Value_2      2012       XX        XX     XX     XX
             2013       XX        XX     XX     XX
             2014       XX        XX     XX     XX
             2015       XX        XX     XX     XX

CodePudding user response:

IIUC you want it to look like this:

EDIT: Have another (shorter) attempt:

(1)

df = pd.DataFrame(
    {'Year': [2012, 2012, 2012, 2012, 2013, 2013, 2015, 2015, 2015, 2016, 2016],
     'Value_1': [43, 45, 35, 32, 35, 65, 34, 74, 63, 54, 25],
     'Value_2': [34, 24, 44, 44, 31, 51, 22, 34, 76, 45, 76]})

res = (df
       .melt('Year')
       .groupby(['variable', 'Year'])
       .describe()
       .droplevel(0,axis=1))

print(res)

               count   mean        std   min    25%   50%    75%   max
variable Year                                                         
Value_1  2012    4.0  38.75   6.238322  32.0  34.25  39.0  43.50  45.0
         2013    2.0  50.00  21.213203  35.0  42.50  50.0  57.50  65.0
         2015    3.0  57.00  20.663978  34.0  48.50  63.0  68.50  74.0
         2016    2.0  39.50  20.506097  25.0  32.25  39.5  46.75  54.0
Value_2  2012    4.0  36.50   9.574271  24.0  31.50  39.0  44.00  44.0
         2013    2.0  41.00  14.142136  31.0  36.00  41.0  46.00  51.0
         2015    3.0  44.00  28.354894  22.0  28.00  34.0  55.00  76.0
         2016    2.0  60.50  21.920310  45.0  52.75  60.5  68.25  76.0

(2)

df = pd.DataFrame(
    {'Year': [2012, 2012, 2012, 2012, 2013, 2013, 2015, 2015, 2015, 2016, 2016],
     'Value_1': [43, 45, 35, 32, 35, 65, 34, 74, 63, 54, 25],
     'Value_2': [34, 24, 44, 44, 31, 51, 22, 34, 76, 45, 76]})

res = (df
       .groupby('Year')
       .describe()
       .unstack()
       .unstack(level=1)
       .reset_index()
       .rename(columns={'level_0' : 'value'})
       .set_index(['value', 'Year']))

print(res)

              count   mean        std   min    25%   50%    75%   max
value   Year                                                         
Value_1 2012    4.0  38.75   6.238322  32.0  34.25  39.0  43.50  45.0
        2013    2.0  50.00  21.213203  35.0  42.50  50.0  57.50  65.0
        2015    3.0  57.00  20.663978  34.0  48.50  63.0  68.50  74.0
        2016    2.0  39.50  20.506097  25.0  32.25  39.5  46.75  54.0
Value_2 2012    4.0  36.50   9.574271  24.0  31.50  39.0  44.00  44.0
        2013    2.0  41.00  14.142136  31.0  36.00  41.0  46.00  51.0
        2015    3.0  44.00  28.354894  22.0  28.00  34.0  55.00  76.0
        2016    2.0  60.50  21.920310  45.0  52.75  60.5  68.25  76.0

  • Related