Home > database >  How to return an 'untidy' dataframe summary of multiple standard statistics arranged by ro
How to return an 'untidy' dataframe summary of multiple standard statistics arranged by ro

Time:03-12

With Pandas in Python there is the describe() function that returns the summary statistics for a dataframe. The output is not in a 'tidy' format for simple manipulation with the tidyverse summarise function but it is in a nice format for presentation. My question is how to reproduce this output in R?

import pandas as pd
mtcars_df = pd.read_csv(filepath_or_buffer="data/mtcars.csv")

mtcars_df.describe()
'''
             mpg        cyl        disp  ...         am       gear     carb
count  32.000000  32.000000   32.000000  ...  32.000000  32.000000  32.0000
mean   20.090625   6.187500  230.721875  ...   0.406250   3.687500   2.8125
std     6.026948   1.785922  123.938694  ...   0.498991   0.737804   1.6152
min    10.400000   4.000000   71.100000  ...   0.000000   3.000000   1.0000
25%    15.425000   4.000000  120.825000  ...   0.000000   3.000000   2.0000
50%    19.200000   6.000000  196.300000  ...   0.000000   4.000000   2.0000
75%    22.800000   8.000000  326.000000  ...   1.000000   4.000000   4.0000
max    33.900000   8.000000  472.000000  ...   1.000000   5.000000   8.0000
'''

To reproduce this output in R I have used the base R summary function. Unfortunately the output duplicated the stat labels on every column. To remove the labels I wrangled the table into a dataframe and stripped out the labels with regex! Far more effort than I was expecting. If there is a cleaner, easier way in R I would love to know.

library(tidyverse)
library(rebus)
#> 
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#> 
#>     regex
#> The following object is masked from 'package:ggplot2':
#> 
#>     alpha
    
stats_table <- summary(mtcars)
stats_table
#>       mpg             cyl             disp             hp       
#>  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
#>  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
#>  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
#>  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
#>  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
#>  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
#>       drat             wt             qsec             vs        
#>  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
#>  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
#>  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
#>  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
#>  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
#>  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
#>        am              gear            carb      
#>  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
#>  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
#>  Median :0.0000   Median :4.000   Median :2.000  
#>  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
#>  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
#>  Max.   :1.0000   Max.   :5.000   Max.   :8.000

pattern <- one_or_more(DGT) %R% optional(".") %R% optional(one_or_more(DGT))

get_labels <- as.data.frame.matrix(stats_table)[,1]
location <- str_locate_all(pattern =':', get_labels)[[1]][1]
strip_punct <- zero_or_more(PUNCT) %R% zero_or_more(SPACE) %R% PUNCT

identity <- str_remove_all(str_sub(string = get_labels, start = 1, end = location), strip_punct)

stats_df <- as.data.frame.matrix(stats_table) %>%
    mutate(across(everything(), ~str_match(., pattern))) %>%
    mutate(identity = identity) %>%
    relocate(identity)

stats_df
#>     identity      mpg      cyl      disp       hp      drat       wt      qsec
#> X        Min     10.4      4.0      71.1     52.0       2.7      1.5      14.5
#> X.1   1st Qu        1        1         1        1         1        1         1
#> X.2   Median     19.2      6.0     196.3    123.0       3.6      3.3      17.7
#> X.3     Mean     20.0      6.1     230.7    146.7       3.5      3.2      17.8
#> X.4   3rd Qu        3        3         3        3         3        3         3
#> X.5      Max     33.9      8.0     472.0    335.0       4.9      5.4      22.9
#>           vs       am      gear      carb
#> X        0.0      0.0       3.0       1.0
#> X.1        1        1         1         1
#> X.2      0.0      0.0       4.0       2.0
#> X.3      0.4      0.4       3.6       2.8
#> X.4        3        3         3         3
#> X.5      1.0      1.0       5.0       8.0

I can generate the same values with tidyverse and the summarise function but everything is on one row rather than being arranged with the stats for each column summarised by row. This makes it rather difficult to read and present.

mtcars %>%
    summarise_all( .funs = list(
                         min = min,
                         mean = ~ mean(., na.rm=TRUE),
                         median = median,
                         stdev = sd,
                         percentile_25 = ~ quantile(., .25)[[1]],
                         percentile_75 = ~ quantile(., .75)[[1]],
                         max = max)
                     ) %>% glimpse()
#> Rows: 1
#> Columns: 77
#> $ mpg_min            <dbl> 10.4
#> $ cyl_min            <dbl> 4
#> $ disp_min           <dbl> 71.1
#> $ hp_min             <dbl> 52
#> $ drat_min           <dbl> 2.76
#> $ wt_min             <dbl> 1.513
#> $ qsec_min           <dbl> 14.5
#> $ vs_min             <dbl> 0
#> $ am_min             <dbl> 0
#> $ gear_min           <dbl> 3
#> $ carb_min           <dbl> 1
#> $ mpg_mean           <dbl> 20.09062
#> $ cyl_mean           <dbl> 6.1875
#> $ disp_mean          <dbl> 230.7219
#> $ hp_mean            <dbl> 146.6875
#> $ drat_mean          <dbl> 3.596563
#> $ wt_mean            <dbl> 3.21725
#> $ qsec_mean          <dbl> 17.84875
#> $ vs_mean            <dbl> 0.4375
#> $ am_mean            <dbl> 0.40625
#> $ gear_mean          <dbl> 3.6875
#> $ carb_mean          <dbl> 2.8125
#> $ mpg_median         <dbl> 19.2
#> $ cyl_median         <dbl> 6
#> $ disp_median        <dbl> 196.3
#> $ hp_median          <dbl> 123
#> $ drat_median        <dbl> 3.695
#> $ wt_median          <dbl> 3.325
#> $ qsec_median        <dbl> 17.71
#> $ vs_median          <dbl> 0
#> $ am_median          <dbl> 0
#> $ gear_median        <dbl> 4
#> $ carb_median        <dbl> 2
#> $ mpg_stdev          <dbl> 6.026948
#> $ cyl_stdev          <dbl> 1.785922
#> $ disp_stdev         <dbl> 123.9387
#> $ hp_stdev           <dbl> 68.56287
#> $ drat_stdev         <dbl> 0.5346787
#> $ wt_stdev           <dbl> 0.9784574
#> $ qsec_stdev         <dbl> 1.786943
#> $ vs_stdev           <dbl> 0.5040161
#> $ am_stdev           <dbl> 0.4989909
#> $ gear_stdev         <dbl> 0.7378041
#> $ carb_stdev         <dbl> 1.6152
#> $ mpg_percentile_25  <dbl> 15.425
#> $ cyl_percentile_25  <dbl> 4
#> $ disp_percentile_25 <dbl> 120.825
#> $ hp_percentile_25   <dbl> 96.5
#> $ drat_percentile_25 <dbl> 3.08
#> $ wt_percentile_25   <dbl> 2.58125
#> $ qsec_percentile_25 <dbl> 16.8925
#> $ vs_percentile_25   <dbl> 0
#> $ am_percentile_25   <dbl> 0
#> $ gear_percentile_25 <dbl> 3
#> $ carb_percentile_25 <dbl> 2
#> $ mpg_percentile_75  <dbl> 22.8
#> $ cyl_percentile_75  <dbl> 8
#> $ disp_percentile_75 <dbl> 326
#> $ hp_percentile_75   <dbl> 180
#> $ drat_percentile_75 <dbl> 3.92
#> $ wt_percentile_75   <dbl> 3.61
#> $ qsec_percentile_75 <dbl> 18.9
#> $ vs_percentile_75   <dbl> 1
#> $ am_percentile_75   <dbl> 1
#> $ gear_percentile_75 <dbl> 4
#> $ carb_percentile_75 <dbl> 4
#> $ mpg_max            <dbl> 33.9
#> $ cyl_max            <dbl> 8
#> $ disp_max           <dbl> 472
#> $ hp_max             <dbl> 335
#> $ drat_max           <dbl> 4.93
#> $ wt_max             <dbl> 5.424
#> $ qsec_max           <dbl> 22.9
#> $ vs_max             <dbl> 1
#> $ am_max             <dbl> 1
#> $ gear_max           <dbl> 5
#> $ carb_max           <dbl> 8

Created on 2022-03-12 by the reprex package (v2.0.1)

CodePudding user response:

You can combine do.call() with rind() and lapply() to get a tidy format of summary(). t() transpose the output.

t(do.call(rbind, lapply(mtcars, summary)))

#>              mpg    cyl     disp       hp     drat      wt     qsec     vs
#> Min.    10.40000 4.0000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000
#> 1st Qu. 15.42500 4.0000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000
#> Median  19.20000 6.0000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000
#> Mean    20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375
#> 3rd Qu. 22.80000 8.0000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000
#> Max.    33.90000 8.0000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000
#>              am   gear   carb
#> Min.    0.00000 3.0000 1.0000
#> 1st Qu. 0.00000 3.0000 2.0000
#> Median  0.00000 4.0000 2.0000
#> Mean    0.40625 3.6875 2.8125
#> 3rd Qu. 1.00000 4.0000 4.0000
#> Max.    1.00000 5.0000 8.0000

Created on 2022-03-12 by the reprex package (v2.0.1)

CodePudding user response:

Another consideration could be the psych package and its describe function.

t(psych::describe(mtcars))
  • Related