With Pandas in Python there is the describe() function that returns the summary statistics for a dataframe. The output is not in a 'tidy' format for simple manipulation with the tidyverse summarise function but it is in a nice format for presentation. My question is how to reproduce this output in R?
import pandas as pd
mtcars_df = pd.read_csv(filepath_or_buffer="data/mtcars.csv")
mtcars_df.describe()
'''
mpg cyl disp ... am gear carb
count 32.000000 32.000000 32.000000 ... 32.000000 32.000000 32.0000
mean 20.090625 6.187500 230.721875 ... 0.406250 3.687500 2.8125
std 6.026948 1.785922 123.938694 ... 0.498991 0.737804 1.6152
min 10.400000 4.000000 71.100000 ... 0.000000 3.000000 1.0000
25% 15.425000 4.000000 120.825000 ... 0.000000 3.000000 2.0000
50% 19.200000 6.000000 196.300000 ... 0.000000 4.000000 2.0000
75% 22.800000 8.000000 326.000000 ... 1.000000 4.000000 4.0000
max 33.900000 8.000000 472.000000 ... 1.000000 5.000000 8.0000
'''
To reproduce this output in R I have used the base R summary function. Unfortunately the output duplicated the stat labels on every column. To remove the labels I wrangled the table into a dataframe and stripped out the labels with regex! Far more effort than I was expecting. If there is a cleaner, easier way in R I would love to know.
library(tidyverse)
library(rebus)
#>
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#>
#> regex
#> The following object is masked from 'package:ggplot2':
#>
#> alpha
stats_table <- summary(mtcars)
stats_table
#> mpg cyl disp hp
#> Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
#> 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
#> Median :19.20 Median :6.000 Median :196.3 Median :123.0
#> Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
#> 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
#> Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
#> drat wt qsec vs
#> Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
#> 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
#> Median :3.695 Median :3.325 Median :17.71 Median :0.0000
#> Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
#> 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
#> Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
#> am gear carb
#> Min. :0.0000 Min. :3.000 Min. :1.000
#> 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
#> Median :0.0000 Median :4.000 Median :2.000
#> Mean :0.4062 Mean :3.688 Mean :2.812
#> 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
#> Max. :1.0000 Max. :5.000 Max. :8.000
pattern <- one_or_more(DGT) %R% optional(".") %R% optional(one_or_more(DGT))
get_labels <- as.data.frame.matrix(stats_table)[,1]
location <- str_locate_all(pattern =':', get_labels)[[1]][1]
strip_punct <- zero_or_more(PUNCT) %R% zero_or_more(SPACE) %R% PUNCT
identity <- str_remove_all(str_sub(string = get_labels, start = 1, end = location), strip_punct)
stats_df <- as.data.frame.matrix(stats_table) %>%
mutate(across(everything(), ~str_match(., pattern))) %>%
mutate(identity = identity) %>%
relocate(identity)
stats_df
#> identity mpg cyl disp hp drat wt qsec
#> X Min 10.4 4.0 71.1 52.0 2.7 1.5 14.5
#> X.1 1st Qu 1 1 1 1 1 1 1
#> X.2 Median 19.2 6.0 196.3 123.0 3.6 3.3 17.7
#> X.3 Mean 20.0 6.1 230.7 146.7 3.5 3.2 17.8
#> X.4 3rd Qu 3 3 3 3 3 3 3
#> X.5 Max 33.9 8.0 472.0 335.0 4.9 5.4 22.9
#> vs am gear carb
#> X 0.0 0.0 3.0 1.0
#> X.1 1 1 1 1
#> X.2 0.0 0.0 4.0 2.0
#> X.3 0.4 0.4 3.6 2.8
#> X.4 3 3 3 3
#> X.5 1.0 1.0 5.0 8.0
I can generate the same values with tidyverse and the summarise function but everything is on one row rather than being arranged with the stats for each column summarised by row. This makes it rather difficult to read and present.
mtcars %>%
summarise_all( .funs = list(
min = min,
mean = ~ mean(., na.rm=TRUE),
median = median,
stdev = sd,
percentile_25 = ~ quantile(., .25)[[1]],
percentile_75 = ~ quantile(., .75)[[1]],
max = max)
) %>% glimpse()
#> Rows: 1
#> Columns: 77
#> $ mpg_min <dbl> 10.4
#> $ cyl_min <dbl> 4
#> $ disp_min <dbl> 71.1
#> $ hp_min <dbl> 52
#> $ drat_min <dbl> 2.76
#> $ wt_min <dbl> 1.513
#> $ qsec_min <dbl> 14.5
#> $ vs_min <dbl> 0
#> $ am_min <dbl> 0
#> $ gear_min <dbl> 3
#> $ carb_min <dbl> 1
#> $ mpg_mean <dbl> 20.09062
#> $ cyl_mean <dbl> 6.1875
#> $ disp_mean <dbl> 230.7219
#> $ hp_mean <dbl> 146.6875
#> $ drat_mean <dbl> 3.596563
#> $ wt_mean <dbl> 3.21725
#> $ qsec_mean <dbl> 17.84875
#> $ vs_mean <dbl> 0.4375
#> $ am_mean <dbl> 0.40625
#> $ gear_mean <dbl> 3.6875
#> $ carb_mean <dbl> 2.8125
#> $ mpg_median <dbl> 19.2
#> $ cyl_median <dbl> 6
#> $ disp_median <dbl> 196.3
#> $ hp_median <dbl> 123
#> $ drat_median <dbl> 3.695
#> $ wt_median <dbl> 3.325
#> $ qsec_median <dbl> 17.71
#> $ vs_median <dbl> 0
#> $ am_median <dbl> 0
#> $ gear_median <dbl> 4
#> $ carb_median <dbl> 2
#> $ mpg_stdev <dbl> 6.026948
#> $ cyl_stdev <dbl> 1.785922
#> $ disp_stdev <dbl> 123.9387
#> $ hp_stdev <dbl> 68.56287
#> $ drat_stdev <dbl> 0.5346787
#> $ wt_stdev <dbl> 0.9784574
#> $ qsec_stdev <dbl> 1.786943
#> $ vs_stdev <dbl> 0.5040161
#> $ am_stdev <dbl> 0.4989909
#> $ gear_stdev <dbl> 0.7378041
#> $ carb_stdev <dbl> 1.6152
#> $ mpg_percentile_25 <dbl> 15.425
#> $ cyl_percentile_25 <dbl> 4
#> $ disp_percentile_25 <dbl> 120.825
#> $ hp_percentile_25 <dbl> 96.5
#> $ drat_percentile_25 <dbl> 3.08
#> $ wt_percentile_25 <dbl> 2.58125
#> $ qsec_percentile_25 <dbl> 16.8925
#> $ vs_percentile_25 <dbl> 0
#> $ am_percentile_25 <dbl> 0
#> $ gear_percentile_25 <dbl> 3
#> $ carb_percentile_25 <dbl> 2
#> $ mpg_percentile_75 <dbl> 22.8
#> $ cyl_percentile_75 <dbl> 8
#> $ disp_percentile_75 <dbl> 326
#> $ hp_percentile_75 <dbl> 180
#> $ drat_percentile_75 <dbl> 3.92
#> $ wt_percentile_75 <dbl> 3.61
#> $ qsec_percentile_75 <dbl> 18.9
#> $ vs_percentile_75 <dbl> 1
#> $ am_percentile_75 <dbl> 1
#> $ gear_percentile_75 <dbl> 4
#> $ carb_percentile_75 <dbl> 4
#> $ mpg_max <dbl> 33.9
#> $ cyl_max <dbl> 8
#> $ disp_max <dbl> 472
#> $ hp_max <dbl> 335
#> $ drat_max <dbl> 4.93
#> $ wt_max <dbl> 5.424
#> $ qsec_max <dbl> 22.9
#> $ vs_max <dbl> 1
#> $ am_max <dbl> 1
#> $ gear_max <dbl> 5
#> $ carb_max <dbl> 8
Created on 2022-03-12 by the reprex package (v2.0.1)
CodePudding user response:
You can combine do.call()
with rind()
and lapply()
to get a tidy format of summary()
. t()
transpose the output.
t(do.call(rbind, lapply(mtcars, summary)))
#> mpg cyl disp hp drat wt qsec vs
#> Min. 10.40000 4.0000 71.1000 52.0000 2.760000 1.51300 14.50000 0.0000
#> 1st Qu. 15.42500 4.0000 120.8250 96.5000 3.080000 2.58125 16.89250 0.0000
#> Median 19.20000 6.0000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000
#> Mean 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375
#> 3rd Qu. 22.80000 8.0000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000
#> Max. 33.90000 8.0000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000
#> am gear carb
#> Min. 0.00000 3.0000 1.0000
#> 1st Qu. 0.00000 3.0000 2.0000
#> Median 0.00000 4.0000 2.0000
#> Mean 0.40625 3.6875 2.8125
#> 3rd Qu. 1.00000 4.0000 4.0000
#> Max. 1.00000 5.0000 8.0000
Created on 2022-03-12 by the reprex package (v2.0.1)
CodePudding user response:
Another consideration could be the psych
package and its describe
function.
t(psych::describe(mtcars))