Home > front end >  Converting Summary() output into data.table
Converting Summary() output into data.table

Time:12-01

Lets call summary(my_data):

      year         quarter         employed         newhires       separations      jobscreated     jobsdestroyed  
 Min.   :1990   Min.   :1.000   Min.   :  6976   Min.   :  2321   Min.   :  1922   Min.   :  1091   Min.   :  520  
 1st Qu.:2000   1st Qu.:2.000   1st Qu.: 28049   1st Qu.: 16858   1st Qu.: 13912   1st Qu.:  6595   1st Qu.: 3862  
 Median :2003   Median :3.000   Median : 64836   Median : 39188   Median : 32018   Median : 14148   Median : 7727  
 Mean   :2003   Mean   :2.509   Mean   : 94468   Mean   : 59336   Mean   : 48973   Mean   : 22036   Mean   :11843  
 3rd Qu.:2007   3rd Qu.:4.000   3rd Qu.:121905   3rd Qu.: 75960   3rd Qu.: 61976   3rd Qu.: 26829   3rd Qu.:14993  
 Max.   :2010   Max.   :4.000   Max.   :571419   Max.   :448423   Max.   :391454   Max.   :166022   Max.   :80338  
                                                 NA's   :49                        NA's   :49       NA's   :49     

I want to convert this output into a data.table formatted as follows, where all entries (omitted in this depiction) are the raw values of min, 1st quartile. etc. :

            year         quarter         employed         newhires       separations      jobscreated     jobsdestroyed  
 Min.      
 1st Qu.   
 Median    
 Mean     
 3rd Qu.   
 Max.            
 NA's          

The following almost achieves this result, except for the fact that Min. , 1st Qu. , Median , Mean , 3rd Qu. , Max. , and NA's carry over into each entry. I want purely the raw numbers.

data.frame(unclass(summary(my_data)), check.names = FALSE, stringsAsFactors = FALSE)
             year         quarter         employed         newhires      separations      jobscreated   jobsdestroyed
X   Min.   :1990   Min.   :1.000   Min.   :  6976   Min.   :  2321   Min.   :  1922   Min.   :  1091   Min.   :  520  
X.1 1st Qu.:2000   1st Qu.:2.000   1st Qu.: 28049   1st Qu.: 16858   1st Qu.: 13912   1st Qu.:  6595   1st Qu.: 3862  
X.2 Median :2003   Median :3.000   Median : 64836   Median : 39188   Median : 32018   Median : 14148   Median : 7727  
X.3 Mean   :2003   Mean   :2.509   Mean   : 94468   Mean   : 59336   Mean   : 48973   Mean   : 22036   Mean   :11843  
X.4 3rd Qu.:2007   3rd Qu.:4.000   3rd Qu.:121905   3rd Qu.: 75960   3rd Qu.: 61976   3rd Qu.: 26829   3rd Qu.:14993  
X.5 Max.   :2010   Max.   :4.000   Max.   :571419   Max.   :448423   Max.   :391454   Max.   :166022   Max.   :80338  
X.6           <NA>            <NA>             <NA>     NA's   :49               <NA>     NA's   :49      NA's   :49  

Potential solutions include (1) deriving the table directly from summary(), or (2) using the output above and finding a way to remove Min. , 1st Qu. , Median , Mean , 3rd Qu. , Max. , and NA labels from reach entry and instead list them as column names. Your help is much appreicated!

CodePudding user response:

An adaptation of the dupe-link code, since that does not work as cleanly with incomplete (NA/NaN) data (e.g., newhires):

mtcars[2,2] <- NA
mtcars[3,2] <- NaN

as.data.frame(sapply(mtcars, summary))
# Error in dimnames(x) <- dnx : 'dimnames' applied to non-array

This fails because summary hard-codes (bleh) table(..., useNA="ifany"), which means that some columns might return length 6, some length 7, which defeats most casual attempts to conform into a data.frame-like structure.

One way around this (short of rewriting summary from scratch to fix that bug) is to add one NA to all vectors and subtract it from the result; that forces all summaries to include that field, and once subtracted it should represent the data and be rectangular-enough for as.data.frame:

fixed_summary <- function(object, ...) {
  o <- summary(c(object, NA), ...)
  o["NA's"] <- o["NA's"] - 1L
  o
}

ret <- as.data.frame(sapply(mtcars, fixed_summary))
ret
#              mpg      cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
# Min.    10.40000 4.000000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
# 1st Qu. 15.42500 4.000000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
# Median  19.20000 6.000000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
# Mean    20.09062 6.266667 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
# 3rd Qu. 22.80000 8.000000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
# Max.    33.90000 8.000000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000
# NA's     0.00000 2.000000   0.0000   0.0000 0.000000 0.00000  0.00000 0.0000 0.00000 0.0000 0.0000

And, per your concern, the "Min." (etc) labels do not carry-over for each column: they are row-names only.

Many R tools do not guarantee preserving row-names; in fact some go out of their way to wipe them. My preference is to not rely on row-names, instead bringing them in as an explicit column. This is mostly subjective, partially defensive programming, and certainly not a requirement.

ret$rownames <- rownames(ret)
rownames(ret) <- NULL
ret
#        mpg      cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb rownames
# 1 10.40000 4.000000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000     Min.
# 2 15.42500 4.000000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000  1st Qu.
# 3 19.20000 6.000000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000   Median
# 4 20.09062 6.266667 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125     Mean
# 5 22.80000 8.000000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000  3rd Qu.
# 6 33.90000 8.000000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000     Max.
# 7  0.00000 2.000000   0.0000   0.0000 0.000000 0.00000  0.00000 0.0000 0.00000 0.0000 0.0000     NA's

(The order of columns is completely malleable.)

CodePudding user response:

# adjust summary(.) 
# returns summary of numeric (including factor) columns of a data frame 
# stats_along='row', put summary stats on the rows and variables along the columns
my_summ <- function(df, stats_along='row') {

  df_nonchar = df[, !sapply(df, typeof) %in% "character"]
  summ = data.frame(summary(df_nonchar), row.names = NULL)

  # test for empty columns:
  # # usually the 1st column is empty as a result of coercing an obj of
  # class(summary obj) "table" to data.frame.
  empty = sapply(summ, function(x) all(x == ""))
  summ = summ[, !empty]
  summ = setNames(summ, c("var_name", "stats"))
  summ = summ[which(!is.na(summ$stats)), ]

  # just in case if there are multiple :'s, we need to split only at the first match
  summ$stats = sub(":", "-;-", summ$stats)
  summ = data.frame(summ[1], do.call(rbind, strsplit(summ$stats, "-;-")))
  names(summ)[-1] = c("stats", "value")
  summ$var_name = trimws(summ$var_name) # rm white spaces

  # pivot into wide form, using 'stats' column as a key.
  stats_along = match.arg(stats_along, c('row', 'col'))
  if (stats_along == 'row') {
    idvar = "stats"
    timevar = "var_name"
  } else if (stats_along == 'col') {
    idvar = "var_name"
    timevar = "stats"
  }

  summ = reshape(
    summ,
    direction = "wide",
    idvar = idvar,
    timevar = timevar,
    v.names = "value",
    sep = "_"

  )

  var_nms = sub("(value_)(. )", "\\2", names(summ)[-1])
  names(summ)[-1] = var_nms
  rownames(summ) = NULL


  # remove white spaces from cells
  summ[] = lapply(summ, function(x) gsub("\\s $", "", x))

  # when vars in the dataset contain NAs, we may have two additional columns in
  # summary call
  nas = "NA's" %in% colnames(summ)
  if (any(nas)) {
    names(summ)[names(summ) == "NA's"] = "missing"
  }
  summ
}
my_summ(mtcars)

    stats   mpg   cyl  disp    hp  drat    wt  qsec     vs     am  gear  carb
1    Min. 10.40 4.000  71.1  52.0 2.760 1.513 14.50 0.0000 0.0000 3.000 1.000
2 1st Qu. 15.43 4.000 120.8  96.5 3.080 2.581 16.89 0.0000 0.0000 3.000 2.000
3  Median 19.20 6.000 196.3 123.0 3.695 3.325 17.71 0.0000 0.0000 4.000 2.000
4    Mean 20.09 6.188 230.7 146.7 3.597 3.217 17.85 0.4375 0.4062 3.688 2.812
5 3rd Qu. 22.80 8.000 326.0 180.0 3.920 3.610 18.90 1.0000 1.0000 4.000 4.000
6    Max. 33.90 8.000 472.0 335.0 4.930 5.424 22.90 1.0000 1.0000 5.000 8.000

If vars along the row and stats along the column axes are prefered, then,

my_summ(mtcars, 'col')

   var_name Min.    1st Qu. Median  Mean    3rd Qu. Max.   
1       mpg   10.40   15.43   19.20   20.09   22.80   33.90
2       cyl   4.000   4.000   6.000   6.188   8.000   8.000
3      disp    71.1   120.8   196.3   230.7   326.0   472.0
4        hp    52.0    96.5   123.0   146.7   180.0   335.0
5      drat   2.760   3.080   3.695   3.597   3.920   4.930
6        wt   1.513   2.581   3.325   3.217   3.610   5.424
7      qsec   14.50   16.89   17.71   17.85   18.90   22.90
8        vs  0.0000  0.0000  0.0000  0.4375  1.0000  1.0000
9        am  0.0000  0.0000  0.0000  0.4062  1.0000  1.0000
10     gear   3.000   3.000   4.000   3.688   4.000   5.000
11     carb   1.000   2.000   2.000   2.812   4.000   8.000


  • Note: my_summ(.) |> as.data.table() if you need a data table instead.

CodePudding user response:

Another option is to create your own summary function that adds a sixth element if it is needed:

 ownSummary = function(x) {
    x = summary(x)
    if(length(x)==6) x[7] = 0
    x
  } 

An then you can run this.

data.table(sapply(mtcars, ownSummary), keep.rownames = T)

        rn      mpg      cyl     disp       hp     drat      wt     qsec     vs      am   gear   carb
1:    Min. 10.40000 4.000000  71.1000  52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
2: 1st Qu. 15.42500 4.000000 120.8250  96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
3:  Median 19.20000 6.000000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
4:    Mean 20.09062 6.266667 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
5: 3rd Qu. 22.80000 8.000000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
6:    Max. 33.90000 8.000000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000
7:          0.00000 2.000000   0.0000   0.0000 0.000000 0.00000  0.00000 0.0000 0.00000 0.0000 0.0000
  • Related