Lets call summary(my_data)
:
year quarter employed newhires separations jobscreated jobsdestroyed
Min. :1990 Min. :1.000 Min. : 6976 Min. : 2321 Min. : 1922 Min. : 1091 Min. : 520
1st Qu.:2000 1st Qu.:2.000 1st Qu.: 28049 1st Qu.: 16858 1st Qu.: 13912 1st Qu.: 6595 1st Qu.: 3862
Median :2003 Median :3.000 Median : 64836 Median : 39188 Median : 32018 Median : 14148 Median : 7727
Mean :2003 Mean :2.509 Mean : 94468 Mean : 59336 Mean : 48973 Mean : 22036 Mean :11843
3rd Qu.:2007 3rd Qu.:4.000 3rd Qu.:121905 3rd Qu.: 75960 3rd Qu.: 61976 3rd Qu.: 26829 3rd Qu.:14993
Max. :2010 Max. :4.000 Max. :571419 Max. :448423 Max. :391454 Max. :166022 Max. :80338
NA's :49 NA's :49 NA's :49
I want to convert this output into a data.table formatted as follows, where all entries (omitted in this depiction) are the raw values of min, 1st quartile. etc. :
year quarter employed newhires separations jobscreated jobsdestroyed
Min.
1st Qu.
Median
Mean
3rd Qu.
Max.
NA's
The following almost achieves this result, except for the fact that Min. , 1st Qu. , Median , Mean , 3rd Qu. , Max. , and NA's carry over into each entry. I want purely the raw numbers.
data.frame(unclass(summary(my_data)), check.names = FALSE, stringsAsFactors = FALSE)
year quarter employed newhires separations jobscreated jobsdestroyed
X Min. :1990 Min. :1.000 Min. : 6976 Min. : 2321 Min. : 1922 Min. : 1091 Min. : 520
X.1 1st Qu.:2000 1st Qu.:2.000 1st Qu.: 28049 1st Qu.: 16858 1st Qu.: 13912 1st Qu.: 6595 1st Qu.: 3862
X.2 Median :2003 Median :3.000 Median : 64836 Median : 39188 Median : 32018 Median : 14148 Median : 7727
X.3 Mean :2003 Mean :2.509 Mean : 94468 Mean : 59336 Mean : 48973 Mean : 22036 Mean :11843
X.4 3rd Qu.:2007 3rd Qu.:4.000 3rd Qu.:121905 3rd Qu.: 75960 3rd Qu.: 61976 3rd Qu.: 26829 3rd Qu.:14993
X.5 Max. :2010 Max. :4.000 Max. :571419 Max. :448423 Max. :391454 Max. :166022 Max. :80338
X.6 <NA> <NA> <NA> NA's :49 <NA> NA's :49 NA's :49
Potential solutions include (1) deriving the table directly from summary(), or (2) using the output above and finding a way to remove Min. , 1st Qu. , Median , Mean , 3rd Qu. , Max. , and NA labels from reach entry and instead list them as column names. Your help is much appreicated!
CodePudding user response:
An adaptation of the dupe-link code, since that does not work as cleanly with incomplete (NA
/NaN
) data (e.g., newhires
):
mtcars[2,2] <- NA
mtcars[3,2] <- NaN
as.data.frame(sapply(mtcars, summary))
# Error in dimnames(x) <- dnx : 'dimnames' applied to non-array
This fails because summary
hard-codes (bleh) table(..., useNA="ifany")
, which means that some columns might return length 6, some length 7, which defeats most casual attempts to conform into a data.frame
-like structure.
One way around this (short of rewriting summary
from scratch to fix that bug) is to add one NA
to all vectors and subtract it from the result; that forces all summaries to include that field, and once subtracted it should represent the data and be rectangular-enough for as.data.frame
:
fixed_summary <- function(object, ...) {
o <- summary(c(object, NA), ...)
o["NA's"] <- o["NA's"] - 1L
o
}
ret <- as.data.frame(sapply(mtcars, fixed_summary))
ret
# mpg cyl disp hp drat wt qsec vs am gear carb
# Min. 10.40000 4.000000 71.1000 52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
# 1st Qu. 15.42500 4.000000 120.8250 96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
# Median 19.20000 6.000000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
# Mean 20.09062 6.266667 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
# 3rd Qu. 22.80000 8.000000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
# Max. 33.90000 8.000000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000
# NA's 0.00000 2.000000 0.0000 0.0000 0.000000 0.00000 0.00000 0.0000 0.00000 0.0000 0.0000
And, per your concern, the "Min."
(etc) labels do not carry-over for each column: they are row-names only.
Many R tools do not guarantee preserving row-names; in fact some go out of their way to wipe them. My preference is to not rely on row-names, instead bringing them in as an explicit column. This is mostly subjective, partially defensive programming, and certainly not a requirement.
ret$rownames <- rownames(ret)
rownames(ret) <- NULL
ret
# mpg cyl disp hp drat wt qsec vs am gear carb rownames
# 1 10.40000 4.000000 71.1000 52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000 Min.
# 2 15.42500 4.000000 120.8250 96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000 1st Qu.
# 3 19.20000 6.000000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000 Median
# 4 20.09062 6.266667 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125 Mean
# 5 22.80000 8.000000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000 3rd Qu.
# 6 33.90000 8.000000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000 Max.
# 7 0.00000 2.000000 0.0000 0.0000 0.000000 0.00000 0.00000 0.0000 0.00000 0.0000 0.0000 NA's
(The order of columns is completely malleable.)
CodePudding user response:
# adjust summary(.)
# returns summary of numeric (including factor) columns of a data frame
# stats_along='row', put summary stats on the rows and variables along the columns
my_summ <- function(df, stats_along='row') {
df_nonchar = df[, !sapply(df, typeof) %in% "character"]
summ = data.frame(summary(df_nonchar), row.names = NULL)
# test for empty columns:
# # usually the 1st column is empty as a result of coercing an obj of
# class(summary obj) "table" to data.frame.
empty = sapply(summ, function(x) all(x == ""))
summ = summ[, !empty]
summ = setNames(summ, c("var_name", "stats"))
summ = summ[which(!is.na(summ$stats)), ]
# just in case if there are multiple :'s, we need to split only at the first match
summ$stats = sub(":", "-;-", summ$stats)
summ = data.frame(summ[1], do.call(rbind, strsplit(summ$stats, "-;-")))
names(summ)[-1] = c("stats", "value")
summ$var_name = trimws(summ$var_name) # rm white spaces
# pivot into wide form, using 'stats' column as a key.
stats_along = match.arg(stats_along, c('row', 'col'))
if (stats_along == 'row') {
idvar = "stats"
timevar = "var_name"
} else if (stats_along == 'col') {
idvar = "var_name"
timevar = "stats"
}
summ = reshape(
summ,
direction = "wide",
idvar = idvar,
timevar = timevar,
v.names = "value",
sep = "_"
)
var_nms = sub("(value_)(. )", "\\2", names(summ)[-1])
names(summ)[-1] = var_nms
rownames(summ) = NULL
# remove white spaces from cells
summ[] = lapply(summ, function(x) gsub("\\s $", "", x))
# when vars in the dataset contain NAs, we may have two additional columns in
# summary call
nas = "NA's" %in% colnames(summ)
if (any(nas)) {
names(summ)[names(summ) == "NA's"] = "missing"
}
summ
}
my_summ(mtcars)
stats mpg cyl disp hp drat wt qsec vs am gear carb
1 Min. 10.40 4.000 71.1 52.0 2.760 1.513 14.50 0.0000 0.0000 3.000 1.000
2 1st Qu. 15.43 4.000 120.8 96.5 3.080 2.581 16.89 0.0000 0.0000 3.000 2.000
3 Median 19.20 6.000 196.3 123.0 3.695 3.325 17.71 0.0000 0.0000 4.000 2.000
4 Mean 20.09 6.188 230.7 146.7 3.597 3.217 17.85 0.4375 0.4062 3.688 2.812
5 3rd Qu. 22.80 8.000 326.0 180.0 3.920 3.610 18.90 1.0000 1.0000 4.000 4.000
6 Max. 33.90 8.000 472.0 335.0 4.930 5.424 22.90 1.0000 1.0000 5.000 8.000
If vars
along the row
and stats
along the column
axes are prefered, then,
my_summ(mtcars, 'col')
var_name Min. 1st Qu. Median Mean 3rd Qu. Max.
1 mpg 10.40 15.43 19.20 20.09 22.80 33.90
2 cyl 4.000 4.000 6.000 6.188 8.000 8.000
3 disp 71.1 120.8 196.3 230.7 326.0 472.0
4 hp 52.0 96.5 123.0 146.7 180.0 335.0
5 drat 2.760 3.080 3.695 3.597 3.920 4.930
6 wt 1.513 2.581 3.325 3.217 3.610 5.424
7 qsec 14.50 16.89 17.71 17.85 18.90 22.90
8 vs 0.0000 0.0000 0.0000 0.4375 1.0000 1.0000
9 am 0.0000 0.0000 0.0000 0.4062 1.0000 1.0000
10 gear 3.000 3.000 4.000 3.688 4.000 5.000
11 carb 1.000 2.000 2.000 2.812 4.000 8.000
- Note:
my_summ(.) |> as.data.table()
if you need a data table instead.
CodePudding user response:
Another option is to create your own summary function that adds a sixth element if it is needed:
ownSummary = function(x) {
x = summary(x)
if(length(x)==6) x[7] = 0
x
}
An then you can run this.
data.table(sapply(mtcars, ownSummary), keep.rownames = T)
rn mpg cyl disp hp drat wt qsec vs am gear carb
1: Min. 10.40000 4.000000 71.1000 52.0000 2.760000 1.51300 14.50000 0.0000 0.00000 3.0000 1.0000
2: 1st Qu. 15.42500 4.000000 120.8250 96.5000 3.080000 2.58125 16.89250 0.0000 0.00000 3.0000 2.0000
3: Median 19.20000 6.000000 196.3000 123.0000 3.695000 3.32500 17.71000 0.0000 0.00000 4.0000 2.0000
4: Mean 20.09062 6.266667 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
5: 3rd Qu. 22.80000 8.000000 326.0000 180.0000 3.920000 3.61000 18.90000 1.0000 1.00000 4.0000 4.0000
6: Max. 33.90000 8.000000 472.0000 335.0000 4.930000 5.42400 22.90000 1.0000 1.00000 5.0000 8.0000
7: 0.00000 2.000000 0.0000 0.0000 0.000000 0.00000 0.00000 0.0000 0.00000 0.0000 0.0000