I have a data.table with over ten thousand of rows and it looks like this:
DT1 <- data.table(ID = 1:10,
result_2010 = c("TRUE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE"),
result_2011 = c("FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE"),
years = c(15, 16.5, 31, 1, 40.2, 0.3, 12, 22.7, 19, 12))
ID result_2010 result_2011 years
1: 1 TRUE FALSE 15.0
2: 2 FALSE TRUE 16.5
3: 3 TRUE FALSE 31.0
4: 4 FALSE FALSE 1.0
5: 5 FALSE FALSE 40.2
6: 6 TRUE FALSE 0.3
7: 7 FALSE TRUE 12.0
8: 8 FALSE FALSE 22.7
9: 9 TRUE FALSE 19.0
10: 10 FALSE TRUE 12.0
For "result_2010" and "result_2011" I want to make a percentile analysis of the "years" but only if the value in for the individual is "TRUE". The code that I tried seems to work, but it gives back the same results for "result_2010" and "result_2011", which is for sure incorrect:
DT1 %>%
group_by(result_2010 == "TRUE") %>%
summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
"25.quantile"= round(quantile(years,c(.25)),digits=1),
"Median"= round(quantile(years,c(.50)),digits=1),
"75.quantile"= round(quantile(years,c(.75)),digits=1),
"90.quantile"= round(quantile(years,c(.90)),digits=1),
"Mean" = round(mean(years),digits=1))
DT1 %>%
group_by(result_2011 == "TRUE") %>%
summarise("10.quantile"= round(quantile(years,c(.10)),digits=1),
"25.quantile"= round(quantile(years,c(.25)),digits=1),
"Median"= round(quantile(years,c(.50)),digits=1),
"75.quantile"= round(quantile(years,c(.75)),digits=1),
"90.quantile"= round(quantile(years,c(.90)),digits=1),
"Mean" = round(mean(years),digits=1))
Could anyone help how to correct my code?
CodePudding user response:
Using melt
and aggregate
.
library(data.table)
melt(DT1, c(1, 4), 2:3) |>
transform(variable=substring(variable, 8)) |>
subset(value == TRUE) |>
with(aggregate(list(q=years), list(year=variable), \(x)
c(quantile(x), mean=mean(x))))
# year q.0% q.25% q.50% q.75% q.100% q.mean
# 1 2010 0.300 11.325 17.000 22.000 31.000 16.325
# 2 2011 12.000 12.000 12.000 14.250 16.500 13.500
Note: Please use R>=4.1 for the |>
pipes.
CodePudding user response:
You may write-up a function and run it on every result
column.
library(tidyverse)
cols <- grep('result_', names(DT1), value = TRUE)
get_stats_fun <- function(DT, col) {
DT %>%
filter(.data[[col]] == "TRUE") %>%
summarise("quantile" = list(round(quantile(years,c(.10,.25,.50,.75,.90)),1)),
"median" = round(median(years), 1),
"Mean" = round(mean(years),1)) %>%
unnest_wider(quantile)
}
map_df(cols, ~get_stats_fun(DT1, .x), .id = 'Year') %>%
mutate(Year = cols)
# Year `10%` `25%` `50%` `75%` `90%` median Mean
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 result_2010 4.7 11.3 17 22 27.4 17 16.3
#2 result_2011 12 12 12 14.2 15.6 12 13.5