I have a data frame with encounter ID and laboratory results of each encounter as follows:
library(tidyverse)
#Columns with variables
encounter_id <- c(1,1,1, 2,2,2, 3,3,3, 4,4,4)
wbc <- c(13,NA,13, NA,NA,14, 15,9,15, 11,10,12)
hb <- c(13,NA,12, NA,13,14, 15,NA,15, 11,8,12)
temp <- c(100,NA,97, 103,NA,104, 100,99,NA, 100,101,101)
hr <- c(133,NA,132, NA,NA,104, 155,160,NA, 60,NA,70)
within_24_hours <- c(1,0,1, 1,1,0, 1,1,1, 0,0,1)
#create a tibble from above mentioned variables
df <- tibble(encounter_id, wbc, hb, temp, hr, within_24_hours)
I want to find the mean, median, max, and min of each variable in each encounter; for example, I want to find out the mean number of the "WBC" column in encounters 1, 2, 3, and so on for each variable, IF they were within 24 hours (wihtin_24_hours variable ==1)
My first thought process is to use pivot wider, drop the variables that are not within 24 hours, then aggregate results as follows:
df_wider <- df %>% pivot_wider(
id_cols = encounter_id,
names_from = within_24_hours,
values_from = c(2:5)
)
but I could not, as I keep getting a list for each variable instead of the number.
encounter_id wbc_1 wbc_0 hb_1 hb_0 temp_1 temp_0 hr_1
hr_0
<dbl> <list> <list> <list> <list> <list> <list> <list>
<list>
1 1 <dbl [2]> <dbl [1]> <dbl [2]> <dbl [1]> <dbl [2]> <dbl [1]> <dbl [2]>
<dbl [~
2 2 <dbl [2]> <dbl [1]> <dbl [2]> <dbl [1]> <dbl [2]> <dbl [1]> <dbl [2]>
<dbl [~
3 3 <dbl [3]> <NULL> <dbl [3]> <NULL> <dbl [3]> <NULL> <dbl [3]>
<NULL>
4 4 <dbl [1]> <dbl [2]> <dbl [1]> <dbl [2]> <dbl [1]> <dbl [2]> <dbl [1]>
<dbl [~
Any ideas on how to proceed?
CodePudding user response:
You don't need to pivot your data.
Below is my solution that's similar to the previous answer.
library(tidyverse)
df %>%
filter(within_24_hours == 1) %>% # Get only encounters within 24 hours
select(-within_24_hours) %>% # Drop the within 24 hours column since it won't be aggregated
mutate(across(wbc:hr, as.numeric)) %>% # ensure the variables to aggregate are all numeric
group_by(encounter_id) %>%
summarise(
across(everything(),
list(mean=~mean(.x, na.rm = TRUE),
median=~median(.x, na.rm = TRUE),
max=~max(.x, na.rm = TRUE),
min=~min(.x, na.rm = TRUE)),
.names = "{.col}_{.fn}"
)
)
Result:
# A tibble: 4 x 17
encounter_id wbc_mean wbc_median wbc_max wbc_min hb_mean hb_median hb_max hb_min temp_mean temp_median temp_max temp_min hr_mean hr_median hr_max
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 13 13 13 13 12.5 12.5 13 12 98.5 98.5 100 97 132. 132. 133
2 2 NaN NA -Inf Inf 13 13 13 13 103 103 103 103 NaN NA -Inf
3 3 13 15 15 9 15 15 15 15 99.5 99.5 100 99 158. 158. 160
4 4 12 12 12 12 12 12 12 12 101 101 101 101 70 70 70
CodePudding user response:
I think if you want to group by encounter_id
and variable, and filter by within_24_hours
, a better approach might be to pivot longer, not wider.
Note that using your example data some functions will return Inf or -Inf and a warning where na.rm = TRUE
is used, because for some groups all values are NA.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(encounter_id, within_24_hours)) %>%
filter(within_24_hours == 1) %>%
group_by(encounter_id, name) %>%
summarise(meanVal = mean(value, na.rm = TRUE),
medianVal = median(value, na.rm = TRUE),
maxVal = max(value, na.rm = TRUE),
minVal = min(value, na.rm = TRUE)) %>%
ungroup()
Result:
encounter_id name meanVal medianVal maxVal minVal
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 hb 12.5 12.5 13 12
2 1 hr 132. 132. 133 132
3 1 temp 98.5 98.5 100 97
4 1 wbc 13 13 13 13
5 2 hb 13 13 13 13
6 2 hr NaN NA -Inf Inf
7 2 temp 103 103 103 103
8 2 wbc NaN NA -Inf Inf
9 3 hb 15 15 15 15
10 3 hr 158. 158. 160 155
11 3 temp 99.5 99.5 100 99
12 3 wbc 13 15 15 9
13 4 hb 12 12 12 12
14 4 hr 70 70 70 70
15 4 temp 101 101 101 101
16 4 wbc 12 12 12 12