Home > Enterprise >  Pivot_wider giving me columns with lists in them
Pivot_wider giving me columns with lists in them

Time:07-15

I have a data frame with encounter ID and laboratory results of each encounter as follows:

library(tidyverse)

#Columns with variables 
encounter_id <- c(1,1,1, 2,2,2, 3,3,3, 4,4,4)
wbc <- c(13,NA,13, NA,NA,14, 15,9,15, 11,10,12)
hb <- c(13,NA,12, NA,13,14, 15,NA,15, 11,8,12)
temp <- c(100,NA,97, 103,NA,104, 100,99,NA, 100,101,101)
hr <- c(133,NA,132, NA,NA,104, 155,160,NA, 60,NA,70)
within_24_hours <- c(1,0,1, 1,1,0, 1,1,1, 0,0,1)

#create a tibble from above mentioned variables 
df <- tibble(encounter_id, wbc, hb, temp, hr, within_24_hours)

I want to find the mean, median, max, and min of each variable in each encounter; for example, I want to find out the mean number of the "WBC" column in encounters 1, 2, 3, and so on for each variable, IF they were within 24 hours (wihtin_24_hours variable ==1)

My first thought process is to use pivot wider, drop the variables that are not within 24 hours, then aggregate results as follows:

 df_wider <- df %>% pivot_wider(
   id_cols = encounter_id,
   names_from = within_24_hours,
   values_from = c(2:5)
 )

but I could not, as I keep getting a list for each variable instead of the number.

  encounter_id wbc_1     wbc_0     hb_1      hb_0      temp_1    temp_0    hr_1      
 hr_0   
         <dbl> <list>    <list>    <list>    <list>    <list>    <list>    <list>    
<list> 
1            1 <dbl [2]> <dbl [1]> <dbl [2]> <dbl [1]> <dbl [2]> <dbl [1]> <dbl [2]> 
<dbl [~
2            2 <dbl [2]> <dbl [1]> <dbl [2]> <dbl [1]> <dbl [2]> <dbl [1]> <dbl [2]> 
<dbl [~
3            3 <dbl [3]> <NULL>    <dbl [3]> <NULL>    <dbl [3]> <NULL>    <dbl [3]> 
<NULL> 
4            4 <dbl [1]> <dbl [2]> <dbl [1]> <dbl [2]> <dbl [1]> <dbl [2]> <dbl [1]> 
<dbl [~

Any ideas on how to proceed?

CodePudding user response:

You don't need to pivot your data.

Below is my solution that's similar to the previous answer.

library(tidyverse)
df %>% 
   filter(within_24_hours == 1) %>% # Get only encounters within 24 hours
   select(-within_24_hours) %>% # Drop the within 24 hours column since it won't be aggregated
   mutate(across(wbc:hr, as.numeric)) %>% # ensure the variables to aggregate are all numeric
  group_by(encounter_id) %>%
  summarise(
    
    across(everything(),
           list(mean=~mean(.x, na.rm = TRUE),
                median=~median(.x, na.rm = TRUE), 
                max=~max(.x, na.rm = TRUE),
                min=~min(.x, na.rm = TRUE)),
           .names = "{.col}_{.fn}"
           )
    
    )

Result:

# A tibble: 4 x 17
  encounter_id wbc_mean wbc_median wbc_max wbc_min hb_mean hb_median hb_max hb_min temp_mean temp_median temp_max temp_min hr_mean hr_median hr_max
         <dbl>    <dbl>      <dbl>   <dbl>   <dbl>   <dbl>     <dbl>  <dbl>  <dbl>     <dbl>       <dbl>    <dbl>    <dbl>   <dbl>     <dbl>  <dbl>
1            1       13         13      13      13    12.5      12.5     13     12      98.5        98.5      100       97    132.      132.    133
2            2      NaN         NA    -Inf     Inf    13        13       13     13     103         103        103      103    NaN        NA    -Inf
3            3       13         15      15       9    15        15       15     15      99.5        99.5      100       99    158.      158.    160
4            4       12         12      12      12    12        12       12     12     101         101        101      101     70        70      70

CodePudding user response:

I think if you want to group by encounter_id and variable, and filter by within_24_hours, a better approach might be to pivot longer, not wider.

Note that using your example data some functions will return Inf or -Inf and a warning where na.rm = TRUE is used, because for some groups all values are NA.

library(dplyr)
library(tidyr)

df %>% 
  pivot_longer(cols = -c(encounter_id, within_24_hours)) %>%
  filter(within_24_hours == 1) %>% 
  group_by(encounter_id, name) %>% 
  summarise(meanVal = mean(value, na.rm = TRUE), 
            medianVal = median(value, na.rm = TRUE), 
            maxVal = max(value, na.rm = TRUE), 
            minVal = min(value, na.rm = TRUE)) %>% 
ungroup()

Result:

   encounter_id name  meanVal medianVal maxVal minVal
          <dbl> <chr>   <dbl>     <dbl>  <dbl>  <dbl>
 1            1 hb       12.5      12.5     13     12
 2            1 hr      132.      132.     133    132
 3            1 temp     98.5      98.5    100     97
 4            1 wbc      13        13       13     13
 5            2 hb       13        13       13     13
 6            2 hr      NaN        NA     -Inf    Inf
 7            2 temp    103       103      103    103
 8            2 wbc     NaN        NA     -Inf    Inf
 9            3 hb       15        15       15     15
10            3 hr      158.      158.     160    155
11            3 temp     99.5      99.5    100     99
12            3 wbc      13        15       15      9
13            4 hb       12        12       12     12
14            4 hr       70        70       70     70
15            4 temp    101       101      101    101
16            4 wbc      12        12       12     12
  • Related