I have dataframe df of which a sample is shown below. I need to combine the rows by study_id, lab_study_dt, and lab_study_time, and have NA and non-NA values across labs and detection limits on the same row grouped by those three key variables.
I've tried group by and summarise_all to do this but didn't get the outcome I'm looking for.
df %>%
group_by(study_id,lab_study_dt,lab_study_time) %>%
summarise_all(funs(toString(na.omit(.))))
study_id | lab_study_dt | lab_study_time | lab_polys | lab_lymphs | lab_mono | lab_eos | lab_basos | lab_bands | lab_wbc_count | lab_rbc_count | protein_limit_of_detection | lab_protein | gluc_limit_of_detection | lab_glucose |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Jane | 8/13/2011 | 0:12 | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA |
Jane | 8/13/2011 | 0:12 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Jane | 3/4/2013 | 15:27 | NA | 60 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Jane | 3/4/2013 | 15:27 | NA | NA | NA | NA | NA | NA | NA | 10000 | NA | NA | NA | NA |
Jane | 3/4/2013 | 15:27 | 8 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Jane | 3/4/2013 | 15:27 | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA |
Jane | 3/4/2013 | 15:27 | NA | NA | NA | NA | NA | NA | 149 | NA | NA | NA | NA | NA |
Jane | 3/4/2013 | 15:27 | NA | NA | 31 | NA | NA | NA | NA | NA | 1 | 56 | NA | NA |
George | 4/20/2021 | 21:18 | NA | 60 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
George | 4/20/2021 | 21:18 | NA | NA | NA | NA | NA | NA | NA | 10000 | NA | NA | NA | NA |
George | 4/23/2021 | 15:27 | 8 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 | 3 |
George | 4/23/2021 | 12:27 | NA | 65 | NA | NA | NA | NA | NA | NA | 1 | >10 | NA | NA |
George | 4/23/2021 | 12:27 | NA | NA | NA | NA | NA | 1 | 149 | NA | NA | NA | NA | NA |
George | 4/23/2021 | 12:27 | NA | NA | 31 | NA | NA | NA | NA | 56 | NA | NA | NA | NA |
The end dataframe would have one row per unique study_id, study date, and study time together, with all associated values along that row. So for example, entry Jane - 3/4/2013 - 15:27 would look like below:
study_id | lab_study_dt | lab_study_time | lab_polys | lab_lymphs | lab_mono | lab_eos | lab_basos | lab_bands | lab_wbc_count | lab_rbc_count | protein_limit_of_detection | lab_protein | gluc_limit_of_detection | lab_glucose |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Jane | 3/4/2013 | 15:27 | 8 | 60 | 31 | NA | NA | 1 | 149 | 1000 | 1 | 56 | NA | NA |
Thank you in advance
CodePudding user response:
We could group by the id/dt/time
, then use tidyr::fill
on all columns (ie everything()
) to take any non-NA values and copy them first down through any NAs and then up though any NAs ("downup", my arbitrary choice). Finally, we can keep just the first slice within each group and then remove the grouping.
library(tidyverse)
df %>%
group_by(study_id,lab_study_dt,lab_study_time) %>%
fill(everything(), .direction = "downup") %>%
slice(1) %>%
ungroup()
Is this the output you'd expect?
# A tibble: 5 × 15
study_id lab_study_dt lab_study_time lab_polys lab_lymphs lab_mono lab_eos lab_basos lab_bands lab_wbc_count lab_rbc_count protein_limit_of_detection lab_protein gluc_limit_of_detecti… lab_glucose
<chr> <chr> <chr> <int> <int> <int> <lgl> <lgl> <int> <int> <int> <int> <chr> <int> <int>
1 George 4/20/2021 21:18 NA 60 NA NA NA NA NA 10000 NA NA NA NA
2 George 4/23/2021 12:27 NA 65 31 NA NA 1 149 56 1 >10 NA NA
3 George 4/23/2021 15:27 8 NA NA NA NA NA NA NA NA NA 2 3
4 Jane 3/4/2013 15:27 8 60 31 NA NA 1 149 10000 1 56 NA NA
5 Jane 8/13/2011 0:12 NA NA NA NA NA NA 1 NA NA NA NA NA