Currently working with some time series data and I want to find the first value of each variable recorded for individuals, with a fair bit of missing data.
I am trying to create a summary of the first (non NA) value for each of a series of variables within a tibble. (Most of my work uses tidyverse packages, but happy to try data.table or other approaches, mine is already a hybrid/bastardisation))
Example Data:
id time val1 val2
<chr> <dbl> <dbl> <dbl>
1 a 2 NA x
2 a 3 x NA
3 b 1 y y
4 b 3 NA w
5 c 3 NA NA
6 c 5 w q
Example code:
tibble(
id = c(a,a,b,b,c,c),
time=c(2,3,1,3,3,5),
val1=c(NA,x,y,NA,NA,w),
val2=c(x,NA,y,w,NA,q)) %>%
group_by(id)%>%
summarise(across(starts_with('val'), ~ .x[which.min(time)]))
But it also returns NA values.
# A tibble: 3 x 3
id val1 val2
<chr> <chr> <chr>
1 a NA x
2 b y y
3 c NA NA
The return I am hoping for is:
# A tibble: 3 x 3
id val1 val2
<chr> <chr> <chr>
1 a x x
2 b y y
3 c w q
I haven't used the across() very much since its introduction. But this seems like it should be simple and I'm just being a bit slow! My attempts to string a is.na() in there somewhere all seem to fail. I am trying to do this across quite a few variables, hence my desire to use across.
Hopefully this is enough info. Thanks in advance for any help!
EDIT: in my actual data the time variable is a dtrn variable but I also have dttm data.
CodePudding user response:
The below I believe achieves what you're after.
df %>% summarise(across(starts_with('val'), ~ .x[!is.na(.x)][which.min(time)]))
# A tibble: 3 x 3
id val1 val2
<chr> <chr> <chr>
1 a x x
2 b y y
3 c w q
See also a dtplyr
solution keeping imo the best of both worlds: a clear dplyr
-style syntax with the speed of data.table
.
df <- tibble(
id = c('a','a','b','b','c','c'),
time=c(2,3,1,3,3,5),
val1=c(NA,'x','y',NA,NA,'w'),
val2=c('x',NA,'y','w',NA,'q'))
# original sol ------------------------------------------------------------
df %>%
group_by(id) %>%
summarise(across(starts_with('val'), ~ .x[!is.na(.x)][which.min(time)]))
# dtplyr sol --------------------------------------------------------------
library(dtplyr)
df %>% lazy_dt %>%
group_by(id) %>%
summarise(across(starts_with('val'), ~ .x[!is.na(.x)][which.min(time)])) %>%
as_tibble()
The data isn't large enough to see much in terms of time improvements for this example however.
benchmarking
library(microbenchmark)
microbenchmark(
dplyr = df %>% group_by(id) %>% summarise(across(starts_with('val'), ~ .x[!is.na(.x)][which.min(time)])),
dtplyr = df %>% lazy_dt %>% group_by(id) %>% summarise(across(starts_with('val'), ~ .x[!is.na(.x)][which.min(time)]))
)
Unit: milliseconds
expr min lq mean median uq max neval
dplyr 4.7532 5.18185 5.733455 5.42160 5.5945 22.7650 100
dtplyr 4.7024 5.09795 5.428136 5.27685 5.5639 10.7739 100