Retrieve first non NA value based on another (time) variable


I'm currently working with some time-series data that has a fair bit of missing data, and I want to find the first recorded value of each variable for each individual.

I am trying to create a summary of the first (non-NA) value for each of a series of variables within a tibble. (Most of my work uses tidyverse packages, but I'm happy to try data.table or other approaches; mine is already a hybrid/bastardisation.)

Example Data:

     id  time  val1  val2
  <chr> <dbl> <chr> <chr>
1     a     2    NA     x
2     a     3     x    NA
3     b     1     y     y
4     b     3    NA     w
5     c     3    NA    NA
6     c     5     w     q

Example code:

tibble(
  id = c('a','a','b','b','c','c'), 
  time = c(2,3,1,3,3,5), 
  val1 = c(NA,'x','y',NA,NA,'w'), 
  val2 = c('x',NA,'y','w',NA,'q')) %>% 
  group_by(id) %>%  
  summarise(across(starts_with('val'), ~ .x[which.min(time)]))

But this also returns NA values:

# A tibble: 3 x 3
     id  val1  val2
  <chr> <chr> <chr>
1     a    NA     x
2     b     y     y
3     c    NA    NA

The return I am hoping for is:

# A tibble: 3 x 3
     id  val1  val2
  <chr> <chr> <chr>
1     a    x     x
2     b    y     y
3     c    w     q

I haven't used across() much since its introduction, but this seems like it should be simple and I'm just being a bit slow! My attempts to work an is.na() in there somewhere all seem to fail. I am trying to do this across quite a few variables, hence my desire to use across().

Hopefully this is enough info. Thanks in advance for any help!

EDIT: in my actual data the time variable is a drtn (duration) variable, but I also have dttm data.

CodePudding user response:

I believe the below achieves what you're after (note that it keeps the group_by(id) from your pipeline):

df %>% 
  group_by(id) %>% 
  summarise(across(starts_with('val'), ~ .x[!is.na(.x)][which.min(time)]))
# A tibble: 3 x 3
  id    val1  val2 
  <chr> <chr> <chr>
1 a     x     x    
2 b     y     y    
3 c     w     q   
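
One caveat worth noting (my observation, not part of the original answer): `.x[!is.na(.x)][which.min(time)]` indexes the NA-filtered values with a position computed from the full `time` vector, so it relies on rows being sorted by `time` within each group, as they are in the example data. A sketch of a variant that subsets `time` the same way, so the indexing stays consistent even for unsorted data:

```r
library(dplyr)

# Subset time with the same !is.na(.x) mask as the values, so
# which.min() indexes the filtered vector, not the full one
df %>% 
  group_by(id) %>% 
  summarise(across(starts_with('val'),
                   ~ .x[!is.na(.x)][which.min(time[!is.na(.x)])]))
```

For the example data (which is already sorted by time within each id) both versions give the same result.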

See also a dtplyr solution, which IMO keeps the best of both worlds: clear dplyr-style syntax with the speed of data.table.

df <- tibble(
  id = c('a','a','b','b','c','c'), 
  time=c(2,3,1,3,3,5), 
  val1=c(NA,'x','y',NA,NA,'w'), 
  val2=c('x',NA,'y','w',NA,'q')) 

# original sol ------------------------------------------------------------

df %>% 
  group_by(id) %>%  
  summarise(across(starts_with('val'), ~ .x[!is.na(.x)][which.min(time)]))

# dtplyr sol --------------------------------------------------------------

library(dtplyr)
df %>% lazy_dt() %>% 
  group_by(id) %>%  
  summarise(across(starts_with('val'), ~ .x[!is.na(.x)][which.min(time)])) %>% 
  as_tibble()

The data isn't large enough to see much in terms of timing improvements for this example, however.

Benchmarking:

library(microbenchmark)
microbenchmark(
  dplyr = df %>% group_by(id) %>%  summarise(across(starts_with('val'), ~ .x[!is.na(.x)][which.min(time)])),
  dtplyr = df %>% lazy_dt %>% group_by(id) %>%  summarise(across(starts_with('val'), ~ .x[!is.na(.x)][which.min(time)]))
)
Unit: milliseconds
   expr    min      lq     mean  median     uq     max neval
  dplyr 4.7532 5.18185 5.733455 5.42160 5.5945 22.7650   100
 dtplyr 4.7024 5.09795 5.428136 5.27685 5.5639 10.7739   100
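
For completeness, a sketch of the equivalent in raw data.table (my addition; it assumes the same `df` as above and selects the value columns by name via `patterns()`):

```r
library(data.table)

dt <- as.data.table(df)
# Order by time, then take the first non-NA value of each val column per id
dt[order(time),
   lapply(.SD, function(x) x[!is.na(x)][1]),
   by = id, .SDcols = patterns("^val")]
```

Ordering in `i` before grouping means each group's rows arrive sorted by time, so `x[!is.na(x)][1]` is the earliest non-NA value.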