I am trying to group my dataframe and set the last N values of a column in each group to NA. I can do it for N = 1 like so:

df %>% group_by(ID) %>% mutate(target = c(target[-n()], NA))

But am struggling to get it to work for any N

This is my current attempt:

df %>% group_by(ID) %>% mutate(target = c(target[1:(abs(n()-1))], NA))

But this seems to fail for groups of size 1

I also tried:

df %>% group_by(ID) %>% mutate(target = ifelse(n()==1, target, c(target[1:(abs(n()-1))], NA)))

But the else clause never takes effect.

Any advice would be appreciated, thanks.

CodePudding user response：

We could use

library(dplyr)
df %>% 
   group_by(ID) %>%
   mutate(target = replace(target, tail(row_number(), N), NA))

CodePudding user response：

You can use case_when() for a vectorized solution in dplyr.

library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(
    target = case_when(row_number() <= n() - N ~ target)
  )

My thanks to @akrun for pointing out that case_when() defaults to an NA of the proper type, so case_when() automatically fills the last N with NAs.

Update

The solution by @akrun is more performant: when benchmarked at times = 50 repetitions apiece

library(microbenchmark)

big_df <- tibble(ID = rep(letters, 100000)) %>%
  mutate(target = row_number()) %>%
  group_by(ID)

microbenchmark(
  times = 50,
  Greg = {
    big_df %>%
      mutate(
        target = case_when(row_number() <= n() - N ~ target)
      )
  },
  akrun = {
    big_df %>%
      mutate(
        target = replace(target, tail(row_number(), N), NA)
      )
  }
)

it is about 35% faster than mine at scale (2600000 rows):

Unit: milliseconds
  expr     min      lq      mean   median       uq      max neval
  Greg 82.6337 90.9669 128.93278 96.35760 213.3593 258.8570    50
 akrun 52.4519 55.8314  63.40997 61.43945  64.1082 196.4069    50