I am trying to group my dataframe and set the last N values of a column in each group to NA. I can do it for N = 1 like so:
df %>% group_by(ID) %>% mutate(target = c(target[-n()], NA))
But am struggling to get it to work for any N
This is my current attempt:
df %>% group_by(ID) %>% mutate(target = c(target[1:(abs(n()-1))], NA))
But this seems to fail for groups of size 1
I also tried:
df %>% group_by(ID) %>% mutate(target = ifelse(n()==1, target, c(target[1:(abs(n()-1))], NA)))
But the else clause never takes effect.
Any advice would be appreciated, thanks.
CodePudding user response:
We could use
library(dplyr)
df %>%
group_by(ID) %>%
mutate(target = replace(target, tail(row_number(), N), NA))
CodePudding user response:
You can use case_when()
for a vectorized solution in dplyr
.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(
target = case_when(row_number() <= n() - N ~ target)
)
My thanks to @akrun for pointing out that case_when()
defaults to an NA
of the proper type, so case_when()
automatically fills the last N
with NA
s.
Update
The solution by @akrun is more performant: when benchmarked at times = 50
repetitions apiece
library(microbenchmark)
big_df <- tibble(ID = rep(letters, 100000)) %>%
mutate(target = row_number()) %>%
group_by(ID)
microbenchmark(
times = 50,
Greg = {
big_df %>%
mutate(
target = case_when(row_number() <= n() - N ~ target)
)
},
akrun = {
big_df %>%
mutate(
target = replace(target, tail(row_number(), N), NA)
)
}
)
it is about 35% faster than mine at scale (2600000
rows):
Unit: milliseconds
expr min lq mean median uq max neval
Greg 82.6337 90.9669 128.93278 96.35760 213.3593 258.8570 50
akrun 52.4519 55.8314 63.40997 61.43945 64.1082 196.4069 50