I have four people who are followed for four years. I would like to replace the NA by 0, but only for people who has a positive value in workhours on the first wave they were interviewed. For example, in my data, this means that the persons with ID 3 and 4 will have their data replaced by 0, but the person with ID 2 will keep his/her NA.
id wave year work_hours
1 1 2007 40
1 2 2008 39
1 3 2009 39
1 4 2010 38
2 1 2005 NA
2 2 2006 35
2 3 2007 35
2 4 2008 NA
3 1 2007 40
3 2 2008 NA
3 3 2009 40
3 4 2010 40
4 1 2009 32
4 2 2010 NA
4 3 2011 32
4 4 2012 NA
I tried the following code, but it is replacing the first wave with 0 but not the waves that follows:
df= df %>% group_by(id) %>%
mutate(workhours_imputed= ifelse(work_hours>0 & wave==1, replace_na(0), work_hours))
Here Is the Data:
structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4,
4, 4, 4), wave = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2,
3, 4), year = c(2007, 2008, 2009, 2010, 2005, 2006, 2007, 2008,
2007, 2008, 2009, 2010, 2009, 2010, 2011, 2012), work_hours = c(40,
39, 39, 38, NA, 35, 35, NA, 40, NA, 40, 40, 32, NA, 32, NA),
workhours_imputed = c(0, 39, 39, 38, NA, 35, 35, NA, 0, NA,
40, 40, 0, NA, 32, NA)), row.names = c(NA, -16L), groups = structure(list(
id = c(1, 2, 3, 4), .rows = structure(list(1:4, 5:8, 9:12,
13:16), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
CodePudding user response:
One way to do this using match
-
library(dplyr)
df %>%
group_by(id) %>%
mutate(workhours_imputed = {
tmp <- work_hours[match(1, wave)]
#If the 1st wave has a positive value
#replace NA with 0
if(!is.na(tmp) && tmp > 0) replace(work_hours, is.na(work_hours), 0) else work_hours
})
# id wave year work_hours workhours_imputed
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2007 40 40
# 2 1 2 2008 39 39
# 3 1 3 2009 39 39
# 4 1 4 2010 38 38
# 5 2 1 2005 NA NA
# 6 2 2 2006 35 35
# 7 2 3 2007 35 35
# 8 2 4 2008 NA NA
# 9 3 1 2007 40 40
#10 3 2 2008 NA 0
#11 3 3 2009 40 40
#12 3 4 2010 40 40
#13 4 1 2009 32 32
#14 4 2 2010 NA 0
#15 4 3 2011 32 32
#16 4 4 2012 NA 0
CodePudding user response:
Alternative dplyr solution:
df %>%
mutate(workhours_imputed = if_else(
is.na(work_hours) & any(wave == 1 & !is.na(work_hours)),
0, work_hours)
)
# # A tibble: 16 x 5
# # Groups: id [4]
# id wave year work_hours workhours_imputed
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2007 40 40
# 2 1 2 2008 39 39
# 3 1 3 2009 39 39
# 4 1 4 2010 38 38
# 5 2 1 2005 NA NA
# 6 2 2 2006 35 35
# 7 2 3 2007 35 35
# 8 2 4 2008 NA NA
# 9 3 1 2007 40 40
# 10 3 2 2008 NA 0
# 11 3 3 2009 40 40
# 12 3 4 2010 40 40
# 13 4 1 2009 32 32
# 14 4 2 2010 NA 0
# 15 4 3 2011 32 32
# 16 4 4 2012 NA 0
If wave
does not always start at 1 but you always want to check the first value of wave
, then you can use this instead:
df %>%
mutate(workhours_imputed = if_else(
is.na(work_hours) & !is.na(work_hours[which.min(wave)]),
0, work_hours)
)