How to replace missing values for observations that are missing on the first event?

Time:10-02

I am imputing a social class variable in panel data. I want to replace every individual's social class values with the class observed in their first wave. However, some individuals have missing data in the first wave. I would also like to impute the social class of these individuals, but only from the first wave in which the social class variable is observed. If I use the following command:

df %>% 
  group_by(id) %>%
  dplyr::mutate(class_imputed = first(class))

# Groups:   id [4]
      id  wave  year class class_imputed
   <dbl> <dbl> <dbl> <fct> <fct>        
 1     1     1  2007 3     3            
 2     1     2  2008 2     3            
 3     1     3  2009 2     3            
 4     1     4  2010 2     3            
 5     2     1  2005 .     .            
 6     2     2  2006 2     .            
 7     2     3  2007 3     .            
 8     2     4  2008 .     .            
 9     3     1  2007 .     .            
10     3     2  2008 .     .            
11     3     3  2009 1     .            
12     3     4  2010 1     .            
13     4     1  2009 2     2            
14     4     2  2010 .     2            
15     4     3  2011 3     2            
16     4     4  2012 .     2    

We can see that the individuals with IDs 2 and 3 have not been imputed. I would like to impute only from the first time the social class variable is observed onward. This means that the person with ID 2 would get social class = 2 in 2006, 2007 and 2008, but not in 2005. Could someone help, please?

Here is the data:

df <- structure(list(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 
4, 4, 4), wave = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 
3, 4), year = c(2007, 2008, 2009, 2010, 2005, 2006, 2007, 2008, 
2007, 2008, 2009, 2010, 2009, 2010, 2011, 2012), class = structure(c(4L, 
3L, 3L, 3L, 1L, 3L, 4L, 1L, 1L, 1L, 2L, 2L, 3L, 1L, 4L, 1L), .Label = c(".", 
"1", "2", "3"), class = "factor"), class_imputed = structure(c(3L, 
3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c(".", 
"2", "3"), class = "factor")), row.names = c(NA, -16L), class = "data.frame")
 

CodePudding user response:

You may try this option -

library(dplyr)

df %>%
  group_by(id) %>%
  dplyr::mutate(class_imputed = replace(class, 
                      row_number() > match(TRUE, class != '.'), 
                      class[class != '.'][1]))

#     id  wave  year class class_imputed
#   <dbl> <dbl> <dbl> <fct> <fct>        
# 1     1     1  2007 3     3            
# 2     1     2  2008 2     3            
# 3     1     3  2009 2     3            
# 4     1     4  2010 2     3            
# 5     2     1  2005 .     .            
# 6     2     2  2006 2     2            
# 7     2     3  2007 3     2            
# 8     2     4  2008 .     2            
# 9     3     1  2007 .     .            
#10     3     2  2008 .     .            
#11     3     3  2009 1     1            
#12     3     4  2010 1     1            
#13     4     1  2009 2     2            
#14     4     2  2010 .     2            
#15     4     3  2011 3     2            
#16     4     4  2012 .     2     

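match(TRUE, class != '.') returns the first position where the value is not '.', and class[class != '.'][1] returns the value to be imputed, i.e. the first observed social class. replace() then overwrites every row after that position with this value, so rows before the first observation keep '.'.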
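To see how the pieces of that expression behave on their own, here is a minimal sketch on a single toy vector that mirrors individual 2 from the question. The names first_pos, first_val and imputed are made up for illustration, and seq_along() stands in for row_number() because this runs outside a grouped mutate().

```r
# Toy vector for one individual; "." marks a missing wave, as in the question's data.
class <- factor(c(".", "2", "3", "."), levels = c(".", "1", "2", "3"))

# Position of the first wave with an observed class.
first_pos <- match(TRUE, class != ".")

# The first observed class itself (the value to carry forward).
first_val <- class[class != "."][1]

# Overwrite every position after first_pos with the first observed class;
# positions before it are left untouched.
imputed <- replace(class, seq_along(class) > first_pos, first_val)

print(first_pos)
print(as.character(imputed))
```

The position at first_pos already holds the first observed class, which is why a strict ">" comparison is enough.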

CodePudding user response:

library(tidyverse)

data <- structure(list(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
  wave = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  year = c(2007, 2008, 2009, 2010, 2005, 2006, 2007, 2008,
           2007, 2008, 2009, 2010, 2009, 2010, 2011, 2012),
  class = structure(c(4L, 3L, 3L, 3L, 1L, 3L, 4L, 1L,
                      1L, 1L, 2L, 2L, 3L, 1L, 4L, 1L),
                    .Label = c(".", "1", "2", "3"), class = "factor"),
  class_imputed = structure(c(3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L,
                              1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
                            .Label = c(".", "2", "3"), class = "factor")
), row.names = c(NA, -16L), class = "data.frame")

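data %>%
  na_if(".") %>%                      # turn the "." placeholder into NA
  mutate(class_imputed = class) %>%   # start from a copy of the observed class
  fill(class_imputed)                 # carry the last observed value downward; not grouped by id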
#>    id wave year class class_imputed
#> 1   1    1 2007     3             3
#> 2   1    2 2008     2             2
#> 3   1    3 2009     2             2
#> 4   1    4 2010     2             2
#> 5   2    1 2005  <NA>             2
#> 6   2    2 2006     2             2
#> 7   2    3 2007     3             3
#> 8   2    4 2008  <NA>             3
#> 9   3    1 2007  <NA>             3
#> 10  3    2 2008  <NA>             3
#> 11  3    3 2009     1             1
#> 12  3    4 2010     1             1
#> 13  4    1 2009     2             2
#> 14  4    2 2010  <NA>             2
#> 15  4    3 2011     3             3
#> 16  4    4 2012  <NA>             3

Created on 2021-10-01 by the reprex package (v2.0.1)
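Note that in the output above the fill is not grouped by id, so rows 5, 9 and 10 pick up a class from the previous individual, and later waves take the most recent class rather than the first one. A possible variant of the same na_if()/fill() idea that stays within each id and carries only the first observed class forward is sketched below. It is not the answerer's code: the panel data frame is a simplified character-column copy of the question's data, and the names panel, first_obs and result are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Simplified copy of the question's data (class kept as character, not factor).
panel <- data.frame(
  id = rep(1:4, each = 4),
  wave = rep(1:4, times = 4),
  class = c("3", "2", "2", "2",
            ".", "2", "3", ".",
            ".", ".", "1", "1",
            "2", ".", "3", "."),
  stringsAsFactors = FALSE
)

result <- panel %>%
  mutate(class = na_if(class, ".")) %>%      # "." becomes NA
  group_by(id) %>%
  mutate(
    # position of the first observed class within each id
    first_obs = match(TRUE, !is.na(class)),
    # keep the class only at that position; everything else starts as NA
    class_imputed = if_else(row_number() == first_obs, class, NA_character_)
  ) %>%
  fill(class_imputed, .direction = "down") %>%  # fill downward within each id only
  ungroup() %>%
  select(-first_obs)

print(result)
```

Under these assumptions, waves before the first observation stay NA (for example wave 1 of id 2), and every later wave receives the first observed class for that individual.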
