Home > Blockchain >  Is there an R function for imputing missing year values, consecutively, by group?
Is there an R function for imputing missing year values, consecutively, by group?

Time:12-09

My dataframe looks like:

df <- data.frame(ID=c("A", "A", "A", "A", 
                      "B", "B", "B", "B",
                      "C", "C", "C", "C",
                      "D", "D", "D", "D"),
                 grade=c("KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03"),
                 year=c(2002, 2003, NA, 2005,
                        2007, NA, NA, 2010,
                        NA, 2005, 2006, NA,
                        2009, 2010, NA, NA))

I would like to be able to impute the missing year values by ID, with the following desired results:

wanted_df <- data.frame(ID=c("A", "A", "A", "A", 
                             "B", "B", "B", "B",
                             "C", "C", "C", "C",
                             "D", "D", "D", "D"),
                       grade=c("KG", "01", "02", "03",
                               "KG", "01", "02", "03",
                               "KG", "01", "02", "03",
                               "KG", "01", "02", "03"),
                       year=c(2002, 2003, 2004, 2005,
                              2007, 2008, 2009, 2010,
                              2004, 2005, 2006, 2007,
                              2009, 2010, 2011, 2012))

I have attempted to impute the values using:

  • lag() and lead() functions
  • Joining to a dataframe consisting of years

Neither have worked. Any help would be greatly appreciated. Thank you.

CodePudding user response:

We may use na_interpolate/na_extrapolate

library(dplyr)
# remotes::install_github("skgrange/threadr")
library(threadr)
df %>% 
   group_by(ID) %>% 
   mutate(year = na_extrapolate(na_interpolate(year))) %>%
   ungroup

-output

# A tibble: 16 × 3
   ID    grade  year
   <chr> <chr> <dbl>
 1 A     KG    2002 
 2 A     01    2003 
 3 A     02    2004 
 4 A     03    2005 
 5 B     KG    2007 
 6 B     01    2008 
 7 B     02    2009 
 8 B     03    2010 
 9 C     KG    2004.
10 C     01    2005 
11 C     02    2006 
12 C     03    2007 
13 D     KG    2009 
14 D     01    2010 
15 D     02    2011 
16 D     03    2012.

CodePudding user response:

This is not a solution that I would recommend: But this question raised another question, see here Run ifelse x times with one call (quasi iteratively until condition is fulfilled)

library(dplyr)
df %>% 
  group_by(ID) %>% 
  mutate(year= ifelse(is.na(year), lag(year) 1, year),
         year= ifelse(is.na(year), lag(year) 1, year),
         year= ifelse(is.na(year), lead(year)-1, year))
   ID    grade  year
   <chr> <chr> <dbl>
 1 A     KG     2002
 2 A     01     2003
 3 A     02     2004
 4 A     03     2005
 5 B     KG     2007
 6 B     01     2008
 7 B     02     2009
 8 B     03     2010
 9 C     KG     2004
10 C     01     2005
11 C     02     2006
12 C     03     2007
13 D     KG     2009
14 D     01     2010
15 D     02     2011
16 D     03     2012

CodePudding user response:

Could do something like this:

year_imputer <- function(years){
    # Find one non-missing data-point
    ref_indx <- which(!is.na(years))[1]
    # Make it the reference point
    ref <- years[ref_indx]
    # Get the length of the years
    years_len <- length(years)
    # Generate the sequence
    (ref - (ref_indx - 1)):(ref (years_len - ref_indx))
}
library(dplyr)
df %>% 
    group_by(ID) %>% 
    mutate(
        year = year_imputer(year)
    ) %>% 
    ungroup()

Output:

# A tibble: 16 x 3
   ID    grade  year
   <chr> <chr> <int>
 1 A     KG     2002
 2 A     01     2003
 3 A     02     2004
 4 A     03     2005
 5 B     KG     2007
 6 B     01     2008
 7 B     02     2009
 8 B     03     2010
 9 C     KG     2004
10 C     01     2005
11 C     02     2006
12 C     03     2007
13 D     KG     2009
14 D     01     2010
15 D     02     2011
16 D     03     2012

CodePudding user response:

Yet another solution:

library(dplyr)

df %>% 
  group_by(ID) %>% 
  mutate(year = c(min(year,na.rm = T) - (which.min(year)-1):0, 
                  min(year,na.rm = T)   1:(n()-which.min(year)))) %>% ungroup

#> # A tibble: 16 × 3
#>    ID    grade  year
#>    <chr> <chr> <dbl>
#>  1 A     KG     2002
#>  2 A     01     2003
#>  3 A     02     2004
#>  4 A     03     2005
#>  5 B     KG     2007
#>  6 B     01     2008
#>  7 B     02     2009
#>  8 B     03     2010
#>  9 C     KG     2004
#> 10 C     01     2005
#> 11 C     02     2006
#> 12 C     03     2007
#> 13 D     KG     2009
#> 14 D     01     2010
#> 15 D     02     2011
#> 16 D     03     2012

CodePudding user response:

You can use any (linear) interpolation function for your specific problem. E.g. from imputeTS package, but also from zoo and others. But this solution probably wouldn't work any more, when you are dealing with POSIXct data types instead of just numbers. Also worth noting, that this only works as long as every missing yeear is actually inserted as NA values (instead of just left out). For this case (missing years just left out) called implicit missing values the tsibble package has a function called fill_gaps().

library("imputeTS")
library("dplyr")

df <- data.frame(ID=c("A", "A", "A", "A", 
                      "B", "B", "B", "B",
                      "C", "C", "C", "C",
                      "D", "D", "D", "D"),
                 grade=c("KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03",
                         "KG", "01", "02", "03"),
                 year=c(2002, 2003, NA, 2005,
                        2007, NA, NA, 2010,
                        NA, 2005, 2006, NA,
                        2009, 2010, NA, NA))

df %>% group_by(ID) %>% mutate(year = na_interpolation(year)) %>% ungroup
  • Related