Home > other >  Fill NAs with numerical sequence by group
Fill NAs with numerical sequence by group

Time:11-04

I'd like to uniquely fill NAs with a numerical sequence by group.

Here is a reproducible example:

library(missForest)
library(dplyr)

set.seed(1)
d <- data.frame(id = rep(1:3,each=5),
       year = rep(1997:2001,3))
d$year<- prodNA(d[2],noNA=0.3)
d<-arrange(d,id,year)
d
#>    id year
#> 1   1 1998
#> 2   1 1999
#> 3   1 2001
#> 4   1   NA
#> 5   1   NA
#> 6   2 1997
#> 7   2 1999
#> 8   2 2001
#> 9   2   NA
#> 10  2   NA
#> 11  3 1997
#> 12  3 1998
#> 13  3 1999
#> 14  3 2000
#> 15  3 2001

NAs should be replaced by a non-already-taken value of the numerical sequence 1997 to 2001; so that the resulted data frame looks like this:

data.frame(id = rep(1:3, each = 5), year = rep(1997:2001, 3))
#>    id year
#> 1   1 1997
#> 2   1 1998
#> 3   1 1999
#> 4   1 2000
#> 5   1 2001
#> 6   2 1997
#> 7   2 1998
#> 8   2 1999
#> 9   2 2000
#> 10  2 2001
#> 11  3 1997
#> 12  3 1998
#> 13  3 1999
#> 14  3 2000
#> 15  3 2001

CodePudding user response:

We may use coalesce

library(dplyr)
d$year <- coalesce(prodNA(d[2],noNA=0.3)$year, d$year)

If it needs to be grouped by

library(dplyr)
d %>% 
   group_by(id) %>% 
   mutate(year = coalesce(prodNA(cur_data()["year"], noNA = 0.3)$year, year)) %>%
   ungroup

-output

# A tibble: 15 × 2
      id  year
   <int> <int>
 1     1  1997
 2     1  1998
 3     1  1999
 4     1  2000
 5     1  2001
 6     2  1997
 7     2  1998
 8     2  1999
 9     2  2000
10     2  2001
11     3  1997
12     3  1998
13     3  1999
14     3  2000
15     3  2001

data

set.seed(1)
d <- data.frame(id = rep(1:3,each=5),
       year = rep(1997:2001,3))

CodePudding user response:

Base R alternative:

d$year2 <- ave(d$year, d$id, FUN = function(z) { z[is.na(z)] <- setdiff(1997:2001, z); z;})
d
#    id year year2
# 1   1 1998  1998
# 2   1 1999  1999
# 3   1 2001  2001
# 4   1   NA  1997
# 5   1   NA  2000
# 6   2 1997  1997
# 7   2 1999  1999
# 8   2 2001  2001
# 9   2   NA  1998
# 10  2   NA  2000
# 11  3 1997  1997
# 12  3 1998  1998
# 13  3 1999  1999
# 14  3 2000  2000
# 15  3 2001  2001

ave is a simple function to do operations on a vector (first arg) based on grouping criteria (a list-like second arg). The function is given a single vector of values for one group at a time, so with this data the anon-func is called three times. The only gotcha with ave is that ave always coerces the function's return value to be the same class as the input vector, which can be frustrating and nonsensical (I'm sure there was reason for that behavior at some point).

Knowing that z at any one time is all of year for one group at a time, the inner workings of that function are simple: replace all NA values with the values missing from 1997:2001. This has a huge caveat, though: this presumes that the only legitimate values in z are c(NA, 1997:2001); if there is ever something else, this may very fail/explode or cause your neighbor's car to mis-fire.

In that sense, this method is a little fragile. To be more resilient, we'd likely need to have some allowances. For instance, is it okay to leave values outside of 1997:2001 in the data, or should they be replaced as well? If not, then is there a heuristic for determining which of the missing values are imputed into the NA values?


Data

d <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), year = c(1998L, 1999L, 2001L, NA, NA, 1997L, 1999L, 2001L, NA, NA, 1997L, 1998L, 1999L, 2000L, 2001L)), class = "data.frame", row.names = c(NA, -15L))
  • Related