My dataframe looks like:
df <- data.frame(ID=c("A", "A", "A", "A",
"B", "B", "B", "B",
"C", "C", "C", "C",
"D", "D", "D", "D"),
grade=c("KG", "01", "02", "03",
"KG", "01", "02", "03",
"KG", "01", "02", "03",
"KG", "01", "02", "03"),
year=c(2002, 2003, NA, 2005,
2007, NA, NA, 2010,
NA, 2005, 2006, NA,
2009, 2010, NA, NA))
I would like to be able to impute the missing year
values by ID
, with the following desired results:
wanted_df <- data.frame(ID=c("A", "A", "A", "A",
"B", "B", "B", "B",
"C", "C", "C", "C",
"D", "D", "D", "D"),
grade=c("KG", "01", "02", "03",
"KG", "01", "02", "03",
"KG", "01", "02", "03",
"KG", "01", "02", "03"),
year=c(2002, 2003, 2004, 2005,
2007, 2008, 2009, 2010,
2004, 2005, 2006, 2007,
2009, 2010, 2011, 2012))
I have attempted to impute the values using:
lag()
andlead()
functions- Joining to a dataframe consisting of years
Neither have worked. Any help would be greatly appreciated. Thank you.
CodePudding user response:
We may use na_interpolate/na_extrapolate
library(dplyr)
# remotes::install_github("skgrange/threadr")
library(threadr)
df %>%
group_by(ID) %>%
mutate(year = na_extrapolate(na_interpolate(year))) %>%
ungroup
-output
# A tibble: 16 × 3
ID grade year
<chr> <chr> <dbl>
1 A KG 2002
2 A 01 2003
3 A 02 2004
4 A 03 2005
5 B KG 2007
6 B 01 2008
7 B 02 2009
8 B 03 2010
9 C KG 2004.
10 C 01 2005
11 C 02 2006
12 C 03 2007
13 D KG 2009
14 D 01 2010
15 D 02 2011
16 D 03 2012.
CodePudding user response:
This is not a solution that I would recommend: But this question raised another question, see here Run ifelse x times with one call (quasi iteratively until condition is fulfilled)
library(dplyr)
df %>%
group_by(ID) %>%
mutate(year= ifelse(is.na(year), lag(year) 1, year),
year= ifelse(is.na(year), lag(year) 1, year),
year= ifelse(is.na(year), lead(year)-1, year))
ID grade year
<chr> <chr> <dbl>
1 A KG 2002
2 A 01 2003
3 A 02 2004
4 A 03 2005
5 B KG 2007
6 B 01 2008
7 B 02 2009
8 B 03 2010
9 C KG 2004
10 C 01 2005
11 C 02 2006
12 C 03 2007
13 D KG 2009
14 D 01 2010
15 D 02 2011
16 D 03 2012
CodePudding user response:
Could do something like this:
year_imputer <- function(years){
# Find one non-missing data-point
ref_indx <- which(!is.na(years))[1]
# Make it the reference point
ref <- years[ref_indx]
# Get the length of the years
years_len <- length(years)
# Generate the sequence
(ref - (ref_indx - 1)):(ref (years_len - ref_indx))
}
library(dplyr)
df %>%
group_by(ID) %>%
mutate(
year = year_imputer(year)
) %>%
ungroup()
Output:
# A tibble: 16 x 3
ID grade year
<chr> <chr> <int>
1 A KG 2002
2 A 01 2003
3 A 02 2004
4 A 03 2005
5 B KG 2007
6 B 01 2008
7 B 02 2009
8 B 03 2010
9 C KG 2004
10 C 01 2005
11 C 02 2006
12 C 03 2007
13 D KG 2009
14 D 01 2010
15 D 02 2011
16 D 03 2012
CodePudding user response:
Yet another solution:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(year = c(min(year,na.rm = T) - (which.min(year)-1):0,
min(year,na.rm = T) 1:(n()-which.min(year)))) %>% ungroup
#> # A tibble: 16 × 3
#> ID grade year
#> <chr> <chr> <dbl>
#> 1 A KG 2002
#> 2 A 01 2003
#> 3 A 02 2004
#> 4 A 03 2005
#> 5 B KG 2007
#> 6 B 01 2008
#> 7 B 02 2009
#> 8 B 03 2010
#> 9 C KG 2004
#> 10 C 01 2005
#> 11 C 02 2006
#> 12 C 03 2007
#> 13 D KG 2009
#> 14 D 01 2010
#> 15 D 02 2011
#> 16 D 03 2012
CodePudding user response:
You can use any (linear) interpolation function for your specific problem. E.g. from imputeTS
package, but also from zoo
and others. But this solution probably wouldn't work any more, when you are dealing with POSIXct data types instead of just numbers.
Also worth noting, that this only works as long as every missing yeear is actually inserted as NA values (instead of just left out). For this case (missing years just left out) called implicit missing values the tsibble
package has a function called fill_gaps()
.
library("imputeTS")
library("dplyr")
df <- data.frame(ID=c("A", "A", "A", "A",
"B", "B", "B", "B",
"C", "C", "C", "C",
"D", "D", "D", "D"),
grade=c("KG", "01", "02", "03",
"KG", "01", "02", "03",
"KG", "01", "02", "03",
"KG", "01", "02", "03"),
year=c(2002, 2003, NA, 2005,
2007, NA, NA, 2010,
NA, 2005, 2006, NA,
2009, 2010, NA, NA))
df %>% group_by(ID) %>% mutate(year = na_interpolation(year)) %>% ungroup