I have a very large (~30M observations) dataframe in R and I am having trouble with a new column I want to create.
The data is formatted like this:
Country Year Value
1 A 2000 1
2 A 2001 NA
3 A 2002 2
4 B 2000 4
5 B 2001 NA
6 B 2002 NA
7 B 2003 3
My problem is that I would like to impute the NAs in the value column based on other values in that column. Specifically, if there is a non-NA value for the same country I would like that to replace the NA in later years, until there is another non-NA value.
The data above would therefore be transformed into this:
Country Year Value
1 A 2000 1
2 A 2001 1
3 A 2002 2
4 B 2000 4
5 B 2001 4
6 B 2002 4
7 B 2003 3
To solve this, I first tried using a loop with a lookup function and also some if_else statements, but wasn't able to get it to behave as I expected. In general, I am struggling to find an efficient solution that will be able to perform the task in the order of minutes-hours and not days.
Is there an easy way to do this?
Thanks!
CodePudding user response:
Using tidyr's fill
:
library(tidyverse)
df %>%
group_by(Country) %>%
fill(Value)
Result:
# A tibble: 7 × 3
# Groups: Country [2]
Country Year Value
<chr> <dbl> <dbl>
1 A 2000 1
2 A 2001 1
3 A 2002 2
4 B 2000 4
5 B 2001 4
6 B 2002 4
7 B 2003 3