I have merged two dataframes, call them A and B. One has values for important variables for every year with some missing data which I will deal with separately. The second has values for only certain years (election years). This is cross national panel data, with a country-year unit of observation so it is important to do distinguish both country and year in any operation. After merging, the non-election years have NA values for the data from the second dataframe as expected. These NAs need to be filled in with the data from the previous election for that specific country, until the next election in that country. I do not want to fill in any NAs for data from dataframe A.
(For those who might have theoretical questions, the data from B is about the governing party, so it is theoretically sound to fill the data in this manner.)
If I subset the data by country, I can do this easily with the tidy::fill function by selecting only the columns containing data from B. For the full dataframe with all countries, I can't do this because in some cases it will fill in the beginning years for one country with values from the previous country in the dataframe.
Here is a minimal example of the data arrangement (bearing in mind there are actually 190 countries and 9282 observations in the real data):
country <- c("Austria","Austria","Austria","Austria","Austria",
"Belgium","Belgium","Belgium","Belgium","Belgium")
year <- c("1999","2000","2001","2002","2003",
"1999","2000","2001","2002","2003")
a1 <- c(5,4,NA,4,3,6,2,9,NA,7)
a2 <- c(45,53,57,51,33,37,12,48,55,41)
b1 <- c(NA,"A",NA,NA,NA,NA,NA,"B",NA,"C")
b2 <- c(NA,7,NA,NA,NA,NA,NA,5,NA,7)
df <- data.frame(country,year,a1,a2,b1,b2)
country | year | a1 | a2 | b1 | b2 |
---|---|---|---|---|---|
Austria | 1999 | 5 | 45 | NA | NA |
Austria | 2000 | 4 | 53 | A | 7 |
Austria | 2001 | NA | 57 | NA | NA |
Austria | 2002 | 4 | 51 | NA | NA |
Austria | 2003 | 3 | 33 | NA | NA |
Belgium | 1999 | 6 | 37 | NA | NA |
Belgium | 2000 | 2 | 12 | NA | NA |
Belgium | 2001 | 9 | 48 | B | 5 |
Belgium | 2002 | NA | 55 | NA | NA |
Belgium | 2003 | 7 | 41 | C | 7 |
Here is what I want to produce:
country | year | a1 | a2 | b1 | b2 |
---|---|---|---|---|---|
Austria | 1999 | 5 | 45 | NA | NA |
Austria | 2000 | 4 | 53 | A | 7 |
Austria | 2001 | NA | 57 | A | 7 |
Austria | 2002 | 4 | 51 | A | 7 |
Austria | 2003 | 3 | 33 | A | 7 |
Belgium | 1999 | 6 | 37 | NA | NA |
Belgium | 2000 | 2 | 12 | NA | NA |
Belgium | 2001 | 9 | 48 | B | 5 |
Belgium | 2002 | NA | 55 | B | 5 |
Belgium | 2003 | 7 | 41 | C | 7 |
Simply using tidy::fill will result, in the example, in incorrect values for Belgium for 1999 and 2000, as it will fill in the values from Austria.
CodePudding user response:
As Peace Wang suggested in the comments, you just need to group_by(country)
. You can take advantage of tidy-select
to specifically fill
the columns from df B.
library(tidyverse)
country <- c("Austria","Austria","Austria","Austria","Austria",
"Belgium","Belgium","Belgium","Belgium","Belgium")
year <- c("1999","2000","2001","2002","2003",
"1999","2000","2001","2002","2003")
a1 <- c(5,4,NA,4,3,6,2,9,NA,7)
a2 <- c(45,53,57,51,33,37,12,48,55,41)
b1 <- c(NA,"A",NA,NA,NA,NA,NA,"B",NA,"C")
b2 <- c(NA,7,NA,NA,NA,NA,NA,5,NA,7)
df <- data.frame(country,year,a1,a2,b1,b2)
df %>%
group_by(country) %>%
arrange(year) %>%
fill(starts_with("b"), .direction = "down") %>%
arrange(country)
#> # A tibble: 10 x 6
#> # Groups: country [2]
#> country year a1 a2 b1 b2
#> <chr> <chr> <dbl> <dbl> <chr> <dbl>
#> 1 Austria 1999 5 45 <NA> NA
#> 2 Austria 2000 4 53 A 7
#> 3 Austria 2001 NA 57 A 7
#> 4 Austria 2002 4 51 A 7
#> 5 Austria 2003 3 33 A 7
#> 6 Belgium 1999 6 37 <NA> NA
#> 7 Belgium 2000 2 12 <NA> NA
#> 8 Belgium 2001 9 48 B 5
#> 9 Belgium 2002 NA 55 B 5
#> 10 Belgium 2003 7 41 C 7
Created on 2021-12-26 by the reprex package (v0.3.0)
CodePudding user response:
I think the locf
(last observation carried forward) nafill method in group country
is what you want.
library(data.table)
df = setDT(df)
cols = c("b1","b2")
df[,(cols):= lapply(.SD, zoo::na.locf, na.rm = FALSE),
.SDcols = cols,
by = .(country)]
# data.table::nafill now can only process numeric columns, e.g.
# df[, b2 := nafill(b2, type = c("locf"), by = .(country)]