Replacing NAs with prior year value for specific country in panel data-CodePudding

I have merged two dataframes, call them A and B. One has values for important variables for every year with some missing data which I will deal with separately. The second has values for only certain years (election years). This is cross national panel data, with a country-year unit of observation so it is important to do distinguish both country and year in any operation. After merging, the non-election years have NA values for the data from the second dataframe as expected. These NAs need to be filled in with the data from the previous election for that specific country, until the next election in that country. I do not want to fill in any NAs for data from dataframe A.

(For those who might have theoretical questions, the data from B is about the governing party, so it is theoretically sound to fill the data in this manner.)

If I subset the data by country, I can do this easily with the tidy::fill function by selecting only the columns containing data from B. For the full dataframe with all countries, I can't do this because in some cases it will fill in the beginning years for one country with values from the previous country in the dataframe.

Here is a minimal example of the data arrangement (bearing in mind there are actually 190 countries and 9282 observations in the real data):

country <- c("Austria","Austria","Austria","Austria","Austria",
         "Belgium","Belgium","Belgium","Belgium","Belgium")
year <- c("1999","2000","2001","2002","2003",
      "1999","2000","2001","2002","2003")
a1 <- c(5,4,NA,4,3,6,2,9,NA,7)
a2 <- c(45,53,57,51,33,37,12,48,55,41)
b1 <- c(NA,"A",NA,NA,NA,NA,NA,"B",NA,"C")
b2 <- c(NA,7,NA,NA,NA,NA,NA,5,NA,7)
df <- data.frame(country,year,a1,a2,b1,b2)

country	year	a1	a2	b1	b2
Austria	1999	5	45	NA	NA
Austria	2000	4	53	A	7
Austria	2001	NA	57	NA	NA
Austria	2002	4	51	NA	NA
Austria	2003	3	33	NA	NA
Belgium	1999	6	37	NA	NA
Belgium	2000	2	12	NA	NA
Belgium	2001	9	48	B	5
Belgium	2002	NA	55	NA	NA
Belgium	2003	7	41	C	7

Here is what I want to produce:

country	year	a1	a2	b1	b2
Austria	1999	5	45	NA	NA
Austria	2000	4	53	A	7
Austria	2001	NA	57	A	7
Austria	2002	4	51	A	7
Austria	2003	3	33	A	7
Belgium	1999	6	37	NA	NA
Belgium	2000	2	12	NA	NA
Belgium	2001	9	48	B	5
Belgium	2002	NA	55	B	5
Belgium	2003	7	41	C	7

Simply using tidy::fill will result, in the example, in incorrect values for Belgium for 1999 and 2000, as it will fill in the values from Austria.

CodePudding user response：

As Peace Wang suggested in the comments, you just need to group_by(country). You can take advantage of tidy-select to specifically fill the columns from df B.

library(tidyverse)

country <- c("Austria","Austria","Austria","Austria","Austria",
             "Belgium","Belgium","Belgium","Belgium","Belgium")
year <- c("1999","2000","2001","2002","2003",
          "1999","2000","2001","2002","2003")
a1 <- c(5,4,NA,4,3,6,2,9,NA,7)
a2 <- c(45,53,57,51,33,37,12,48,55,41)
b1 <- c(NA,"A",NA,NA,NA,NA,NA,"B",NA,"C")
b2 <- c(NA,7,NA,NA,NA,NA,NA,5,NA,7)
df <- data.frame(country,year,a1,a2,b1,b2)

df %>% 
  group_by(country) %>% 
  arrange(year) %>% 
  fill(starts_with("b"), .direction = "down") %>% 
  arrange(country)
#> # A tibble: 10 x 6
#> # Groups:   country [2]
#>    country year     a1    a2 b1       b2
#>    <chr>   <chr> <dbl> <dbl> <chr> <dbl>
#>  1 Austria 1999      5    45 <NA>     NA
#>  2 Austria 2000      4    53 A         7
#>  3 Austria 2001     NA    57 A         7
#>  4 Austria 2002      4    51 A         7
#>  5 Austria 2003      3    33 A         7
#>  6 Belgium 1999      6    37 <NA>     NA
#>  7 Belgium 2000      2    12 <NA>     NA
#>  8 Belgium 2001      9    48 B         5
#>  9 Belgium 2002     NA    55 B         5
#> 10 Belgium 2003      7    41 C         7

^{Created on 2021-12-26 by the reprex package (v0.3.0)}

CodePudding user response：

I think the locf (last observation carried forward) nafill method in group country is what you want.

library(data.table)
df = setDT(df)

cols = c("b1","b2")
df[,(cols):= lapply(.SD, zoo::na.locf, na.rm = FALSE),
    .SDcols = cols, 
    by = .(country)]

# data.table::nafill now can only process numeric columns, e.g. 
# df[, b2 := nafill(b2, type = c("locf"), by = .(country)]