Home > Software engineering >  Linear interpolation of Panel Data in R over multiple columns
Linear interpolation of Panel Data in R over multiple columns

Time:06-02

I have a dataset with population numbers by country, disaggregated by gender and age brackets. The data is recorded in 5 year increments. I would like to interpolate the data linearly to have annual data.

My data looks like this:

data

I have figured out how to do the interpolation for one column. This is the code I used:

n <- 1

j <- c()

while(n < nrow(pop)){
  a <- as.data.frame(approx(x = pop$`5year`[n:(n 4)], y = pop$`Male_0-4`[n:(n 4)], xout = 2000:2020))
  j <- as.matrix(c(j,a$y))
  n <- n 5
}
  

Here, j produces one long vector that would represent the relevant column in the new, interpolated dataset. I would therefore like to produce one such vector for each age-gender combo (i.e. column in the original dataset) and bind them together.

However, I cannot figure out how to do that for all my columns at once. Neither my attempts at functions, for loops or apply family members has worked. I'm thankful for any input.

CodePudding user response:

I'd write a function to interpolate an arbitrary variable within country group, and then map() it over all the variables and join them back together.

library(tidyverse)
data("population")

# create some data to interpolate
population_5 <- population %>% 
  filter(year %% 5 == 0) %>% 
  mutate(female_pop = population / 2,
         male_pop = population / 2)

interpolate_func <- function(variable, data) {
  data %>% 
    group_by(country) %>% 
    # can't interpolate if only one year
    filter(n() >= 2) %>% 
    group_modify(~as_tibble(approx(.x$year, .x[[variable]], 
                                   xout = min(.x$year):max(.x$year)))) %>% 
    set_names("country", "year", paste0(variable, "_interpolated")) %>% 
    ungroup()
}

vars_to_interpolate <- names(select(population_5, -country, -year))

map(vars_to_interpolate, interpolate_func, 
    data = population_5) %>% 
  reduce(full_join, by = c("country", "year"))

#> # A tibble: 3,395 × 5
#>    country      year population_interpolated female_pop_interp… male_pop_interp…
#>    <chr>       <int>                   <dbl>              <dbl>            <dbl>
#>  1 Afghanistan  1995               17586073            8793036.         8793036.
#>  2 Afghanistan  1996               18187930.           9093965.         9093965.
#>  3 Afghanistan  1997               18789788.           9394894.         9394894.
#>  4 Afghanistan  1998               19391645.           9695823.         9695823.
#>  5 Afghanistan  1999               19993503.           9996751.         9996751.
#>  6 Afghanistan  2000               20595360           10297680         10297680 
#>  7 Afghanistan  2001               21448459           10724230.        10724230.
#>  8 Afghanistan  2002               22301558           11150779         11150779 
#>  9 Afghanistan  2003               23154657           11577328.        11577328.
#> 10 Afghanistan  2004               24007756           12003878         12003878 
#> # … with 3,385 more rows

Created on 2022-06-01 by the reprex package (v2.0.1)

  • Related