I have two country data sets, the first one goes from 2012 to 2020, and looks like this:
> head(y)
Country 2012 2013 2014 2015 2016 2017 2018 2019 2020
1 Afghanistan 0 0.0 0 0 0.00 4.00 7.00 22.00 26.00
2 Albania 0 0.0 0 35 80.20 85.30 85.50 95.00 98.38
3 Algeria 0 0.0 0 0 3.62 30.49 52.84 53.63 76.18
4 Andorra 50 50.0 50 50 50.00 85.00 85.00 85.00 85.00
5 Angola 7 7.0 7 7 8.00 8.00 8.00 18.00 30.00
6 Antigua and Barbuda 65 78.6 80 98 99.00 99.00 99.00 99.00 99.00
and the second goes from 2012 to 2030, and looks like this:
> head(x)
CountryName CountryCode CountryCodeID ProductID Y2012 Y2013 Y2014 Y2015 Y2016 Y2017 Y2018 Y2019 Y2020 Y2021 Y2022 Y2023 Y2024 Y2025 Y2026 Y2027 Y2028 Y2029 Y2030
1 Antigua AA 157 178271 98.000000 98.00 98.00 98.35281 99.22350 99.30797 99.38018 99.44286 99.49796 99.54694 99.59088 99.63063 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000
2 Albania AB 158 178271 99.000000 99.00 99.00 99.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000
3 Angola AG 161 178271 47.000000 53.00 46.00 60.00000 70.12252 78.48148 84.95833 89.19728 92.40534 95.10876 96.64146 97.54840 98.10920 98.46371 99.36144 99.42651 99.48353 99.53406 99.57929
4 Azerbaijan AJ 163 178271 96.500000 96.90 97.10 97.30000 98.23938 98.78344 99.10162 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000
5 Algeria AL 164 178271 1.142043 6.00 23.23 46.00000 56.33254 66.41615 75.59252 82.67079 88.15954 91.56166 94.41373 96.92251 98.01697 98.65143 99.02451 100.00000 100.00000 100.00000 100.00000
6 Armenia AM 165 178271 96.300000 97.68 98.80 99.90000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000 100.00000
I want to run a linear regression like lm(y ~ x)
for each country, for dates between 2012 and 2020, and then predict y
for 2021 up to 2030 based on x
of the same period.
Can anyone help me achieve this? Since I have 139 countries (rows), it fastidious to extract each country from both data sets and combine them in one data frame and then run the regression and the prediction, something more efficient would help a long way.
CodePudding user response:
Firstly, it's better to convert your data into tidy format so that there's one row for each year's observation. I'll demonstrate it on a simulated data:
library(dplyr)
library(tidyr)
N_COUNTRIES <- 3
YEARS <- 2012:2020
N_YEARS <- length(YEARS)
# simulate x data
countries <- LETTERS[1:N_COUNTRIES]
mat_x <- matrix(runif(N_COUNTRIES*N_YEARS, 0, 100), nrow = N_COUNTRIES)
colnames(mat_x) <- YEARS
df_x <- bind_cols(country = countries, mat_x)
# simulate y data
mat_y <- matrix(runif(N_COUNTRIES*N_YEARS, 0, 50), nrow = N_COUNTRIES)
colnames(mat_y) <- paste0("Y", YEARS)
df_y <- bind_cols(
country = countries,
some_other_column = sample(1:100, N_COUNTRIES), mat_y
)
# convert df_x to long format
df_x_long <-
df_x %>%
pivot_longer(-country, names_to = "year", values_to = "x") %>%
mutate(across(year, as.numeric))
head(df_x_long)
#> # A tibble: 6 × 3
#> country year x
#> <chr> <dbl> <dbl>
#> 1 A 2012 67.6
#> 2 A 2013 71.9
#> 3 A 2014 48.7
#> 4 A 2015 20.0
#> 5 A 2016 13.3
#> 6 A 2017 62.4
# convert df_y to long format
df_y_long <-
df_y %>%
select(country, starts_with("Y")) %>%
pivot_longer(-country, names_to = "year", values_to = "y") %>%
mutate(year = as.numeric(gsub("^Y", "", year)))
Now it's easy to merge x and y observations together.
# join
df_xy_long <- inner_join(df_x_long, df_y_long, by = c("country", "year"))
head(df_xy_long)
#> # A tibble: 6 × 4
#> country year x y
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 2012 67.6 13.6
#> 2 A 2013 71.9 19.5
#> 3 A 2014 48.7 35.0
#> 4 A 2015 20.0 31.9
#> 5 A 2016 13.3 16.7
#> 6 A 2017 62.4 47.4
Having this we can use tidyr::nest()
to kind of put observations for each country into separate boxes to then run lm
on each of them.
# nest and model
df_xy_nested <-
df_xy_long %>%
group_by(country) %>%
nest() %>%
rowwise() %>%
mutate(model = list(lm(y ~ x, data = data))) %>%
ungroup()
head(df_xy_nested)
#> # A tibble: 3 × 3
#> country data model
#> <chr> <list> <list>
#> 1 A <tibble [9 × 3]> <lm>
#> 2 B <tibble [9 × 3]> <lm>
#> 3 C <tibble [9 × 3]> <lm>
Each element in the data
column is a dataframe itself and each element in the model
column is an lm
model.
df_xy_nested$data[[1]]
#> # A tibble: 9 × 3
#> year x y
#> <dbl> <dbl> <dbl>
#> 1 2012 67.6 13.6
#> 2 2013 71.9 19.5
#> 3 2014 48.7 35.0
#> 4 2015 20.0 31.9
#> 5 2016 13.3 16.7
#> 6 2017 62.4 47.4
#> 7 2018 38.1 40.9
#> 8 2019 96.1 36.6
#> 9 2020 24.8 2.57
df_xy_nested$model[[1]]
#>
#> Call:
#> lm(formula = y ~ x, data = data)
#>
#> Coefficients:
#> (Intercept) x
#> 19.3173 0.1587
You might want to extract model coefficients like this:
df_xy_nested %>%
rowwise() %>%
mutate(
coef_intercept = model$coefficients[[1]],
coef_x = model$coefficients[[2]]
)
#> # A tibble: 3 × 5
#> # Rowwise:
#> country data model coef_intercept coef_x
#> <chr> <list> <list> <dbl> <dbl>
#> 1 A <tibble [9 × 3]> <lm> 19.3 0.159
#> 2 B <tibble [9 × 3]> <lm> 29.9 -0.161
#> 3 C <tibble [9 × 3]> <lm> 18.3 0.0427
Created on 2022-08-03 by the reprex package (v2.0.1)
For some more details about modeling on nested data I'd recommend you to read:
a paragraph on rowwise() documentation (very short)
a chapter in R4DS (not short at all)