Calculating rate of change over rows in R in multiple large datasets

Time:12-16

I currently work with multiple large datasets that have the same number of rows but different numbers of columns. I now need to calculate the rate of change between columns and add it either to a new object or to the existing one so I can continue my analysis.

In my research on the web I mostly found people calculating the rate of change within a column, not between columns. Is the easiest way to just flip (transpose) all my data?

I am very sorry for the vague description of my problem; neither R nor English is my first language.

I hope you can still show me the direction to further my understanding of R.

Thank you in advance for any tips you might have!

CodePudding user response:

I recommend joining all the data together and then converting it into a 3NF-normalized long-format table, with one row per country, variable, and year:

library(tidyverse)

data1 <- tibble(
  country = c("A", "B", "C"),
  gdp_2020 = c(1, 8, 10),
  gdp_2021 = c(1, 8, 10),
  population_2010 = c(5e3, 6e3, 6e3),
  population_2020 = c(5.5e3, 6.8e3, 6e3)
)
data1
#> # A tibble: 3 x 5
#>   country gdp_2020 gdp_2021 population_2010 population_2020
#>   <chr>      <dbl>    <dbl>           <dbl>           <dbl>
#> 1 A              1        1            5000            5500
#> 2 B              8        8            6000            6800
#> 3 C             10       10            6000            6000

data2 <- tibble(
  country = c("A", "B", "C"),
  population_2021 = c(7e3, 8e3, 7e3),
  population_2022 = c(7e3, 7e3, 10e3)
)
data2
#> # A tibble: 3 x 3
#>   country population_2021 population_2022
#>   <chr>             <dbl>           <dbl>
#> 1 A                  7000            7000
#> 2 B                  8000            7000
#> 3 C                  7000           10000

list(
  data1,
  data2
) %>%
  reduce(full_join) %>%
  pivot_longer(matches("^(gdp|population)")) %>%
  separate(name, into = c("variable", "year"), sep = "_") %>%
  type_convert() %>%
  arrange(country, variable, year) %>%
  group_by(variable, country) %>%
  mutate(
    # NA for the first value in each group because it has no previous value to compare against
    change_rate = (value - lag(value)) / (year - lag(year))
  )
#> Joining, by = "country"
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   country = col_character(),
#>   variable = col_character(),
#>   year = col_double()
#> )
#> # A tibble: 18 x 5
#> # Groups:   variable, country [6]
#>    country variable    year value change_rate
#>    <chr>   <chr>      <dbl> <dbl>       <dbl>
#>  1 A       gdp         2020     1          NA
#>  2 A       gdp         2021     1           0
#>  3 A       population  2010  5000          NA
#>  4 A       population  2020  5500          50
#>  5 A       population  2021  7000        1500
#>  6 A       population  2022  7000           0
#>  7 B       gdp         2020     8          NA
#>  8 B       gdp         2021     8           0
#>  9 B       population  2010  6000          NA
#> 10 B       population  2020  6800          80
#> 11 B       population  2021  8000        1200
#> 12 B       population  2022  7000       -1000
#> 13 C       gdp         2020    10          NA
#> 14 C       gdp         2021    10           0
#> 15 C       population  2010  6000          NA
#> 16 C       population  2020  6000           0
#> 17 C       population  2021  7000        1000
#> 18 C       population  2022 10000        3000

Created on 2021-12-16 by the reprex package (v2.0.1)

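Example: the rate of change in the second row (GDP of country A) is 0 because the value was the same in 2020 and 2021. In the fourth row (population of country A), it is (5500 - 5000) / (2020 - 2010) = 50 per year.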
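
If reshaping is not wanted, the same rates can also be computed directly between adjacent columns. The following is only a minimal base-R sketch, assuming the data is a numeric matrix with one row per country and column names that are the years. The object name population is hypothetical, and the values are copied from the population columns in the example above:

```r
# Hypothetical wide matrix: one row per country, one column per year.
# Values are the population figures for countries A, B, C from the example above.
population <- matrix(
  c(5000, 5500, 7000,  7000,
    6000, 6800, 8000,  7000,
    6000, 6000, 7000, 10000),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("2010", "2020", "2021", "2022"))
)

# Assumption: the column names can be parsed as years.
years <- as.numeric(colnames(population))

# Difference between each column and the column before it.
value_diff <- population[, -1, drop = FALSE] -
  population[, -ncol(population), drop = FALSE]

# Divide each column of differences by the number of years elapsed.
change_rate <- sweep(value_diff, 2, diff(years), "/")
change_rate
```

Each column of change_rate is named after the later of the two years it compares, so there is one column fewer than in the input. This avoids transposing the data, but it only works when every column of the object belongs to the same variable; with mixed columns such as gdp_2020 and population_2010, the long format shown above is easier to handle.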
