Home > front end >  Find updated values by several groups in R
Find updated values by several groups in R

Time:10-07

I have a dataset that's columns are like this:

Forecaster CountryName ProductName X2022 X2023 file_month
Allianz    USA          Inflation   6.1   2.5   7
EIU        USA          Inflation   7.7   3.7   7
Oxford     USA          Inflation   6.5   2.9   7
Allianz    USA          Inflation   NA    3.3   8
EIU        USA          Inflation   8.3   3.7   8

Data is in long format, file_month shows the month that the data was made. Now i need to create new column that refers to updated values by Forecaster, CountryName, ProductName and either X2022 or X2023. So in this new column it should be checked if Forecaster revised it's forecast compared to one month before and if so, the new revised value should be written in the column.

Also, as there is more than one ProductName, CountryName and Forecaster in whole dataset, I would like to know how could i make this work for every of these groups.

New data would need to look something like this

Forecaster CountryName ProductName X2022 X2023 file_month update_2022 update_2023
Allianz    USA          Inflation   6.1   2.5   7          
EIU        USA          Inflation   7.7   3.7   7          
Oxford     USA          Inflation   6.5   2.9   7            
Allianz    USA          Inflation   NA    3.3   8            NA           3.3
EIU        USA          Inflation   8.3   3.7   8            8.3          NA

For code reproducability i add dput() output.

data <- structure(list(Forecaster = c("Moody's Analytics", "Oxford Economics", 
                                "Julius Baer", "Fitch Solutions", "SEB", "Allianz", "EIU", "FrontierView", 
                                "Euromonitor Int.", "Fitch Ratings", "Scope Ratings", "Allianz", 
                                "EIU", "Euromonitor Int.", "Fitch Ratings", "Fitch Solutions", 
                                "FrontierView", "Julius Baer", "Moody's Analytics", "Oxford Economics", 
                                "Scope Ratings", "SEB", "Allianz", "EIU", "Euromonitor Int.", 
                                "Fitch Ratings", "Fitch Solutions", "FrontierView", "Julius Baer", 
                                "Moody's Analytics", "Oxford Economics", "Scope Ratings", "SEB", 
                                "Allianz", "EIU", "Euromonitor Int.", "Fitch Ratings", "Fitch Solutions", 
                                "FrontierView", "Julius Baer", "Moody's Analytics", "Oxford Economics", 
                                "Scope Ratings", "SEB", "Allianz", "EIU", "Euromonitor Int.", 
                                "Fitch Ratings", "Fitch Solutions", "FrontierView", "Julius Baer", 
                                "Moody's Analytics", "Oxford Economics", "Scope Ratings"), 
CountryName = c("USA", 
"USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", 
"USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", 
"USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", 
"USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", 
"USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", 
"USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"), 
ProductName = c("Inflation", 
"Inflation", "Inflation", "Inflation", "Inflation", "Inflation", 
"Inflation", "Inflation", "Inflation", "Inflation", "Inflation", 
"Inflation", "Inflation", "Inflation", "Inflation", "Inflation", 
"Inflation", "Inflation", "Inflation", "Inflation", "Inflation", 
"Inflation", "Inflation", "Inflation", "Inflation", "Inflation", 
"Inflation", "Inflation", "Inflation", "Inflation", "Inflation", 
"Inflation", "Inflation", "Inflation", "Inflation", "Inflation", 
"Inflation", "Inflation", "Inflation", "Inflation", "Inflation", 
"Inflation", "Inflation", "Inflation", "Inflation", "Inflation", 
"Inflation", "Inflation", "Inflation", "Inflation", "Inflation", 
"Inflation", "Inflation", "Inflation"), 
X2022 = c(6.58180388987726, 
          7.39615, 5.71530516215215, 6.5, NA, 6.1, 7.2, 6.2, 7.31, 6.9, 
          NA, 6.1, 7.7, 7.49, 6.9, 6.5, 6.2, 7.01535416655227, 6.94177167480419, 
          7.40344, NA, 7.5, 6.1, 7.7, 7.49, 6.9, 6.5, 6.2, 7.01535416655227, 
          6.94177167480419, 7.40344, NA, 7.5, NA, 8.3, 7.99, 7.8313, 7.3, 
          8.4, 7.45242025035952, 7.51681072158563, 8.04052, NA, 7.5, 8.2, 
          8.1, 8.01, 7.8313, 7.7, 8.6, 7.94553624147252, 7.82785941572661, 
          8.11193, NA), 
X2023 = c(2.62716826359282, 1.73992, 2.32590118928029, 
        2.5, NA, 2.5, 2.8, 3.3, 4.02, 2.7, NA, 2.5, 3.7, 4.01, 2.7, 2.5, 
        3.3, 2.96909559451206, 2.70488419782626, 2.01984, NA, 4.2, 2.5, 
        3.7, 4.01, 2.7, 2.5, 3.3, 2.96909559451206, 2.70488419782626, 
        2.01984, NA, 4.2, NA, 3.7, 4.21, 3.7199, 3.5, 4.9, 3.02315479834998, 
        3.10232274684858, 3.17401, NA, 4.2, 3.2, 3.8, 4, 3.7199, 3.7, 
        5.7, 3.48906158848119, 3.36435085973137, 3.27724, NA), 
file_month = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 
   6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
   8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 
   9L, 9L, 9L, 9L, 9L)), class = "data.frame", row.names = c(NA,-54L))

CodePudding user response:

Here are some notes about how things work here:

  • First we group the data set on Forecaster variable. For this purpose we use dplyr::group_split function that creates a list where each elements is your data set with a unique Forecaster value.
  • Second we use purrr::map_dfr to apply a function on each elements of the previous list. In a way that we create a custom function and apply it on each data frame. dfr suffice implies that in the end the map binds all the modified data frames together by row.
  • In this case we could consider a rolling function. There are various packages that you could use for this purpose such as zoo, runner, slider and etc. I used slider because it works well with tidyverse functions.
  • I used slider::slide_dbl function where it takes a given function and apply it on a sliding window of the size that we choose. Here we would like to compare each value corresponding to a month to its previous value so the size of our sliding window is 2 as specified by .before argument.
  • Then I wrote a custom function which is in fact a series of if, else clause to apply what you have in mind.
  • I then set .complete argument in slider_dbl to TRUE, so that the function is only applied on a complete window. It means for the first month that appears in the data set there is no output value cause there is no previous month.
  • In the end I would like to apply the entire slider_dbl function on 2 columns of each data frame so I used it inside dplyer::across.
library(tidyverse)
library(slider)

data %>%
  group_split(Forecaster) %>%
  map_dfr(~ .x %>% 
            mutate(across(c(X2022, X2023), ~ slide_dbl(., .f = ~ if(is.na(.x[2])) {
              .x[2]
            } else if(is.na(.x[1]) & !is.na(.x[2])){
              .x[2]
            } else if((.x[1] - .x[2]) != 0) {
              .x[2]
            } else {
              NA_real_
            }, .before = 1, .complete = TRUE), .names = 'Updated_{.col}')))


# A tibble: 54 × 8
   Forecaster CountryName ProductName X2022 X2023 file_month Updated_X2022 Updated_X2023
   <chr>      <chr>       <chr>       <dbl> <dbl>      <int>         <dbl>         <dbl>
 1 Allianz    USA         Inflation     6.1   2.5          5          NA            NA  
 2 Allianz    USA         Inflation     6.1   2.5          6          NA            NA  
 3 Allianz    USA         Inflation     6.1   2.5          7          NA            NA  
 4 Allianz    USA         Inflation    NA    NA            8          NA            NA  
 5 Allianz    USA         Inflation     8.2   3.2          9           8.2           3.2
 6 EIU        USA         Inflation     7.2   2.8          5          NA            NA  
 7 EIU        USA         Inflation     7.7   3.7          6           7.7           3.7
 8 EIU        USA         Inflation     7.7   3.7          7          NA            NA  
 9 EIU        USA         Inflation     8.3   3.7          8           8.3          NA  
10 EIU        USA         Inflation     8.1   3.8          9           8.1           3.8
# … with 44 more rows

I suggest you run this code line by line to check the output and get a sense of what is going on here.

  • Related