Home > Back-end >  R: how to calculate new value based on values from previous months (in separate rows)
R: how to calculate new value based on values from previous months (in separate rows)

Time:07-18

I am creating a hazard severity monitoring system and have a data frame containing a separate row for each month for the last 3 years for around 80 different counties. The columns contain a separate score value indicating the severity of each type of hazard in each county/month. Here is a subset of the data:

structure(list(county = c("Lafon", "Cueibet", "KapoetaEast", 
"KapoetaSouth", "Lafon", "Lafon"), year = c(2021, 2021, 2021, 
2022, 2021, 2021), month = c(8, 8, 12, 4, 6, 11), drought_score = c(4.6, 
4.6, 4.6, 4.6, 4.6, 4.6), accumulated_drought_score = c(NA, NA, 
NA, NA, NA, NA)), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

One of the columns has a severity score for that county/month for the specific hazard (e.g. "drought_score" in the example above). However, I also want to calculate an accumulated score for each month, by taking the value for that county from the previous month, weighting it (multiplying it by a scale value) and adding it to the current value. I would then like to do the same for 2 months before and 3 months before. This value would be added to the "accumulated_drought_score" column in the example above.

I am very new to R and am not sure how to do this effectively.

CodePudding user response:

One way of doing this is to add new variables to your data representing the lagged values, then creating a new accumulated weighted total variable from those lags.

We'll work with a simpler, smaller data set to get the ideas across. I've simulated "scores" for 2 countries across 14 consecutive months. In the balanced version, all countries are observed in every month. In the unbalanced version, we are missing observations for some countries in some months.

library(dplyr)
set.seed(123)

# Make balanced panel data
balanced <- data.frame(
  year = c(rep(2000, 24), rep(2001, 4)),
  country = rep(c("a","b"), 14),
  score = round(rnorm(28), 2),
  month = rep(c(1:12, 1:2), each = 2)
)

balanced %>% filter(year != 2000) %>% pull(year) %>% table()

# Make unbalanced panel data
# remove February for both years for country a
# remove December for 2000 for country b
unbalanced <- balanced %>% filter(!(country == "a" & month == 2),
                                  !(country == "b" & (month == 12 & year == 2000)))

# Confirm data missing as expected
table(unbalanced$country, unbalanced$month)

The tricky aspects of calculating moving averages are (a) not all countries are observed at all times, (b) higher values of time variables (e.g., "month") don't necessarily imply values from later in time because every new year month changes from 12 to 1, and (c) we need to be aware of which observations belong to which countries.

Our first approach to generating the 1-month lags of score might be as follows.

balanced %>%
  arrange(country, year, month) %>%
  group_by(country) %>%
  mutate(lag1 = lag(score, n = 1)) %>%
  head(17)

This code sorts the data on country then year then month, then subsets the data to each country. It then creates a variable lag1 by assigning each row index k of lag1 the value of score from row index k-1. As expected, the lag variable has a missing value NA for the January 2000 since we do not have any observations earlier than it.

The issue with this approach only becomes apparent when we move to the unbalanced panel data. Let's use the same approach but pass the unbalanced data instead.

unbalanced %>%
  arrange(country, year, month) %>%
  group_by(country) %>%
  mutate(lag1 = lag(score, n = 1)) %>%
  head(17)

If we look at the second row of the output data, we can see the issue: the value of lag1 for March 2000 is not drawn from the preceding month (February 2000) because we do not observe that country at that time. Instead, the lag for March 2000 is taken from the most recent month preceding it by at least 1 month where we have an observation for that country: January 2000. That's a problem since it means our lag1 variable actually contains values reflecting lags greater than one month. While we might sometimes want to use the most recently available lagged value or similar when we have missing data, that is not our goal here.

The most straightforward solution to this issue is to pad our unbalanced dataset with rows to produce a balanced dataset with explicitly missing values for score for the combinations of country, year, and month that we do not observe - i.e., implicitly missing values. This will ensure lag1 correctly contains missing values when the country is unobserved one month prior.

# Show the month-years missing from the data
unbalanced %>%
  expand(country, nesting(year, month)) %>%
  anti_join(unbalanced)

# Add the implicitly missing/NA rows to the data to make them explicitly missing
rebalanced <- unbalanced %>%
  complete(country, nesting(year, month)) %>%
  arrange(country, year, month)

rebalanced

Now that we have added the implicitly missing values as explicitly missing values in our data, we can calculate the three lags we wanted and get the desired accumulated_score.

# Add lagged scores for lags of 1, 2, 3 months
# then the accumulated score with arbitrary weights
wt_lag1 <- 0.5
wt_lag2 <- 0.3
wt_lag3 <- 0.1

rebalanced %>%
  mutate(
    lag1 = lag(score, n = 1),
    lag2 = lag(score, n = 2),
    lag3 = lag(score, n = 3),
    accumulated_score = score  
      wt_lag1 * lag1  
      wt_lag2 * lag2  
      wt_lag3 * lag3
  )
  • Related