Home > Blockchain >  R and dplyr: case_when throws 'incorrect length error' despite not being asked to evaluate
R and dplyr: case_when throws 'incorrect length error' despite not being asked to evaluate

Time:08-10

I have a panel dataset where some groups have observations starting at an earlier year than others and would like to calculate the change in value from the earliest possible time period. I expected that by using case_when within mutate, R would not try to evaluate the code for groups where the earlier dates do not exist, but this does not seem to be the case. I have included a reprex below.

library("dplyr")

dataset <- data.frame(names=c("a","a","a","b","b"),
                      values=c(2,3,4,2,3),
                      dates=c("2010","2011","2012","2011","2012"))

dataset_calc <- dataset %>% 
  group_by(names) %>% 
  mutate(new_val = case_when(names=="a" ~ values-values[dates=="2010"],
                             TRUE ~ values-values[dates=="2011"]))

Is there a better solution for what I would like to do? The resulting dataframe should be something like:

  names values dates new_val
1     a      2  2010       0
2     a      3  2011       1
3     a      4  2012       2
4     b      2  2011       0
5     b      3  2012       1

CodePudding user response:

If you arrage the data by group, then you can just subtract off the first value for each group

dataset %>% 
  group_by(names) %>% 
  arrange(dates) %>% 
  mutate(new_val = values - first(values))

If you wanted to hard code different reference years, you would want to use the case_when part over the year rather than the values. For example

dataset %>% 
  group_by(names) %>% 
  mutate(
    ref_year = case_when(names=="a" ~ "2010", TRUE~"2011"),
    new_val = values - values[dates==ref_year],
    ref_year = NULL
)

(you don't need to use the temporary ref_year variable, I just added it here for clarity of how the function was working)

  • Related