Home > Software engineering >  Using two dataframes to match values with `dplyr::case_when()`
Using two dataframes to match values with `dplyr::case_when()`

Time:09-22

I'm trying to use a separate dataframe to match values to a focal dataframe. I can easily do this with a for loop but trying to do it all in a dplyr::case_when() to be consistent with the rest of my code.

Here are the two dataframes I'm using:

df_1 = data.frame(
  year = rep(c(2001:2020), 5),
  area = c(rep("8", 20), rep("9", 20), rep("10", 20), rep("11", 20), rep("12", 20))
)

df_2 = data.frame(
  val = rnorm(20, mean = 10, sd = 0.5),
  year = rep(c(2001:2020))
)

I want to make a new column in df_1 where for all areas and years less than 2002, the value is zero, for all years in areas != 12, the value is zero, and for >2002 in area 12, the value comes from df_2.

Here's my current solution:

# match up the years 
df_1_some_matched = df_1 %>% 
  dplyr::mutate(
    new = ifelse(
      area == "12" & year < 2002, 
      0,
      ifelse(
        area != 12, 
        0,
        NA
      )
    )
  )

# now to go through the rest: 
for(yr in 2002:2020) {
  
  # get the subset 
  df_1[which(df_1$area == "12" & df_1$year == yr), "new"] = 
    # find the value from the other dataset
    df_2[which(df_2$year == yr, "val"]
}

My attempt:

## attempt in case_when()
df_1_case_matched = df_1 %>% 
  dplyr::mutate(
    new = dplyr::case_when(
      area == "12" & year < 2002    ~ 0,
      area != 12                    ~ 0,
      #### THIS IS THE LINE IM MISSING
      area == 12 & year == ????????????????
    )
  )

CodePudding user response:

This can actually be accomplished by merging a dplyr::case_when() with a dplyr::left_join() and then rbind()ing the result. However, there might be a better more concise dplyr option, so I'd love to hear others' thoughts

rbind(
  df_1 %>% 
    dplyr::filter((area != "12") | (area == "12" & year < 2002)) %>% 
    dplyr::mutate(
      dplyr::case_when(
        area == "12" & year < 2002  ~ 0,
        area != 12                  ~ 0
      )
    ),
  dplyr::left_join(
    df_1 %>% 
      filter(area == "12" & year > 2002), 
    df_2,
    by = c("year")
  )
)

CodePudding user response:

I'd go with something like this:

df_1 %>% left_join(df_2, by='year') %>%
mutate(new = if_else(area == 12 & year > 2002,val, 0)) %>% select(-val)

CodePudding user response:

Easy

df_1 %>% left_join(df_2, by='year') %>% mutate(new = if_else(area == 12 & year > 2002,val, 0)) %>% select("your val")

  • Related