Home > other >  Create categorical variable/data subset with case matching
Create categorical variable/data subset with case matching

Time:04-07

I have a dataset like this:

structure(list(year = c(2019, 2019, 2019, 2019, 2019, 2019), 
    venue = c("Z", "Z", "Z", "Z", "O", "D"), HO = c("X", "Y", 
    "X", "Y", "W", "J"), AW = c("Y", "X", "W", "T", "T", "X")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

I need to create a categorical variable that returns "yes" or "no".

structure(list(year = c(2019, 2019, 2019, 2019, 2019, 2019), 
    venue = c("Z", "Z", "Z", "Z", "O", "D"), HO = c("X", "Y", 
    "X", "Y", "W", "J"), AW = c("Y", "X", "W", "T", "T", "X"), 
    Cat = c("yes", "yes", "no", "no", "no", "no")), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))

As note, the first two rows are cases for "yes". These are the same in the attributes "year" and "venue", regardless of the order in "HO" and "AW".

Thanks.

CodePudding user response:

You can first group your dataframe by row using rowwise(), then paste the HO and AW columns together. This step is to see which rows have the same HO and AW regardless of order. Then group_by the unique combination to categorise them.

library(dplyr)
library(stringr)

df %>% 
  rowwise() %>% 
  mutate(paste_col = paste0(sort(str_split(paste0(HO, AW), "", simplify = T)), collapse = ",")) %>% 
  group_by(year, venue, paste_col) %>% 
  mutate(Cat = ifelse(n() > 1, "Yes", "No")) %>% 
  ungroup() %>% 
  select(-paste_col)

# A tibble: 6 × 5
   year venue HO    AW    Cat  
  <dbl> <chr> <chr> <chr> <chr>
1  2019 Z     X     Y     Yes  
2  2019 Z     Y     X     Yes  
3  2019 Z     X     W     No   
4  2019 Z     Y     T     No   
5  2019 O     W     T     No   
6  2019 D     J     X     No
  • Related