Home > Enterprise >  R: How to count number of times values in series of column match list
R: How to count number of times values in series of column match list

Time:08-14

I have a dataframe (df1) that has ~100 columns that starts with "i10_pr". For each row, I would like to see how many times values in the columns that start with "i10_pr" matches the values in another list (df2). I'd like to create a column sum to depict this frequency.

My desired output is something as follows:

#Desired output
  visitlink visitorder i10_pr1   i10_pr2   i10_pr3   i10_pr4   i10_pr5   sum
1   7466851          3 "BW28ZZZ" "BR30Y0Z" "BR39Y0Z" ""        ""        0
2   7023336          1 "0BDC8ZX" "0BDC8ZX" "07D78ZX" ""        ""        0
3   2481935          3 "5A09357" "3C1ZX8Z" "06HN33Z" "B54CZZA" "0W993ZX" 1
4   4605446          1 "5A1955Z" "0BH17EZ" "03HY32Z" "02HV33Z" "GZ58ZZZ" 3
5   7287173          2 ""        ""        ""        ""        ""        0

#df2
     CODE
1 GZ58ZZZ
2 3C1ZX8Z
3 0BH17EZ
4 HZ89ZZZ
5 02HV33Z
6 HZ99ZZZ

I have included a toy code below:

df1 <- structure(list(visitlink = c(7466851, 7023336, 2481935, 4605446, 
7287173), visitorder = c(3L, 1L, 3L, 1L, 2L), i10_pr1 = c("BW28ZZZ", 
"0BDC8ZX", "5A09357", "5A1955Z", ""), i10_pr2 = c("BR30Y0Z", 
"0BDC8ZX", "3C1ZX8Z", "0BH17EZ", ""), i10_pr3 = c("BR39Y0Z", 
"07D78ZX", "06HN33Z", "03HY32Z", ""), i10_pr4 = c("", "", "B54CZZA", 
"02HV33Z", ""), i10_pr5 = c("", "", "0W993ZX", "GZ58ZZZ", "")), class = "data.frame", row.names = c(NA, 
-5L), groups = structure(list(visitlink = c(2481935, 4605446, 
7023336, 7287173, 7466851), .rows = structure(list(3L, 4L, 2L, 
    5L, 1L), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", 
"list"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-5L), .drop = TRUE))

df2 <- structure(list(CODE = structure(1:7, levels = c("GZ58ZZZ", "3C1ZX8Z", 
"0BH17EZ", "HZ89ZZZ", "02HV33Z", "HZ99ZZZ", "XW03351"), class = "factor")), class = "data.frame", row.names = c(NA, 
-7L))

CodePudding user response:

One option with dplyr and stringr could be:

df1 %>%
 mutate(sum = rowSums(across(starts_with("i10_"),
                             ~ str_detect(., paste(df2[["CODE"]], collapse = "|")))))

  visitlink visitorder i10_pr1 i10_pr2 i10_pr3 i10_pr4 i10_pr5 sum
1   7466851          3 BW28ZZZ BR30Y0Z BR39Y0Z                   0
2   7023336          1 0BDC8ZX 0BDC8ZX 07D78ZX                   0
3   2481935          3 5A09357 3C1ZX8Z 06HN33Z B54CZZA 0W993ZX   1
4   4605446          1 5A1955Z 0BH17EZ 03HY32Z 02HV33Z GZ58ZZZ   3
5   7287173          2                                           0

CodePudding user response:

You can solve your problem as follow:

df1 %>%
  mutate(sum = rowSums(sapply(.[grep("^i10", names(.))], `%in%`, df2$CODE)))

  visitlink visitorder i10_pr1 i10_pr2 i10_pr3 i10_pr4 i10_pr5 sum
1   7466851          3 BW28ZZZ BR30Y0Z BR39Y0Z                   0
2   7023336          1 0BDC8ZX 0BDC8ZX 07D78ZX                   0
3   2481935          3 5A09357 3C1ZX8Z 06HN33Z B54CZZA 0W993ZX   1
4   4605446          1 5A1955Z 0BH17EZ 03HY32Z 02HV33Z GZ58ZZZ   3
5   7287173          2                                           0

CodePudding user response:

With base R:

df1$sum <- rowSums(matrix(unlist(df1[,-c(1:2)]) %in% df2$CODE, nrow(df1)))
df1
#>   visitlink visitorder i10_pr1 i10_pr2 i10_pr3 i10_pr4 i10_pr5 sum
#> 1   7466851          3 BW28ZZZ BR30Y0Z BR39Y0Z                   0
#> 2   7023336          1 0BDC8ZX 0BDC8ZX 07D78ZX                   0
#> 3   2481935          3 5A09357 3C1ZX8Z 06HN33Z B54CZZA 0W993ZX   1
#> 4   4605446          1 5A1955Z 0BH17EZ 03HY32Z 02HV33Z GZ58ZZZ   3
#> 5   7287173          2                                           0
  • Related