I have a dataframe (df1) that has ~100 columns that starts with "i10_pr". For each row, I would like to see how many times values in the columns that start with "i10_pr" matches the values in another list (df2). I'd like to create a column sum
to depict this frequency.
My desired output is something as follows:
#Desired output
visitlink visitorder i10_pr1 i10_pr2 i10_pr3 i10_pr4 i10_pr5 sum
1 7466851 3 "BW28ZZZ" "BR30Y0Z" "BR39Y0Z" "" "" 0
2 7023336 1 "0BDC8ZX" "0BDC8ZX" "07D78ZX" "" "" 0
3 2481935 3 "5A09357" "3C1ZX8Z" "06HN33Z" "B54CZZA" "0W993ZX" 1
4 4605446 1 "5A1955Z" "0BH17EZ" "03HY32Z" "02HV33Z" "GZ58ZZZ" 3
5 7287173 2 "" "" "" "" "" 0
#df2
CODE
1 GZ58ZZZ
2 3C1ZX8Z
3 0BH17EZ
4 HZ89ZZZ
5 02HV33Z
6 HZ99ZZZ
I have included a toy code below:
df1 <- structure(list(visitlink = c(7466851, 7023336, 2481935, 4605446,
7287173), visitorder = c(3L, 1L, 3L, 1L, 2L), i10_pr1 = c("BW28ZZZ",
"0BDC8ZX", "5A09357", "5A1955Z", ""), i10_pr2 = c("BR30Y0Z",
"0BDC8ZX", "3C1ZX8Z", "0BH17EZ", ""), i10_pr3 = c("BR39Y0Z",
"07D78ZX", "06HN33Z", "03HY32Z", ""), i10_pr4 = c("", "", "B54CZZA",
"02HV33Z", ""), i10_pr5 = c("", "", "0W993ZX", "GZ58ZZZ", "")), class = "data.frame", row.names = c(NA,
-5L), groups = structure(list(visitlink = c(2481935, 4605446,
7023336, 7287173, 7466851), .rows = structure(list(3L, 4L, 2L,
5L, 1L), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr",
"list"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L), .drop = TRUE))
df2 <- structure(list(CODE = structure(1:7, levels = c("GZ58ZZZ", "3C1ZX8Z",
"0BH17EZ", "HZ89ZZZ", "02HV33Z", "HZ99ZZZ", "XW03351"), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
CodePudding user response:
One option with dplyr
and stringr
could be:
df1 %>%
mutate(sum = rowSums(across(starts_with("i10_"),
~ str_detect(., paste(df2[["CODE"]], collapse = "|")))))
visitlink visitorder i10_pr1 i10_pr2 i10_pr3 i10_pr4 i10_pr5 sum
1 7466851 3 BW28ZZZ BR30Y0Z BR39Y0Z 0
2 7023336 1 0BDC8ZX 0BDC8ZX 07D78ZX 0
3 2481935 3 5A09357 3C1ZX8Z 06HN33Z B54CZZA 0W993ZX 1
4 4605446 1 5A1955Z 0BH17EZ 03HY32Z 02HV33Z GZ58ZZZ 3
5 7287173 2 0
CodePudding user response:
You can solve your problem as follow:
df1 %>%
mutate(sum = rowSums(sapply(.[grep("^i10", names(.))], `%in%`, df2$CODE)))
visitlink visitorder i10_pr1 i10_pr2 i10_pr3 i10_pr4 i10_pr5 sum
1 7466851 3 BW28ZZZ BR30Y0Z BR39Y0Z 0
2 7023336 1 0BDC8ZX 0BDC8ZX 07D78ZX 0
3 2481935 3 5A09357 3C1ZX8Z 06HN33Z B54CZZA 0W993ZX 1
4 4605446 1 5A1955Z 0BH17EZ 03HY32Z 02HV33Z GZ58ZZZ 3
5 7287173 2 0
CodePudding user response:
With base R:
df1$sum <- rowSums(matrix(unlist(df1[,-c(1:2)]) %in% df2$CODE, nrow(df1)))
df1
#> visitlink visitorder i10_pr1 i10_pr2 i10_pr3 i10_pr4 i10_pr5 sum
#> 1 7466851 3 BW28ZZZ BR30Y0Z BR39Y0Z 0
#> 2 7023336 1 0BDC8ZX 0BDC8ZX 07D78ZX 0
#> 3 2481935 3 5A09357 3C1ZX8Z 06HN33Z B54CZZA 0W993ZX 1
#> 4 4605446 1 5A1955Z 0BH17EZ 03HY32Z 02HV33Z GZ58ZZZ 3
#> 5 7287173 2 0