I am trying to find the identical strings shared across as many columns and column combinations as possible. For instance, I have data like this:
df<-structure(list(first = c("SNTM1", "STTTT2", "STOLA", "STOMQ",
"STR2", "SUPTY1", "TBNHSG", "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3",
"TNIL", "TREUK", "TTRK", "TRRFK", "UBA52", "YIPF1", NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA), second = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, "SNTLK", "STTTFSG", "STOIU", "STOMQ", "STR25",
"SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1", NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA), second2 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, "SNTLKM", "STTTFSGTT", "GFD", "STOMQ",
"TRS", "BRsts", "TMHS", "RSEST", "TRSF", "YIPF1")), class = "data.frame", row.names = c(NA,
-37L))
It has 3 columns. I want to find which strings are shared between columns 1 and 2, columns 2 and 3, columns 1 and 3, and then all three columns together. So the answer should look like this:
C1-C2   C2-C3   C1-C3   C1-C2-C3
STOMQ   STOMQ   STOMQ   STOMQ
TMEIL1  YIPF1   YIPF1   YIPF1
YIPF1
This means that C1 (column 1) and C2 (column 2) share only the following identical strings:
STOMQ
TMEIL1
YIPF1
The same applies to the other column combinations.
CodePudding user response:
a <- combn(unname(df),2, do.call, what=intersect, simplify=FALSE)
a
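The list a contains the intersections of columns 1 and 2, 1 and 3, and 2 and 3, in that order. To add the intersection of all three columns, intersect two of the pairwise results (both involve column 1, so the result is what columns 1, 2 and 3 have in common) and append it to the list. NA shows up in every result because all three columns contain NA.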
c(a, list(intersect(a[[1]],a[[2]])))
[[1]]
[1] "STOMQ" "TMEIL1" "YIPF1" NA
[[2]]
[1] "STOMQ" "YIPF1" NA
[[3]]
[1] NA "STOMQ" "YIPF1"
[[4]]
[1] "STOMQ" "YIPF1" NA
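The NA entries above can be avoided by dropping NA from each column before intersecting. The following is a minimal base R sketch, assuming the data from the question (rewritten compactly here with rep(NA, n) for the padding). The names cols, combos and shared are made up for illustration and are not part of the answer above. It also labels each result and covers every combination size, not just pairs.

```r
# The question's data, written compactly (same values, same NA padding).
df <- data.frame(
  first = c("SNTM1", "STTTT2", "STOLA", "STOMQ", "STR2", "SUPTY1", "TBNHSG",
            "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3", "TNIL", "TREUK", "TTRK",
            "TRRFK", "UBA52", "YIPF1", rep(NA, 20)),
  second = c(rep(NA, 17), "SNTLK", "STTTFSG", "STOIU", "STOMQ", "STR25",
             "SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1", rep(NA, 10)),
  second2 = c(rep(NA, 27), "SNTLKM", "STTTFSGTT", "GFD", "STOMQ", "TRS",
              "BRsts", "TMHS", "RSEST", "TRSF", "YIPF1")
)

# Drop NA per column so it cannot be reported as a shared value.
cols <- lapply(df, function(x) x[!is.na(x)])

# Every subset of column names of size 2, 3, ..., number of columns.
combos <- unlist(
  lapply(2:length(cols), function(m) combn(names(cols), m, simplify = FALSE)),
  recursive = FALSE
)

# Intersect the columns in each subset; Reduce() handles any subset size.
shared <- lapply(combos, function(nm) Reduce(intersect, cols[nm]))
names(shared) <- vapply(combos, paste, character(1), collapse = "-")

shared
# $`first-second`          "STOMQ" "TMEIL1" "YIPF1"
# $`first-second2`         "STOMQ" "YIPF1"
# $`second-second2`        "STOMQ" "YIPF1"
# $`first-second-second2`  "STOMQ" "YIPF1"
```

Because the subsets are generated from 2:length(cols), the same code should work unchanged for a data frame with more than three columns, at the cost of the number of combinations growing quickly.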
CodePudding user response:
You can use accumulate() from the purrr package together with intersect() from base R to accomplish this. Something like:
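library(purrr)
# First remove NA values so they don't show up in the intersect() results.
# Note that this replaces the data frame df with a list of character vectors.
df <- map(df, ~ discard(.x, is.na))
# Running intersection: first, then first & second, then all three columns
accumulate(df, ~ base::intersect(.x, .y))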
# output
List of 3
$first
"SNTM1" "STTTT2" "STOLA" "STOMQ" "STR2" "SUPTY1"
"TBNHSG" "TEYAH" "TMEIL1" "TMEIL2"
"TMEIL3" "TNIL" "TREUK" "TTRK" "TRRFK" "UBA52" "YIPF1"
$second
"STOMQ" "TMEIL1" "YIPF1"
$second2
"STOMQ" "YIPF1"
$second is the result of taking the intersection of the first and second columns and corresponds to column C1-C2 in your example above. $second2 is the result of taking the intersection of this result and second2, which corresponds to C1-C2-C3 above. The pairwise results C2-C3 and C1-C3 are not produced by this call and would have to be computed separately.
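Neither approach returns the side-by-side layout shown in the question, because the intersections have different lengths. A minimal sketch of one way to get there, assuming the intersections are already available as a named list (the list shared below is a hypothetical stand-in filled with the values the question expects): pad each vector with NA to the length of the longest one, then bind them as columns.

```r
# Hypothetical input: the four intersections as a named list,
# using the values expected in the question.
shared <- list(
  "C1-C2"    = c("STOMQ", "TMEIL1", "YIPF1"),
  "C2-C3"    = c("STOMQ", "YIPF1"),
  "C1-C3"    = c("STOMQ", "YIPF1"),
  "C1-C2-C3" = c("STOMQ", "YIPF1")
)

# Length of the longest intersection.
n <- max(lengths(shared))

# `length<-` extends a vector to length n and fills the new slots with NA;
# sapply() then binds the equal-length vectors into a character matrix
# with one column per combination.
tab <- sapply(shared, `length<-`, n)

tab
```

The result is a character matrix rather than a data frame, which keeps column names such as C1-C2 intact. The NA cells are only padding and do not mean that NA is a shared value.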