How to find identical strings across various columns of a data frame

Time:05-19

I am trying to find the strings that are shared across as many column combinations as possible. For instance, I have data like this:

df<-structure(list(first = c("SNTM1", "STTTT2", "STOLA", "STOMQ", 
"STR2", "SUPTY1", "TBNHSG", "TEYAH", "TMEIL1", "TMEIL2", "TMEIL3", 
"TNIL", "TREUK", "TTRK", "TRRFK", "UBA52", "YIPF1", NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA), second = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, "SNTLK", "STTTFSG", "STOIU", "STOMQ", "STR25", 
"SUPYHGS", "TBHYDG", "TEHDYG", "TMEIL1", "YIPF1", NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA), second2 = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, "SNTLKM", "STTTFSGTT", "GFD", "STOMQ", 
"TRS", "BRsts", "TMHS", "RSEST", "TRSF", "YIPF1")), class = "data.frame", row.names = c(NA, 
-37L))

It has 3 columns. I want to find which strings are shared between columns 1 and 2, then 2 and 3, then 1 and 3, and finally across all three columns together. So the answer looks like this:

C1-C2   C2-C3   C1-C3   C1-C2-C3
STOMQ   STOMQ   STOMQ   STOMQ
TMEIL1  YIPF1   YIPF1   YIPF1
YIPF1

This means that C1 (column 1) and C2 (column 2) share only the following identical strings:

STOMQ
TMEIL1
YIPF1

The same goes for the other column combinations.

CodePudding user response:

a <- combn(unname(df),2, do.call, what=intersect, simplify=FALSE)

The list a now contains the pairwise intersections of columns 1 and 2, 1 and 3, and 2 and 3, in that order. To append the intersection of all three columns, intersect the first two results (which between them cover columns 1, 2 and 3) and add that to the list:

c(a, list(intersect(a[[1]],a[[2]])))


[[1]]
[1] "STOMQ"  "TMEIL1" "YIPF1"  NA      

[[2]]
[1] "STOMQ" "YIPF1" NA     

[[3]]
[1] NA      "STOMQ" "YIPF1"

[[4]]
[1] "STOMQ" "YIPF1" NA     
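The NA in each result above appears because NA occurs in every column and so counts as a shared value. A minimal base R sketch that drops NA before intersecting and labels each result might look like the following. The small df is a cut-down copy of the question's data (a few values per column), and the names cols, pairs and res are only illustrative.

```r
# Cut-down copy of the question's data: a few values per column, padded with NA
df <- data.frame(
  first   = c("STOLA", "STOMQ", "TMEIL1", "YIPF1", NA, NA),
  second  = c(NA, "STOIU", "STOMQ", "TMEIL1", "YIPF1", NA),
  second2 = c(NA, NA, "GFD", "STOMQ", "TRS", "YIPF1"),
  stringsAsFactors = FALSE
)

# Drop NA from each column so it is not reported as a shared value
cols <- lapply(df, function(x) unique(x[!is.na(x)]))

# Index pairs in the order 1-2, 1-3, 2-3
pairs <- combn(seq_along(cols), 2, simplify = FALSE)

# Pairwise intersections, labelled "C1-C2", "C1-C3", "C2-C3"
res <- lapply(pairs, function(p) intersect(cols[[p[1]]], cols[[p[2]]]))
names(res) <- vapply(pairs, function(p) paste0("C", p, collapse = "-"),
                     character(1))

# Intersection of all columns at once
res[["C1-C2-C3"]] <- Reduce(intersect, cols)

res
```

Running the same steps on the full df from the question should give the strings listed in the expected answer, without the NA entries.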

CodePudding user response:

You can use accumulate() from the purrr package as well as intersect() from base R to accomplish this. Something like:

library(purrr)

# First remove NA values so they don't show up in the intersect results
df <- map(df, ~ discard(.x, is.na))

accumulate(df, ~ base::intersect(.x, .y))

# output

List of 3
 
  $first
    "SNTM1"  "STTTT2" "STOLA"  "STOMQ"  "STR2"   "SUPTY1" 
    "TBNHSG" "TEYAH"  "TMEIL1" "TMEIL2"
    "TMEIL3" "TNIL"   "TREUK"  "TTRK"   "TRRFK"  "UBA52"  "YIPF1" 

  $second
    "STOMQ"  "TMEIL1" "YIPF1" 

  $second2
    "STOMQ" "YIPF1"

$second is the result of taking the intersection of the first and second columns and corresponds to column C1-C2 in your example above. $second2 is the result of taking the intersection of this result and second2, which corresponds to C1-C2-C3 above.
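Note that a running intersection only yields C1-C2 and C1-C2-C3; the C1-C3 and C2-C3 pairs from the question are not produced. One possible way to cover every combination of two or more columns, and to lay the result out like the table in the question, is sketched below in base R. It assumes a cut-down copy of the question's data, and the names cols, combos, shared and tab are hypothetical.

```r
# Cut-down copy of the question's data: a few values per column, padded with NA
df <- data.frame(
  first   = c("STOLA", "STOMQ", "TMEIL1", "YIPF1", NA, NA),
  second  = c(NA, "STOIU", "STOMQ", "TMEIL1", "YIPF1", NA),
  second2 = c(NA, NA, "GFD", "STOMQ", "TRS", "YIPF1"),
  stringsAsFactors = FALSE
)

# Unique non-NA values of each column
cols <- lapply(df, function(x) unique(x[!is.na(x)]))
n <- length(cols)

# Every combination of column indices of size 2, 3, ..., n
combos <- unlist(
  lapply(2:n, function(m) combn(n, m, simplify = FALSE)),
  recursive = FALSE
)

# Shared strings for each combination, labelled e.g. "C1-C2-C3"
shared <- lapply(combos, function(idx) Reduce(intersect, cols[idx]))
names(shared) <- vapply(combos, function(idx) paste0("C", idx, collapse = "-"),
                        character(1))

# Pad with NA to a common length so the results fit in one table
len <- max(lengths(shared))
padded <- lapply(shared, function(x) c(x, rep(NA_character_, len - length(x))))
tab <- do.call(data.frame,
               c(padded, list(check.names = FALSE, stringsAsFactors = FALSE)))

tab
```

The number of combinations grows quickly with the number of columns (2^n - n - 1 for n columns), so this approach is mainly practical for a modest number of columns.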

Tags: r