Compare the similarity of character vectors by position

I have the following dataset:

df <- data.frame(barcode = c("B1", "B2", "B3", "B4"),
                 sequence = sapply(1:4, function(x) paste(sample(c("A", "C", "T", "G"), 4, replace = TRUE), collapse = "")))

I want to know how similar each barcode's sequence is to the sequence of every other barcode in df$sequence, compared position by position.

Complete agreement would be 100%, one mismatched position (out of four) would be 75%, and so on.

Example: df$sequence contains (AATT, AATT, TATT, TATA)

The pairwise similarity matrix would then be:

     B1   B2   B3   B4
B1    x  100   75   50
B2  100    x   75   50
B3   75   75    x   75
B4   50   50   75    x

even though several of the barcodes have the same composition (B1, B2 and B4 each contain 2xT and 2xA). So the question is: how many positions hold the same character in two barcodes? How can I achieve this in R?

User response:

Using Levenshtein (edit) distance, or rather 1 - distance/length, as the similarity:

> 1-adist(df$sequence)/4

     [,1] [,2] [,3] [,4]
[1,] 1.00 0.75 0.25 0.25
[2,] 0.75 1.00 0.00 0.25
[3,] 0.25 0.00 1.00 0.50
[4,] 0.25 0.25 0.50 1.00

(assuming all sequences have length 4, hence the division by 4).
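One caveat with edit distance is that it allows insertions and deletions, so two strings that are shifted copies of each other can score as fairly similar even when no position holds the same character. The following is only a small sketch of that effect; the strings x and y are made-up illustration data, not taken from the question.

```r
# Made-up example: y is x shifted left by one character
x <- "ACGT"
y <- "CGTA"

# Edit distance: delete the leading "A" and append a trailing "A" (2 edits)
edit_dist <- drop(utils::adist(x, y))
edit_sim  <- 1 - edit_dist / nchar(x)

# Position-by-position comparison: count positions holding the same character
pos_matches <- sum(strsplit(x, "")[[1]] == strsplit(y, "")[[1]])
pos_sim     <- pos_matches / nchar(x)

c(edit = edit_sim, positional = pos_sim)
```

Here the edit-based similarity comes out at 0.5, while the positional similarity is 0, so the two measures answer different questions.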

Edit: I misunderstood your problem. Levenshtein distance allows insertions and deletions, so it finds the best alignment between two strings, shifting characters if necessary. You want an exact position-by-position match; in that case:

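# For every pair of sequences, count the positions that hold the same character,
# then divide by the sequence length (4) to get the fraction of matching positions.
sapply(df$sequence, function(x) {
  sapply(df$sequence, function(y) {
    sum(strsplit(x, "")[[1]] == strsplit(y, "")[[1]])
  })
}) / 4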

     ACAC AGAC CCTT CGCT
ACAC 1.00 0.75 0.25 0.00
AGAC 0.75 1.00 0.00 0.25
CCTT 0.25 0.00 1.00 0.50
CGCT 0.00 0.25 0.50 1.00

Or, for the other vector provided in the comments:

sapply(df$sequence,function(x){
  sapply(df$sequence,function(y){
    sum(strsplit(x,"")[[1]]==strsplit(y,"")[[1]])
  })
})/4

     GACC AAAC ACAC GCCA
GACC 1.00 0.50 0.25 0.50
AAAC 0.50 1.00 0.75 0.00
ACAC 0.25 0.75 1.00 0.25
GCCA 0.50 0.00 0.25 1.00
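To get the matrix in the form asked for in the question (percentages, labelled by barcode rather than by sequence), the same positional comparison can be wrapped in a small function. This is a sketch only: positional_similarity is a hypothetical helper name, it assumes all sequences have the same length, and it fills the diagonal with 100 where the question shows "x".

```r
# Hypothetical helper (not part of the original answer): percentage of
# positions at which each pair of equal-length strings agrees.
positional_similarity <- function(seqs, labels = seqs) {
  seqs <- as.character(seqs)                    # guard against factor columns
  stopifnot(length(unique(nchar(seqs))) == 1)   # assumption: equal lengths
  chars <- strsplit(seqs, "")                   # split each string only once
  n <- length(chars)
  sim <- matrix(NA_real_, n, n, dimnames = list(labels, labels))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # mean of a logical vector = fraction of matching positions
      sim[i, j] <- 100 * mean(chars[[i]] == chars[[j]])
    }
  }
  sim
}

# The example sequences from the question
ex <- data.frame(barcode  = c("B1", "B2", "B3", "B4"),
                 sequence = c("AATT", "AATT", "TATT", "TATA"),
                 stringsAsFactors = FALSE)

res <- positional_similarity(ex$sequence, ex$barcode)
res
```

Using the actual string length instead of a hard-coded 4 keeps the function usable for longer barcodes, as long as they all share one length.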