R, compare strings and count

Time:12-18

I have a data frame with two columns of short sequences, ip and ref. I am trying to count the gaps ("-") in the ref column relative to the ip column.

Here is the data frame:

df <- structure(list(ip = c("ATCGGGTTA", "AT--GATCT", "AT-GGATCT"), 
    ref = c("AT--GATCT", "ATCGGGTTA", "AT--GATCT"), gap = c(2L, 
    0L, 1L)), row.names = c(NA, -3L), class = "data.frame")

Logic: if ip is A, C, T or G and ref is "-" at the same position, count it as a gap.

I tried the following code:

mapply(function(x,y) if(x != '-' ) sum(y=="-"),strsplit(df$ip,""),strsplit(df$ref,""))

[1] 2 0 2

For the 3rd row, it doesn't return 1. It should only count positions where there is a letter in ip and a gap in ref at the same position.

Thanks!
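
The 2 for row 3 comes from `if (x != '-')` receiving a whole character vector rather than a single position. Older R versions use only the first element of that condition (with a warning), so `sum(y == "-")` counts every gap in ref; R 4.2.0 and later stop with an error instead. A minimal sketch of the position-wise comparison for row 3, where `ip_chars` and `ref_chars` are names chosen here for illustration:

```r
# Row 3 of the data frame, split into single characters
ip_chars  <- strsplit("AT-GGATCT", "")[[1]]
ref_chars <- strsplit("AT--GATCT", "")[[1]]

# Counting every gap in ref, regardless of ip, gives 2
all_ref_gaps <- sum(ref_chars == "-")

# Requiring a non-gap character in ip at the same position gives 1
gaps_vs_ip <- sum(ref_chars == "-" & ip_chars != "-")
```

The two conditions are combined with the vectorised `&`, so each position is tested on its own before the matches are summed.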

CodePudding user response:

Maybe this works for you. Definitely test it on a bigger dataset before using it in production, as there can be all sorts of corner cases where intuition breaks.

df
         ip       ref
1 ATCGGGTTA AT--GATCT
2 AT--GATCT ATCGGGTTA
3 AT-GGATCT AT--GATCT

# Per row: count positions where ref has a gap and ip does not
df$gap <- mapply(function(x, y) sum(y == "-" & x != "-"),
                 strsplit(df$ip, ""), strsplit(df$ref, ""))

df
         ip       ref gap
1 ATCGGGTTA AT--GATCT   2
2 AT--GATCT ATCGGGTTA   0
3 AT-GGATCT AT--GATCT   1
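
One corner case worth checking is an ip character that is neither a base nor a gap (for example N), which a plain "not a gap" test would still count. Below is a hedged sketch that follows the question's stated logic literally, counting only A, C, G or T in ip against a gap in ref. `count_ref_gaps` is a hypothetical helper name, and treating unequal sequence lengths as an error is an assumption, not something the question specifies:

```r
# Hypothetical helper: count positions where ip has A/C/G/T and ref has "-"
count_ref_gaps <- function(ip, ref) {
  # Assumption: each ip/ref pair is aligned, so the lengths must match
  stopifnot(length(ip) == length(ref), all(nchar(ip) == nchar(ref)))
  mapply(
    function(x, y) sum(x %in% c("A", "C", "G", "T") & y == "-"),
    strsplit(ip, ""), strsplit(ref, "")
  )
}

# The three rows from the question plus one made-up row containing N
ip   <- c("ATCGGGTTA", "AT--GATCT", "AT-GGATCT", "ATNNGATCT")
ref  <- c("AT--GATCT", "ATCGGGTTA", "AT--GATCT", "AT--GATCT")
gaps <- count_ref_gaps(ip, ref)
```

With the question's data frame this could be used as `df$gap <- count_ref_gaps(df$ip, df$ref)`. Whether N or lowercase bases should count is a decision for the data at hand.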