I have a data frame with two columns of short sequences, ip and ref. I am trying to count the gaps ("-") in the ref column relative to ip.
Here is the data frame:
df <- structure(list(ip = c("ATCGGGTTA", "AT--GATCT", "AT-GGATCT"),
ref = c("AT--GATCT", "ATCGGGTTA", "AT--GATCT"), gap = c(2L,
0L, 1L)), row.names = c(NA, -3L), class = "data.frame")
Logic: a position counts as a gap if ip has a letter (A, C, T, or G) and ref has "-" at that same position.
I tried the following code:
mapply(function(x,y) if(x != '-' ) sum(y=="-"),strsplit(df$ip,""),strsplit(df$ref,""))
[1] 2 0 2
For the 3rd row it returns 2, but it should return 1. A gap should only be counted when there is a letter in ip and a gap in ref at the same position.
Thanks!
CodePudding user response:
This may work for you: split both strings into characters and count the positions where ref has "-" and ip does not. Test it on a bigger dataset before using it in production, since corner cases such as sequences of unequal length can break it.
df
         ip       ref
1 ATCGGGTTA AT--GATCT
2 AT--GATCT ATCGGGTTA
3 AT-GGATCT AT--GATCT
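# A position counts as a gap when ref has "-" and ip does not.
# With equal-length sequences, mapply() returns a logical matrix with one
# column per row of df, so colSums() gives the per-row gap count.
df$gap <- colSums(mapply(function(x, y) y == "-" & x != "-",
                         strsplit(df$ip, ""), strsplit(df$ref, "")))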
df
         ip       ref gap
1 ATCGGGTTA AT--GATCT   2
2 AT--GATCT ATCGGGTTA   0
3 AT-GGATCT AT--GATCT   1
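As for why the attempt in the question returns 2 for the 3rd row: `if (x != '-')` is given the whole character vector, not one position at a time. Older versions of R only look at its first element (with a warning), and R 4.2.0 and later raise an error. When the branch runs, `sum(y == "-")` counts every gap in ref regardless of what ip has at that position. Below is a minimal sketch of a per-position count that also tolerates sequences of different lengths. The helper name `count_ref_gaps` is made up for this example, and comparing only the overlapping prefix of each pair is an assumption, not something the question specifies.

```r
# Hypothetical helper (not part of the original answer): for each pair of
# sequences, count positions where ref has "-" and ip has a non-gap character.
count_ref_gaps <- function(ip, ref) {
  stopifnot(length(ip) == length(ref))
  ip_chars  <- strsplit(ip, "")
  ref_chars <- strsplit(ref, "")
  vapply(seq_along(ip_chars), function(i) {
    x <- ip_chars[[i]]
    y <- ref_chars[[i]]
    # Assumption: if the two sequences differ in length, compare only the
    # positions they share instead of letting R recycle the shorter one.
    n <- min(length(x), length(y))
    idx <- seq_len(n)
    sum(y[idx] == "-" & x[idx] != "-")
  }, integer(1))
}

df <- data.frame(ip  = c("ATCGGGTTA", "AT--GATCT", "AT-GGATCT"),
                 ref = c("AT--GATCT", "ATCGGGTTA", "AT--GATCT"))
df$gap <- count_ref_gaps(df$ip, df$ref)
```

Because `vapply()` always returns one integer per row, this does not depend on `mapply()` simplifying to a matrix, which only happens when every sequence has the same length.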