Counting Every Position Where a Pattern Appears

Time:06-14

I have this data I randomly created:

set.seed(999)
col1 = sample.int(5, 200, replace = TRUE)
col2 = sample.int(5, 200, replace = TRUE)
col3 = sample.int(5, 200, replace = TRUE)
col4 = sample.int(5, 200, replace = TRUE)
col5 = sample.int(5, 200, replace = TRUE)
col6 = sample.int(5, 200, replace = TRUE)
col7 = sample.int(5, 200, replace = TRUE)
col8 = sample.int(5, 200, replace = TRUE)
col9 = sample.int(5, 200, replace = TRUE)
col10 = sample.int(5, 200, replace = TRUE)

d = data.frame(
  id = 1:10,
  seq = c(paste(col1, collapse = ""), paste(col2, collapse = ""),
          paste(col3, collapse = ""), paste(col4, collapse = ""),
          paste(col5, collapse = ""), paste(col6, collapse = ""),
          paste(col7, collapse = ""), paste(col8, collapse = ""),
          paste(col9, collapse = ""), paste(col10, collapse = ""))
)

For each row, I want to record all positions where the pattern "11" appears. I found something similar in the "stringr" library:

library(stringr)
pattern = "11"
d$position_ <- str_locate(d$seq, pattern = pattern)

This "kinda" worked, but it only recorded the first instance of the desired pattern. For each row, is there some function I can use that will record:

  • The positions at which the desired pattern appears (in each row)
  • The number of times the desired pattern appears (in each row)?

Thank you!

CodePudding user response:

Try gregexpr(), which returns the start position of every match in each string:

gregexpr("11", d$seq)

[[1]]
[1]   4  29  96  99 144 167 191
attr(,"match.length")
[1] 2 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[2]]
[1]  68  75 119 123 192 198
attr(,"match.length")
[1] 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[3]]
[1]  20  31  59  75  82 118 165 182
attr(,"match.length")
[1] 2 2 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[4]]
[1]  18  25  31  42  46  61 198
attr(,"match.length")
[1] 2 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[5]]
 [1]  36  48  64  87 133 135 139 146 187 190
attr(,"match.length")
 [1] 2 2 2 2 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[6]]
[1]  17  31  86 141 195 197
attr(,"match.length")
[1] 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[7]]
[1]  40  78 156 179
attr(,"match.length")
[1] 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[8]]
[1]  20  23  64  90 119 126 129 131 196
attr(,"match.length")
[1] 2 2 2 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[9]]
[1]   3  10  34  80 114 122 185 195
attr(,"match.length")
[1] 2 2 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[10]]
[1]  92 136 149 176 180
attr(,"match.length")
[1] 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
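
The raw gregexpr() result still carries attributes and uses -1 to mark "no match", so it may help to reduce it to plain position vectors and counts. The following is a minimal sketch on a small made-up vector (seqs and toy are hypothetical names, not the poster's data). It also shows a lookahead variant, since a plain "11" search skips overlapping matches: in "1111" it reports positions 1 and 3 but not 2.

```r
# Toy input (hypothetical, standing in for d$seq)
seqs <- c("2113114", "11511", "2345", "1111")

# Non-overlapping matches of the literal string "11"
m <- gregexpr("11", seqs, fixed = TRUE)

# gregexpr() returns -1 for strings without a match, so keep only real positions
positions <- lapply(m, function(x) as.integer(x[x > 0]))
counts <- lengths(positions)

# Overlapping matches: a zero-width lookahead matches at every start of "11"
m_overlap <- gregexpr("(?=11)", seqs, perl = TRUE)
positions_overlap <- lapply(m_overlap, function(x) as.integer(x[x > 0]))
counts_overlap <- lengths(positions_overlap)

# Rows have different numbers of matches, so positions go into a list column
toy <- data.frame(id = seq_along(seqs), seq = seqs)
toy$positions <- positions
toy$n <- counts
```

Applied to the question's data, the same steps would presumably use d$seq in place of seqs. Whether overlapping matches should count depends on what "appears" means for the analysis.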

CodePudding user response:

You can use str_locate_all(), which returns a list with one matrix per string holding the start and end positions of every match. For your case,

stringr::str_locate_all(d$seq, '11')

[[1]]
     start end
[1,]     4   5
[2,]    29  30
[3,]    96  97
[4,]    99 100
[5,]   144 145
[6,]   167 168
[7,]   191 192

[[2]]
     start end
[1,]    68  69
[2,]    75  76
[3,]   119 120
[4,]   123 124
[5,]   192 193
[6,]   198 199

...
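
To get both things the question asks for with stringr, the start column can be pulled out of each matrix and str_count() gives the number of matches. This is a rough sketch on a made-up vector (seqs is a hypothetical stand-in for d$seq); like the call above, it counts non-overlapping matches only.

```r
library(stringr)

# Toy input (hypothetical, standing in for d$seq)
seqs <- c("2113114", "11511", "2345", "1111")

# One matrix per string with columns "start" and "end"
loc <- str_locate_all(seqs, fixed("11"))

# Keep only the start positions; strings without a match give an empty vector
starts <- lapply(loc, function(m) unname(m[, "start"]))

# Number of (non-overlapping) matches per string
counts <- str_count(seqs, fixed("11"))
```

The counts here agree with the number of rows in each matrix, so sapply(loc, nrow) would be an alternative to str_count().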