Counting Every Position Where a Pattern Appears

I have this randomly generated data:

set.seed(999)
col1 = sample.int(5, 200, replace = TRUE)
col2 = sample.int(5, 200, replace = TRUE)
col3 = sample.int(5, 200, replace = TRUE)
col4 = sample.int(5, 200, replace = TRUE)
col5 = sample.int(5, 200, replace = TRUE)
col6 = sample.int(5, 200, replace = TRUE)
col7 = sample.int(5, 200, replace = TRUE)
col8 = sample.int(5, 200, replace = TRUE)
col9 = sample.int(5, 200, replace = TRUE)
col10 = sample.int(5, 200, replace = TRUE)

d = data.frame(id = 1:10, seq =  c(paste(col1, collapse = ""),  paste(col2, collapse = ""),  paste(col3, collapse = ""),  paste(col4, collapse = ""),  paste(col5, collapse = ""),  paste(col6, collapse = ""),  paste(col7, collapse = ""),  paste(col8, collapse = ""),  paste(col9, collapse = ""), paste(col10, collapse = "")))

For each row, I want to record all positions where the pattern "11" appears. I found something similar in the "stringr" library:

library(stringr)
pattern = "11"
d$position_ <- str_locate(d$seq, pattern = pattern)

This "kinda" worked, but it only recorded the first instance of the desired pattern. For each row, is there some function I can use that will record:

The positions at which the desired pattern appears (in each row)
The number of times the desired pattern appears (in each row)?

Thank you!

CodePudding user response：

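Try gregexpr(). It returns a list with one element per row, each holding the start positions of every (non-overlapping) match: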

gregexpr("11", d$seq)

[[1]]
[1]   4  29  96  99 144 167 191
attr(,"match.length")
[1] 2 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[2]]
[1]  68  75 119 123 192 198
attr(,"match.length")
[1] 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[3]]
[1]  20  31  59  75  82 118 165 182
attr(,"match.length")
[1] 2 2 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[4]]
[1]  18  25  31  42  46  61 198
attr(,"match.length")
[1] 2 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[5]]
 [1]  36  48  64  87 133 135 139 146 187 190
attr(,"match.length")
 [1] 2 2 2 2 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[6]]
[1]  17  31  86 141 195 197
attr(,"match.length")
[1] 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[7]]
[1]  40  78 156 179
attr(,"match.length")
[1] 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[8]]
[1]  20  23  64  90 119 126 129 131 196
attr(,"match.length")
[1] 2 2 2 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[9]]
[1]   3  10  34  80 114 122 185 195
attr(,"match.length")
[1] 2 2 2 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

[[10]]
[1]  92 136 149 176 180
attr(,"match.length")
[1] 2 2 2 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
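
To get from that list to the two things asked for (positions and counts per row), something like the following might work. This is only a sketch on a small made-up vector (seqs is hypothetical, not the data from the question); the same calls should apply to d$seq. Note that gregexpr() reports -1 for a string with no match, so those need to be dropped before counting. Also, matches are non-overlapping: in row 5 above, 133 and 135 both appear, so characters 133 to 136 are all "1" and the possible start at 134 is not reported. If overlapping hits matter, a zero-width lookahead with perl = TRUE is one option.

```r
# Toy input (hypothetical), used here instead of d$seq
seqs <- c("1121134", "2345", "11511")

# One list element per string, holding the match start positions
m <- gregexpr("11", seqs, fixed = TRUE)

# gregexpr() returns -1 when nothing matches, so keep only real positions
positions <- lapply(m, function(x) as.integer(x[x > 0]))

# Number of (non-overlapping) matches per string
counts <- lengths(positions)

# Overlapping matches: the lookahead matches at every position
# that is followed by "11", without consuming any characters
overlapping <- as.integer(gregexpr("(?=11)", "1111", perl = TRUE)[[1]])
```

With the real data, positions and counts could then be stored as d$positions (a list column) and d$n.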

CodePudding user response：

You can use str_locate_all(), which returns a list with the start and end positions of every match, one matrix per row. For your case,

stringr::str_locate_all(d$seq, '11')

[[1]]
     start end
[1,]     4   5
[2,]    29  30
[3,]    96  97
[4,]    99 100
[5,]   144 145
[6,]   167 168
[7,]   191 192

[[2]]
     start end
[1,]    68  69
[2,]    75  76
[3,]   119 120
[4,]   123 124
[5,]   192 193
[6,]   198 199

...
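
For the count part of the question, stringr also has str_count(). Below is a rough sketch of keeping both the count and the start positions next to each row. It uses a small made-up data frame (d_toy, n_matches, and starts are names I picked for illustration), and it assumes a list column is an acceptable way to store a variable number of positions per row.

```r
library(stringr)

# Toy data frame (hypothetical), standing in for d
d_toy <- data.frame(id = 1:3, seq = c("1121134", "2345", "11511"))

# One start/end matrix per row; a row with no match gives a 0-row matrix
loc <- str_locate_all(d_toy$seq, fixed("11"))

# Number of (non-overlapping) matches per row
d_toy$n_matches <- str_count(d_toy$seq, fixed("11"))

# Start positions per row, stored as a list column
d_toy$starts <- lapply(loc, function(m) m[, "start"])
```

Like gregexpr(), these functions count non-overlapping matches, so a run such as "111" contributes one match, not two.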