Dealing with sequence identification in R

I have a specific sequence (pattern) that I need to detect in a data frame. The sequence is 1-2-3-4-5-6 (in that specific order). I need to count how many times the sequence was interrupted (broken), and we know it was interrupted whenever the character "X" appears. For example, if I have the sequence:

1, 2, 3, 4, X, 5, 6

it means that the sequence was broken after "4". Another way to say it is "it was broken between stages 4 and 5".

The objective is to quantify how many times the sequence was broken after each stage, that is, how many times the X appeared after the character 1, how many times after character 2, and so on.

Let's say we have the following dataset:

sample<-c(1,2,"X",3,4,5,6,1,2,3,4,5,"X",6,1,2,3,4,"X",5,6,1,"X",2,3,4,5,6,1,2,3,4,5,6)

Then I can say that the sequence was broken (n times):

After stage 1 = 1 time

After stage 2 = 1 time

After stage 3 = 0 times

After stage 4 = 1 time

After stage 5 = 1 time

After stage 6 = 0 times

Thank you all so much for the help. I am trying to come up with a solution that is suited to a large dataset, but I am still learning. If you don't know the answer but can point me to some books, blogs, or documentation for relevant functions, that would be great!

CodePudding user response：

Just count the occurrences:

table(sample[which(sample == "X")-1])
# 1 2 4 5 
# 1 1 1 1
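To see why this works, it may help to split the one-liner into steps. The following is only a sketch of the same idea; the intermediate names `x_pos`, `prev_stage`, and `counts` are illustrative and not part of the answer above.

```r
sample <- c(1,2,"X",3,4,5,6,1,2,3,4,5,"X",6,1,2,3,4,"X",5,6,1,"X",2,3,4,5,6,1,2,3,4,5,6)

# Positions at which an "X" occurs
x_pos <- which(sample == "X")
# 3 13 19 23

# The element just before each "X" is the stage after which the break happened
prev_stage <- sample[x_pos - 1]
# "2" "5" "4" "1"

# Count the breaks per stage (only stages that actually occur are listed)
counts <- table(prev_stage)
```

Note that `c()` coerces the numbers to character because of the "X", so the stages are compared and tabulated as strings.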

Count the occurrences, including 0s for the other possible stages:

table(c(unique(setdiff(sample, "X")), sample[which(sample == "X")-1])) - 1
# 1 2 3 4 5 6 
# 1 1 0 1 1 0
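A possible alternative to the add-then-subtract trick is to turn the preceding stages into a factor with explicit levels, since `table()` reports a zero for every level that does not occur. This is a sketch that assumes the stages are known in advance to be 1 through 6; `stages`, `prev_stage`, and `counts` are names chosen here for illustration.

```r
sample <- c(1,2,"X",3,4,5,6,1,2,3,4,5,"X",6,1,2,3,4,"X",5,6,1,"X",2,3,4,5,6,1,2,3,4,5,6)

# Assumed set of valid stages, as strings because sample is a character vector
stages <- as.character(1:6)

# Stage immediately before each "X"
prev_stage <- sample[which(sample == "X") - 1]

# A factor with fixed levels keeps stages that never precede an "X" at count 0
counts <- table(factor(prev_stage, levels = stages))
# 1 2 3 4 5 6
# 1 1 0 1 1 0
```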

FYI, the use of which(.)-1 omits a count if the first "X" is the first element of sample (the resulting index 0 is silently dropped). Since you said you need to know the stages after which the "X" occurs, this does not appear to be a problem. If it is, one could always prepend a canary value of sorts to sample, à la

table(c("OOPS", unique(setdiff(sample, "X")), c("OOPS", sample)[which(c("OOPS", sample) == "X")-1])) - 1
#    1    2    3    4    5    6 OOPS 
#    1    1    0    1    1    0    0 

sample[1] <- "X"
table(c("OOPS", unique(setdiff(sample, "X")), c("OOPS", sample)[which(c("OOPS", sample) == "X")-1])) - 1
#    1    2    3    4    5    6 OOPS 
#    1    1    0    1    1    0    1
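Another way to handle a leading "X" without a sentinel string, offered here as an assumption rather than as part of the answer, is to build a lagged copy of the vector with `NA` in the first slot. The short vector `s` and the names `prev`, `prev_stage`, and `counts` are hypothetical and used only for this sketch.

```r
# Hypothetical shortened input that starts with an "X"
s <- c("X", 2, "X", 3, 4, 5, 6, 1, 2, 3, 4, 5, "X", 6)

# Lagged copy: prev[i] is the value that came before s[i]; the first element has none
prev <- c(NA, head(s, -1))

# Stage preceding each "X"; the leading "X" yields NA instead of being dropped
prev_stage <- prev[s == "X"]
# NA "2" "5"

# Fixed levels give zeros for unused stages; useNA adds an NA column for the leading "X"
counts <- table(factor(prev_stage, levels = as.character(1:6)), useNA = "ifany")
```

Here the `NA` column plays the role of the "OOPS" entry above.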

CodePudding user response：

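We may also do this with tabulate

# tabulate() needs integer bins, so convert the stage preceding each "X"
tabulate(as.integer(sample[which(sample == "X") - 1]), nbins = 6)
# [1] 1 1 0 1 1 0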