Identifying, extracting and counting patterns in sequences

Hello lovely and nice people of SO. I'm working with a data frame that contains only two columns: the first holds a unique ID generated by a virtual machine, and the second holds a name, but this column may also contain the string "ERROR". The objective is to create a script that identifies every occurrence of the string "ERROR" and captures the names immediately before and after it, along with the unique ID assigned to that "ERROR" row. To illustrate, let's look at the following example:

If I have this data

ID	NAMES
1	James
3	ERROR
6	Keras
88	Kelly
53	Micheal
55	ERROR
7	Cindy
834	Keras

Then we would like to come up with the following list:

ID	NAMES
3	James-Keras
55	Micheal-Cindy

This is because the first "ERROR" found had an ID of 3 and sat between the names James (before ERROR) and Keras (after ERROR); the next "ERROR" had an ID of 55 and sat between Micheal and Cindy. What if "ERROR" is at the top or the bottom of the list? Then we should only include whatever name we find; it is OK to have, let's say, "NA-NAME" if ERROR was found at the top.

But here is where it gets tricky: if we ever run into a sequence of consecutive "ERROR" strings, we should always use the very last one, going down the list, as the "guide". For instance:

If I have this data set

ID	NAMES
1	James
3	ERROR
6	ERROR
88	ERROR
53	Jude
55	ERROR
7	Cindy
834	Keras

then we will want to have

ID	NAMES
88	James-Jude
55	Jude-Cindy

This is because the string "ERROR" was repeated 3 times consecutively and the last one was at ID 88, so we take that one as the reference and record the names before and after the run. Another way of seeing this is to treat consecutive "ERROR" strings as a block: we record the names before and after each block, together with the ID of the block's last row.

Thank you so much to everyone who is trying to help me out. I'd really appreciate it if you could point me to a book or to functions that could help.

CodePudding user response：

We may create a function to do this

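f1 <- function(dat) {

    # keep the last row of each run of consecutive ERROR / non-ERROR rows
    subdat1 <- subset(dat, !duplicated(with(rle(NAMES == "ERROR"), 
           rep(seq_along(values), lengths)), fromLast = TRUE))
    # keep the first row of each run
    subdat2 <- subset(dat, !duplicated(with(rle(NAMES == "ERROR"), 
          rep(seq_along(values), lengths))))
    # both subsets have one row per run, so the same index i lines up in each
    ind <- which(subdat1$NAMES == "ERROR")
    # ID of the last ERROR in the block, the name before it and the name after it
    do.call(rbind, lapply(ind[c(TRUE, diff(ind) > 1)], function(i) 
        data.frame(ID = subdat1$ID[i], NAMES = paste(subdat1$NAMES[i - 1], 
        subdat2$NAMES[i + 1], sep = "-"))))
}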
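
To see why the function keeps two subsets, it can help to run the run-length step on its own. This is a minimal sketch on a plain character vector; the names `names_vec`, `run_id`, `last_of_run` and `first_of_run` are illustrative and not part of the function above.

```r
# Same NAMES column as the second example in the question
names_vec <- c("James", "ERROR", "ERROR", "ERROR", "Jude", "ERROR", "Cindy", "Keras")

# rle() finds runs of consecutive TRUE/FALSE; rep() gives every row its run number
runs   <- rle(names_vec == "ERROR")
run_id <- rep(seq_along(runs$values), runs$lengths)
run_id
# 1 2 2 2 3 4 5 5

# Last row of each run: gives the name just before a block and the block's last ERROR
last_of_run  <- !duplicated(run_id, fromLast = TRUE)
# First row of each run: gives the name just after a block
first_of_run <- !duplicated(run_id)

names_vec[last_of_run]
# "James" "ERROR" "Jude" "ERROR" "Keras"
names_vec[first_of_run]
# "James" "ERROR" "Jude" "ERROR" "Cindy"
```

Because both selections contain exactly one element per run, position `i` refers to the same run in each, which is what lets the function read the "before" name from one subset and the "after" name from the other.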

-testing

> f1(df1)
  ID         NAMES
1  3   James-Keras
2 55 Micheal-Cindy
> f1(df2)
  ID      NAMES
1 88 James-Jude
2 55 Jude-Cindy

data

df1 <- structure(list(ID = c(1L, 3L, 6L, 88L, 53L, 55L, 7L, 834L), NAMES = c("James", 
"ERROR", "Keras", "Kelly", "Micheal", "ERROR", "Cindy", "Keras"
)), class = "data.frame", row.names = c(NA, -8L))

df2 <- structure(list(ID = c(1L, 3L, 6L, 88L, 53L, 55L, 7L, 834L), NAMES = c("James", 
"ERROR", "ERROR", "ERROR", "Jude", "ERROR", "Cindy", "Keras")), 
 class = "data.frame", row.names = c(NA, 
-8L))
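
The question also asks for "NA-NAME" when "ERROR" sits at the top or bottom of the list. In `f1`, an ERROR in the first row makes `subdat1$NAMES[i - 1]` index position 0, which returns a zero-length vector rather than NA, so the "before" part would probably not come out as "NA". One possible way to handle this, sketched below under the assumption that NA padding on both ends is wanted, is to compute the neighbour positions directly and set out-of-range ones to NA. The function `error_neighbours` and the data frame `df3` are hypothetical and not part of the original answer.

```r
# Hypothetical helper: one output row per block of consecutive "ERROR" rows,
# with NA in place of a missing neighbour at the top or bottom.
error_neighbours <- function(dat) {
  r      <- rle(dat$NAMES == "ERROR")
  ends   <- cumsum(r$lengths)          # last row of each run
  starts <- ends - r$lengths + 1L      # first row of each run
  err    <- which(r$values)            # runs that are ERROR blocks

  before <- starts[err] - 1L           # row just above each block
  after  <- ends[err] + 1L             # row just below each block
  before[before < 1L]        <- NA_integer_
  after[after > nrow(dat)]   <- NA_integer_

  data.frame(ID    = dat$ID[ends[err]],   # ID of the last ERROR in the block
             NAMES = paste(dat$NAMES[before], dat$NAMES[after], sep = "-"),
             stringsAsFactors = FALSE)
}

# Made-up example with ERROR in the first and last rows
df3 <- data.frame(ID = c(1L, 2L, 3L, 4L),
                  NAMES = c("ERROR", "Ann", "Bob", "ERROR"),
                  stringsAsFactors = FALSE)
error_neighbours(df3)
#   ID  NAMES
# 1  1 NA-Ann
# 2  4 Bob-NA
```

On `df1` and `df2` this sketch should give the same rows as `f1`; it differs only in how the edges are treated.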