R - Finding rows with repeated elements

I'm having some trouble making a regex that'll match rows with repeated elements...

I have a dataframe that looks something like this:

View(df)

ID	Sequence
ID_1	ATCGATTTCGAGGGCGTACG
ID_2	ATGCAGTAGCCCCATCGAGT
ID_3	ACGTAAAACGTCGAGAGAGT
ID_4	GAAGATCGTCGTCGTCGTCG
ID_5	ACTGTAGCTCGAAAGGGCCC

I'm trying to find the row that contains an element that is repeated consecutively at least 5 times. In this example, that would be row 4, since the desired element TCG is repeated 5 times in a row.

But when I do:

which (grepl (x = df$Sequence, pattern = "TCG{5, }"))

It returns all 5 rows, because all of them contain TCG repeated once somewhere inside the string.

I'm trying to learn more about regex, but I'd GREATLY appreciate any help right now!!!

CodePudding user response：

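You can use () to group TCG. Without the parentheses, a quantifier such as {5,} applies only to the character right before it (the G), so TCG{5,} means "TC followed by at least five Gs". With the parentheses, it applies to the whole unit TCG. Note that the quantifier is written without a space inside the braces: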

> with(df,grep("(TCG){5,}", Sequence))
[1] 4
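
To see the effect of the grouping side by side, here is a minimal base R sketch. It rebuilds the example data from the question with data.frame() (the variable names `ungrouped` and `grouped` are just illustrative) and runs both patterns:

```r
# Example data copied from the question
df <- data.frame(
  ID = paste0("ID_", 1:5),
  Sequence = c("ATCGATTTCGAGGGCGTACG",
               "ATGCAGTAGCCCCATCGAGT",
               "ACGTAAAACGTCGAGAGAGT",
               "GAAGATCGTCGTCGTCGTCG",
               "ACTGTAGCTCGAAAGGGCCC")
)

# Ungrouped: {5,} binds only to "G", i.e. "TC" followed by five or more "G"
ungrouped <- grep("TCG{5,}", df$Sequence)

# Grouped: {5,} binds to the whole unit "TCG"
grouped <- grep("(TCG){5,}", df$Sequence)

print(ungrouped)  # no row has "TC" followed by five G's
print(grouped)    # row 4
```

Since grep() and grepl() search anywhere in the string, (TCG){5} and (TCG){5,} select the same rows here: any run of more than five repeats also contains a run of exactly five.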

CodePudding user response：

With dplyr and grepl:

library(dplyr)

df <- read.table(text = "
  ID    Sequence
  ID_1  ATCGATTTCGAGGGCGTACG
  ID_2  ATGCAGTAGCCCCATCGAGT
  ID_3  ACGTAAAACGTCGAGAGAGT
  ID_4  GAAGATCGTCGTCGTCGTCG
  ID_5  ACTGTAGCTCGAAAGGGCCC
                 ", header = TRUE)


df |> filter(grepl('(TCG){5}', Sequence))
#>     ID             Sequence
#> 1 ID_4 GAAGATCGTCGTCGTCGTCG
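
If the repeated element is not known in advance, a backreference may help. The following is only a sketch under the assumption that the element is always exactly 3 letters from the alphabet A, C, G, T; the names `sequences`, `any_triplet`, and `hits` are made up for this illustration:

```r
# Example sequences copied from the question, named by ID
sequences <- c(
  ID_1 = "ATCGATTTCGAGGGCGTACG",
  ID_2 = "ATGCAGTAGCCCCATCGAGT",
  ID_3 = "ACGTAAAACGTCGAGAGAGT",
  ID_4 = "GAAGATCGTCGTCGTCGTCG",
  ID_5 = "ACTGTAGCTCGAAAGGGCCC"
)

# ([ACGT]{3}) captures any 3-letter unit;
# \\1{4,} then requires at least 4 more copies of that same unit,
# giving 5 or more consecutive repeats in total
any_triplet <- grepl("([ACGT]{3})\\1{4,}", sequences, perl = TRUE)

hits <- names(sequences)[any_triplet]
print(hits)
```

With the example data this selects only ID_4. For elements of a different length, the {3} would need to change accordingly.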