R - Finding rows with repeated elements

Time:12-19

I'm having some trouble making a regex that'll match rows with repeated elements...

I have a dataframe that looks something like this:

View (df)

ID Sequence
ID_1 ATCGATTTCGAGGGCGTACG
ID_2 ATGCAGTAGCCCCATCGAGT
ID_3 ACGTAAAACGTCGAGAGAGT
ID_4 GAAGATCGTCGTCGTCGTCG
ID_5 ACTGTAGCTCGAAAGGGCCC

I'm trying to find the row that contains an element repeated consecutively at least 5 times. In this example, that would be row 4, since the desired element TCG is repeated consecutively 5 times.

But when I do:

which (grepl (x = df$Sequence, pattern = "TCG{5, }"))

It returns all 5 rows, because all of them contain TCG repeated once somewhere inside the string.

I'm trying to learn more about regex, but I'd GREATLY appreciate any help right now!!!

CodePudding user response:

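A quantifier applies only to the element immediately before it, so you can use () to group TCG and make {5,} apply to the whole TCG unit: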

> with(df,grep("(TCG){5,}", Sequence))
[1] 4
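
To see the difference the parentheses make, here is a minimal base R sketch. It assumes the sample data from the question, rebuilt with data.frame(), and writes the quantifier as {5,} with no space inside the braces. The variable names ungrouped, grouped and matched are just for illustration.

```r
# Sample data copied from the question
df <- data.frame(
  ID = paste0("ID_", 1:5),
  Sequence = c("ATCGATTTCGAGGGCGTACG",
               "ATGCAGTAGCCCCATCGAGT",
               "ACGTAAAACGTCGAGAGAGT",
               "GAAGATCGTCGTCGTCGTCG",
               "ACTGTAGCTCGAAAGGGCCC")
)

# Without parentheses, {5,} quantifies only the G:
# the pattern means "TC" followed by five or more G's
ungrouped <- grep("TCG{5,}", df$Sequence)

# With parentheses, {5,} quantifies the whole TCG unit
grouped <- grep("(TCG){5,}", df$Sequence)

# regmatches() returns the exact substring that matched
matched <- regmatches(df$Sequence, regexpr("(TCG){5,}", df$Sequence))

print(ungrouped)  # no row contains TC followed by five G's
print(grouped)    # row 4
print(matched)    # "TCGTCGTCGTCGTCG"
```

regmatches() is handy while learning regex, since it shows what a pattern actually captured rather than only which rows matched.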

CodePudding user response:

With dplyr and grepl (the pattern is not anchored, so {5} also matches rows with more than five consecutive repeats):

library(dplyr)

df <- read.table(text = "
  ID    Sequence
  ID_1  ATCGATTTCGAGGGCGTACG
  ID_2  ATGCAGTAGCCCCATCGAGT
  ID_3  ACGTAAAACGTCGAGAGAGT
  ID_4  GAAGATCGTCGTCGTCGTCG
  ID_5  ACTGTAGCTCGAAAGGGCCC
                 ", header = TRUE)


df |> filter(grepl('(TCG){5}', Sequence))
#>     ID             Sequence
#> 1 ID_4 GAAGATCGTCGTCGTCGTCG
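
If the motif or the minimum repeat count changes often, the pattern can be built from parameters. The sketch below is only one way to do it: find_repeats() is a hypothetical helper, not part of either answer or of any package, and it assumes the motif consists of plain letters with no regex metacharacters.

```r
# Hypothetical helper (not from the original answers): returns the indices of
# the sequences in which `motif` occurs at least `min_times` times in a row.
# Assumes `motif` contains only plain letters (no regex metacharacters).
find_repeats <- function(sequences, motif, min_times = 5) {
  # Build e.g. "(TCG){5,}" from the motif and the minimum count
  pattern <- sprintf("(%s){%d,}", motif, as.integer(min_times))
  grep(pattern, sequences)
}

# Sequences copied from the question
sequences <- c("ATCGATTTCGAGGGCGTACG",
               "ATGCAGTAGCCCCATCGAGT",
               "ACGTAAAACGTCGAGAGAGT",
               "GAAGATCGTCGTCGTCGTCG",
               "ACTGTAGCTCGAAAGGGCCC")

hits <- find_repeats(sequences, "TCG", 5)      # only the 4th sequence
hits_any <- find_repeats(sequences, "TCG", 1)  # every sequence has TCG at least once

print(hits)
print(hits_any)
```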