I'm having some trouble making a regex that'll match rows with repeated elements...
I have a dataframe that looks something like this:
View (df)
ID | Sequence |
---|---|
ID_1 | ATCGATTTCGAGGGCGTACG |
ID_2 | ATGCAGTAGCCCCATCGAGT |
ID_3 | ACGTAAAACGTCGAGAGAGT |
ID_4 | GAAGATCGTCGTCGTCGTCG |
ID_5 | ACTGTAGCTCGAAAGGGCCC |
I'm trying to find the row that contains an element that is repeated consecutively at least 5 times. In this example, it would be row 4, since the desired element TCG is repeated consecutively 5 times.
But when I do:
which (grepl (x = df$Sequence, pattern = "TCG{5, }"))
It returns all 5 rows, because all of them contain TCG at least once somewhere inside the string.
I'm trying to learn more about regex, but I'd GREATLY appreciate any help right now!!!
CodePudding user response:
You can use () to group TCG, so that the quantifier applies to the whole motif instead of only to the final G:
> with(df,grep("(TCG){5,}", Sequence))
[1] 4
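To see why the grouping matters, here is a minimal sketch in base R using the sequences from the question. The variable names seqs, ungrouped, grouped, and run are only illustrative, and the patterns are written without the space inside the braces. Without parentheses, {5,} binds only to the preceding G, so the pattern asks for TC followed by five or more Gs; with parentheses it applies to the whole TCG motif.

```r
# Sequences copied from the question, in row order (ID_1 to ID_5)
seqs <- c(
  "ATCGATTTCGAGGGCGTACG",
  "ATGCAGTAGCCCCATCGAGT",
  "ACGTAAAACGTCGAGAGAGT",
  "GAAGATCGTCGTCGTCGTCG",
  "ACTGTAGCTCGAAAGGGCCC"
)

# Ungrouped: {5,} quantifies only the final G, i.e. "TC" followed by 5+ Gs
ungrouped <- grepl("TCG{5,}", seqs)

# Grouped: {5,} quantifies the whole TCG motif
grouped <- grepl("(TCG){5,}", seqs)

which(ungrouped)  # no row has TC followed by five Gs
which(grouped)    # 4

# Extract the matched run to check how many times the motif repeats
run <- regmatches(seqs, regexpr("(TCG){5,}", seqs))
nchar(run) / 3    # 5
```

Because grepl searches anywhere in the string, (TCG){5} and (TCG){5,} select the same rows here; the open-ended form only changes how much of the run regmatches returns.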
CodePudding user response:
With dplyr and grepl:
library(dplyr)
df <- read.table(text = "
ID Sequence
ID_1 ATCGATTTCGAGGGCGTACG
ID_2 ATGCAGTAGCCCCATCGAGT
ID_3 ACGTAAAACGTCGAGAGAGT
ID_4 GAAGATCGTCGTCGTCGTCG
ID_5 ACTGTAGCTCGAAAGGGCCC
", header = TRUE)
df |> filter(grepl('(TCG){5}', Sequence))
#> ID Sequence
#> 1 ID_4 GAAGATCGTCGTCGTCGTCG