Removing text inside the bracket from specific rows in dataframe

Time:09-24

This is my sample data frame:

dput(aa)
structure(list(V4 = structure(1:22, .Label = c("Peak228404", 
"Peak228411", "Peak228413", "Peak228423", "Peak228424", "Peak228439", 
"Peak228461", "Peak228476", "Peak228479", "Peak228495", "Peak228528", 
"Peak228553", "Peak228603", "Peak228612", "Peak228629", "Peak228630", 
"Peak228642", "Peak228651", "Peak228691", "Peak228740", "Peak4983", 
"Peak5261"), class = "factor"), annotation = structure(c(1L, 
4L, 5L, 1L, 1L, 1L, 6L, 8L, 1L, 1L, 1L, 1L, 1L, 1L, 8L, 8L, 8L, 
8L, 7L, 8L, 2L, 3L), .Label = c("Distal Intergenic", "Downstream (1-2kb)", 
"Downstream (2-3kb)", "Exon (ENST00000370460.6/2334, exon 16 of 21)", 
"Exon (ENST00000370460.6/2334, exon 21 of 21)", "Exon (ENST00000616857.4/84548, exon 3 of 3)", 
"Exon (ENST00000620118.4/ENST00000620118.4, exon 3 of 4)", "Promoter"
), class = "factor"), Output_required = structure(c(1L, 5L, 5L, 
1L, 1L, 1L, 5L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 6L, 6L, 6L, 4L, 
6L, 2L, 3L), .Label = c("Distal Intergenic", "Downstream (1-2kb)", 
"Downstream (2-3kb)", "Exon", "Exon ", "Promoter"), class = "factor")), class = "data.frame", row.names = c(NA, 
-22L))

It prints as:

           V4                                              annotation    Output_required
1  Peak228404                                       Distal Intergenic  Distal Intergenic
2  Peak228411            Exon (ENST00000370460.6/2334, exon 16 of 21)              Exon 
3  Peak228413            Exon (ENST00000370460.6/2334, exon 21 of 21)              Exon 
4  Peak228423                                       Distal Intergenic  Distal Intergenic
5  Peak228424                                       Distal Intergenic  Distal Intergenic
6  Peak228439                                       Distal Intergenic  Distal Intergenic
7  Peak228461             Exon (ENST00000616857.4/84548, exon 3 of 3)              Exon 
8  Peak228476                                                Promoter           Promoter
9  Peak228479                                       Distal Intergenic  Distal Intergenic
10 Peak228495                                       Distal Intergenic  Distal Intergenic
11 Peak228528                                       Distal Intergenic  Distal Intergenic
12 Peak228553                                       Distal Intergenic  Distal Intergenic
13 Peak228603                                       Distal Intergenic  Distal Intergenic
14 Peak228612                                       Distal Intergenic  Distal Intergenic
15 Peak228629                                                Promoter           Promoter
16 Peak228630                                                Promoter           Promoter
17 Peak228642                                                Promoter           Promoter
18 Peak228651                                                Promoter           Promoter
19 Peak228691 Exon (ENST00000620118.4/ENST00000620118.4, exon 3 of 4)               Exon
20 Peak228740                                                Promoter           Promoter
21   Peak4983                                      Downstream (1-2kb) Downstream (1-2kb)
22   Peak5261                                      Downstream (2-3kb) Downstream (2-3kb)

In this data frame, some rows of the annotation column contain the string Exon followed by text inside brackets. I don't want that bracketed text: to keep the column consistent, those rows should read just Exon. I have added another column, Output_required, which shows my desired final output.

Any suggestion or help would be really appreciated.

CodePudding user response:

Removing everything after 'Exon' can be done with a lookbehind regex.

sub('(?<=Exon).*', '', aa$annotation, perl = TRUE)

# [1] "Distal Intergenic"  "Exon"               "Exon"               "Distal Intergenic" 
# [5] "Distal Intergenic"  "Distal Intergenic"  "Exon"               "Promoter"          
# [9] "Distal Intergenic"  "Distal Intergenic"  "Distal Intergenic"  "Distal Intergenic" 
#[13] "Distal Intergenic"  "Distal Intergenic"  "Promoter"           "Promoter"          
#[17] "Promoter"           "Promoter"           "Exon"               "Promoter"          
#[21] "Downstream (1-2kb)" "Downstream (2-3kb)"

Similarly, stringr::str_remove can be used as well.

stringr::str_remove(aa$annotation, '(?<=Exon).*')
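To change the column itself rather than just print the result, the output of sub can be assigned back to it. The following is a minimal sketch on a hypothetical four-row data frame named demo (a cut-down stand-in for aa, not part of the original question). Note that the column becomes a character vector after the assignment; wrap the call in factor() if a factor is needed.

```r
# Hypothetical mini data frame mirroring a few rows of `aa`
demo <- data.frame(
  V4 = c("Peak228404", "Peak228411", "Peak228691", "Peak4983"),
  annotation = c(
    "Distal Intergenic",
    "Exon (ENST00000370460.6/2334, exon 16 of 21)",
    "Exon (ENST00000620118.4/ENST00000620118.4, exon 3 of 4)",
    "Downstream (1-2kb)"
  ),
  stringsAsFactors = TRUE
)

# sub() coerces the factor to character; the lookbehind keeps "Exon"
# and deletes everything that follows it in the same string.
# Strings without "Exon" do not match and are returned unchanged.
demo$annotation <- sub("(?<=Exon).*", "", demo$annotation, perl = TRUE)

print(demo$annotation)
```

Rows such as Downstream (1-2kb) keep their brackets because the pattern only matches text that comes after Exon.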

CodePudding user response:

Another way to achieve your goal is by using a backreference:

sub("(Exon)(.*)", "\\1", aa$annotation)

Here we partition the strings into two capturing groups:

  • (Exon): this group literally captures Exon
  • (.*): this group captures everything else
  • \\1: this backreference, used in the replacement argument to sub, recalls the first capture group but not the second, thereby effectively removing the second.
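The capture-group approach above can be tried on a small vector. The sketch below uses a hypothetical vector x with three values taken from the sample data. It also shows a stricter variant, which is an assumption and not from either answer: it anchors the pattern so that only a trailing bracketed part on strings starting with Exon is removed.

```r
# Hypothetical sample values copied from the annotation column
x <- c(
  "Exon (ENST00000616857.4/84548, exon 3 of 3)",
  "Promoter",
  "Downstream (2-3kb)"
)

# Backreference form from the answer: keep group 1, drop group 2
res_backref <- sub("(Exon)(.*)", "\\1", x)

# Stricter variant (an assumption, not from the answers): require the
# string to start with "Exon" and end with a bracketed part
res_anchored <- sub("^(Exon) *\\(.*\\)$", "\\1", x)

print(res_backref)
print(res_anchored)
```

On this data both calls give the same result. The anchored form would matter only if Exon could appear in the middle of another annotation value.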