This is my sample data frame:
dput(aa)
structure(list(V4 = structure(1:22, .Label = c("Peak228404",
"Peak228411", "Peak228413", "Peak228423", "Peak228424", "Peak228439",
"Peak228461", "Peak228476", "Peak228479", "Peak228495", "Peak228528",
"Peak228553", "Peak228603", "Peak228612", "Peak228629", "Peak228630",
"Peak228642", "Peak228651", "Peak228691", "Peak228740", "Peak4983",
"Peak5261"), class = "factor"), annotation = structure(c(1L,
4L, 5L, 1L, 1L, 1L, 6L, 8L, 1L, 1L, 1L, 1L, 1L, 1L, 8L, 8L, 8L,
8L, 7L, 8L, 2L, 3L), .Label = c("Distal Intergenic", "Downstream (1-2kb)",
"Downstream (2-3kb)", "Exon (ENST00000370460.6/2334, exon 16 of 21)",
"Exon (ENST00000370460.6/2334, exon 21 of 21)", "Exon (ENST00000616857.4/84548, exon 3 of 3)",
"Exon (ENST00000620118.4/ENST00000620118.4, exon 3 of 4)", "Promoter"
), class = "factor"), Output_required = structure(c(1L, 5L, 5L,
1L, 1L, 1L, 5L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 6L, 6L, 6L, 4L,
6L, 2L, 3L), .Label = c("Distal Intergenic", "Downstream (1-2kb)",
"Downstream (2-3kb)", "Exon", "Exon ", "Promoter"), class = "factor")), class = "data.frame", row.names = c(NA,
-22L))
Printed, it looks like this:
   V4          annotation                                               Output_required
 1 Peak228404  Distal Intergenic                                        Distal Intergenic
 2 Peak228411  Exon (ENST00000370460.6/2334, exon 16 of 21)             Exon
 3 Peak228413  Exon (ENST00000370460.6/2334, exon 21 of 21)             Exon
 4 Peak228423  Distal Intergenic                                        Distal Intergenic
 5 Peak228424  Distal Intergenic                                        Distal Intergenic
 6 Peak228439  Distal Intergenic                                        Distal Intergenic
 7 Peak228461  Exon (ENST00000616857.4/84548, exon 3 of 3)              Exon
 8 Peak228476  Promoter                                                 Promoter
 9 Peak228479  Distal Intergenic                                        Distal Intergenic
10 Peak228495  Distal Intergenic                                        Distal Intergenic
11 Peak228528  Distal Intergenic                                        Distal Intergenic
12 Peak228553  Distal Intergenic                                        Distal Intergenic
13 Peak228603  Distal Intergenic                                        Distal Intergenic
14 Peak228612  Distal Intergenic                                        Distal Intergenic
15 Peak228629  Promoter                                                 Promoter
16 Peak228630  Promoter                                                 Promoter
17 Peak228642  Promoter                                                 Promoter
18 Peak228651  Promoter                                                 Promoter
19 Peak228691  Exon (ENST00000620118.4/ENST00000620118.4, exon 3 of 4)  Exon
20 Peak228740  Promoter                                                 Promoter
21 Peak4983    Downstream (1-2kb)                                       Downstream (1-2kb)
22 Peak5261    Downstream (2-3kb)                                       Downstream (2-3kb)
In this data frame, some rows of the annotation column contain the string Exon followed by extra text in parentheses. I don't want that extra text: to keep the values consistent, those rows should read just Exon. I have added another column, Output_required, which shows my desired final output.
Any suggestion or help would be really appreciated.
CodePudding user response:
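Removing everything after 'Exon' can be done with a lookbehind regex. The pattern (?<=Exon).* matches all text that is preceded by Exon, and sub replaces that text with an empty string. Base R needs perl = TRUE for lookbehind to work.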
sub('(?<=Exon).*', '', aa$annotation, perl = TRUE)
# [1] "Distal Intergenic" "Exon" "Exon" "Distal Intergenic"
# [5] "Distal Intergenic" "Distal Intergenic" "Exon" "Promoter"
# [9] "Distal Intergenic" "Distal Intergenic" "Distal Intergenic" "Distal Intergenic"
#[13] "Distal Intergenic" "Distal Intergenic" "Promoter" "Promoter"
#[17] "Promoter" "Promoter" "Exon" "Promoter"
#[21] "Downstream (1-2kb)" "Downstream (2-3kb)"
Similarly, stringr::str_remove can be used with the same pattern; its regex engine supports lookbehind without any extra argument.
stringr::str_remove(aa$annotation, '(?<=Exon).*')
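To store the cleaned values back in the data frame, something along these lines should work. This is only a minimal sketch: aa_demo is a hypothetical four-row stand-in for aa, and annotation_clean is an illustrative column name that does not appear in the original data.

```r
# Hypothetical stand-in for the `aa` data frame from the question
aa_demo <- data.frame(
  annotation = c("Distal Intergenic",
                 "Exon (ENST00000370460.6/2334, exon 16 of 21)",
                 "Promoter",
                 "Downstream (1-2kb)"),
  stringsAsFactors = TRUE
)

# sub() coerces the factor to character and returns a character vector;
# the lookbehind drops everything that follows "Exon"
cleaned <- sub("(?<=Exon).*", "", aa_demo$annotation, perl = TRUE)

# Store the result as a new factor column (illustrative name)
aa_demo$annotation_clean <- factor(cleaned)

levels(aa_demo$annotation_clean)
```

Because sub returns a character vector, wrapping the result in factor() is only needed if the column should stay a factor, as it is in the dput output above.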
CodePudding user response:
Another way to achieve your goal is to use a backreference:
sub("(Exon)(.*)", "\\1", aa$annotation)
Here we partition the matched part of each string into two capturing groups and refer back to the first one in the replacement:
(Exon): this group captures the literal string Exon
(.*): this group captures everything that follows
\\1: this backreference, used in the replacement argument to sub, "recollects" the first capture group but not the second, thereby effectively removing it
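The effect of the backreference can be checked on a small vector. The sketch below uses a hypothetical vector x rather than the full aa data, and compares the two-group version with a variant that has no groups at all and simply replaces the whole match with the literal Exon; the variant is an alternative added for illustration, not part of the answer above.

```r
# Hypothetical sample values taken from the annotation column
x <- c("Exon (ENST00000616857.4/84548, exon 3 of 3)",
       "Promoter",
       "Distal Intergenic")

# Two capture groups; the replacement keeps only the first
with_backref <- sub("(Exon)(.*)", "\\1", x)

# No groups: replace "Exon" plus everything after it with the literal "Exon"
with_literal <- sub("Exon.*", "Exon", x)

# Both approaches should agree on this input
identical(with_backref, with_literal)
```

Strings that do not contain Exon are left untouched in both cases, because sub only rewrites the part of a string that the pattern matches.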