My data looks like this:
try=data.frame("histones"= c("encode3Ren_limb_H3K27me3_E10","encode3Ren_facial_prominence_H3K27me3_E10", "encode3Ren_liver_H3K27me3_E12", "encode3Ren_neural_tube_H3K27me3_E14", "encode3Ren_neural_tube_H3K4me1_E12" ,"encode3Ren_neural_tube_H3K27me3_E11", "encode3Ren_neural_tube_H3K4me1_E15", "encode3Ren_neural_tube_H3K4me2_E13" ), "a"= c(1,2,3,4,5,6,7,8))
try
histones a
1 encode3Ren_limb_H3K27me3_E10 1
2 encode3Ren_facial_prominence_H3K27me3_E10 2
3 encode3Ren_liver_H3K27me3_E12 3
4 encode3Ren_neural_tube_H3K27me3_E14 4
5 encode3Ren_neural_tube_H3K4me1_E12 5
6 encode3Ren_neural_tube_H3K27me3_E11 6
7 encode3Ren_neural_tube_H3K4me1_E15 7
8 encode3Ren_neural_tube_H3K4me2_E13 8
I would like to extract only the histone mark (e.g. H3K27me3, H3K4me2) from the column "histones" and put it in a new column. I'm not comfortable with regular expressions, so any help is much appreciated.
CodePudding user response:
You can use str_extract() from stringr:
library(dplyr); library(stringr)
try %>% mutate(hist = str_extract(histones, '\\w\\d\\w\\d+.*\\d(?=\\_)'))
Created on 2023-01-21 with reprex v2.0.2
histones a hist
1 encode3Ren_limb_H3K27me3_E10 1 H3K27me3
2 encode3Ren_facial_prominence_H3K27me3_E10 2 H3K27me3
3 encode3Ren_liver_H3K27me3_E12 3 H3K27me3
4 encode3Ren_neural_tube_H3K27me3_E14 4 H3K27me3
5 encode3Ren_neural_tube_H3K4me1_E12 5 H3K4me1
6 encode3Ren_neural_tube_H3K27me3_E11 6 H3K27me3
7 encode3Ren_neural_tube_H3K4me1_E15 7 H3K4me1
8 encode3Ren_neural_tube_H3K4me2_E13 8 H3K4me2
CodePudding user response:
A base R option using gsub:
cbind(try, mod = gsub(".*_(H\\d+)|_[Ee]\\d+$", "\\1", try$histones))
histones a mod
1 encode3Ren_limb_H3K27me3_E10 1 H3K27me3
2 encode3Ren_facial_prominence_H3K27me3_E10 2 H3K27me3
3 encode3Ren_liver_H3K27me3_E12 3 H3K27me3
4 encode3Ren_neural_tube_H3K27me3_E14 4 H3K27me3
5 encode3Ren_neural_tube_H3K4me1_E12 5 H3K4me1
6 encode3Ren_neural_tube_H3K27me3_E11 6 H3K27me3
7 encode3Ren_neural_tube_H3K4me1_E15 7 H3K4me1
8 encode3Ren_neural_tube_H3K4me2_E13 8 H3K4me2
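If deleting everything around the mark feels indirect, a different base R route is to extract the match directly with regexpr() and regmatches(). This is only a sketch: the pattern below is my assumption about what a mark looks like (H, digits, K, digits, lowercase letters, optional digits), and the vector x is a shortened copy of the question's data.

```r
# Example names taken from the question
x <- c("encode3Ren_limb_H3K27me3_E10",
       "encode3Ren_neural_tube_H3K4me1_E12",
       "encode3Ren_neural_tube_H3K4me2_E13")

# Assumed shape of a mark: H<digits>K<digits><lowercase letters><optional digits>
pattern <- "H[0-9]+K[0-9]+[a-z]+[0-9]*"

# regexpr() locates the first match in each string; regmatches() pulls it out.
# Note: strings without a match are dropped from the result, so check the
# length before assigning it to a data frame column.
marks <- regmatches(x, regexpr(pattern, x))
marks
```

Because non-matching strings are silently dropped, this is safest when every name is known to contain a mark.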
CodePudding user response:
Regular expressions are actually a good fit here:
library(stringr)
try$mark <- str_extract(try$histones, "(?<=_)H\\d+K\\d+\\w+?(?=_)")
If you really can't use regex for some reason, here is an option using base R string functions:
x <- "encode3Ren_facial_prominence_H3K27me3_E10"
mark <- tail(unlist(strsplit(x, "_")), 2)[-2]
mark
[1] "H3K27me3"
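The snippet above handles a single string. To apply the same split-based idea to the whole column, one possible sketch is shown below. It assumes the mark is always the second-to-last underscore-separated field, as in the example data; the data frame here is a shortened copy of the one in the question.

```r
# Example data, shortened from the question
try <- data.frame(
  histones = c("encode3Ren_limb_H3K27me3_E10",
               "encode3Ren_facial_prominence_H3K27me3_E10",
               "encode3Ren_neural_tube_H3K4me1_E12"),
  a = 1:3
)

# Split each name on "_" (as.character() guards against factor columns
# in older R versions).
parts <- strsplit(as.character(try$histones), "_", fixed = TRUE)

# Keep the second-to-last piece of each name.
# Assumption: every name ends in "<mark>_<stage>".
try$mark <- vapply(parts, function(p) p[length(p) - 1], character(1))
try$mark
```

vapply() is used instead of sapply() so the result is guaranteed to be a character vector with one element per row.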