Extract word from string and create new column in R

Time:01-22

My data looks like this:

try=data.frame("histones"= c("encode3Ren_limb_H3K27me3_E10","encode3Ren_facial_prominence_H3K27me3_E10", "encode3Ren_liver_H3K27me3_E12", "encode3Ren_neural_tube_H3K27me3_E14", "encode3Ren_neural_tube_H3K4me1_E12" ,"encode3Ren_neural_tube_H3K27me3_E11", "encode3Ren_neural_tube_H3K4me1_E15", "encode3Ren_neural_tube_H3K4me2_E13" ), "a"= c(1,2,3,4,5,6,7,8))

try
                                   histones a
1              encode3Ren_limb_H3K27me3_E10 1
2 encode3Ren_facial_prominence_H3K27me3_E10 2
3             encode3Ren_liver_H3K27me3_E12 3
4       encode3Ren_neural_tube_H3K27me3_E14 4
5        encode3Ren_neural_tube_H3K4me1_E12 5
6       encode3Ren_neural_tube_H3K27me3_E11 6
7        encode3Ren_neural_tube_H3K4me1_E15 7
8        encode3Ren_neural_tube_H3K4me2_E13 8

I would like to extract only the histone mark (e.g. H3K27me3, H3K4me2) from the column "histones" and put it in a new column. I'm not able to use regular expressions, so any help is much appreciated.

CodePudding user response:

Take a look at str_extract() from stringr, used here with mutate() from dplyr (load both packages first):

try %>% mutate(hist = str_extract(histones, '\\w\\d\\w\\d+.*\\d(?=\\_)'))

Created on 2023-01-21 with reprex v2.0.2

                                   histones a     hist
1              encode3Ren_limb_H3K27me3_E10 1 H3K27me3
2 encode3Ren_facial_prominence_H3K27me3_E10 2 H3K27me3
3             encode3Ren_liver_H3K27me3_E12 3 H3K27me3
4       encode3Ren_neural_tube_H3K27me3_E14 4 H3K27me3
5        encode3Ren_neural_tube_H3K4me1_E12 5  H3K4me1
6       encode3Ren_neural_tube_H3K27me3_E11 6 H3K27me3
7        encode3Ren_neural_tube_H3K4me1_E15 7  H3K4me1
8        encode3Ren_neural_tube_H3K4me2_E13 8  H3K4me2

CodePudding user response:

A base R option using gsub

cbind(try, mod = gsub(".*_(H\\d+)|_[Ee]\\d+$", "\\1", try$histones))
                                   histones a      mod
1              encode3Ren_limb_H3K27me3_E10 1 H3K27me3
2 encode3Ren_facial_prominence_H3K27me3_E10 2 H3K27me3
3             encode3Ren_liver_H3K27me3_E12 3 H3K27me3
4       encode3Ren_neural_tube_H3K27me3_E14 4 H3K27me3
5        encode3Ren_neural_tube_H3K4me1_E12 5  H3K4me1
6       encode3Ren_neural_tube_H3K27me3_E11 6 H3K27me3
7        encode3Ren_neural_tube_H3K4me1_E15 7  H3K4me1
8        encode3Ren_neural_tube_H3K4me2_E13 8  H3K4me2

CodePudding user response:

Regular expressions are actually a good choice here. With str_extract() from stringr:

try$mark <- str_extract(try$histones, "(?<=_)H\\d+K\\d+\\w+?(?=_)")

If you really can't use regex for some reason, here is an option using base R string functions:

x <- "encode3Ren_facial_prominence_H3K27me3_E10"
mark <- tail(unlist(strsplit(x, "_")), 2)[-2]
mark

[1] "H3K27me3"
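
The snippet above handles a single string. To fill a new column for the whole data frame, the same idea can be applied to every row. The following is only a sketch: it assumes the histone mark is always the second-to-last underscore-separated field, which holds for the example data but may not hold for other names. The column name mark and the helper variable parts are chosen here for illustration.

```r
# Example data from the question
try <- data.frame(
  histones = c("encode3Ren_limb_H3K27me3_E10",
               "encode3Ren_facial_prominence_H3K27me3_E10",
               "encode3Ren_liver_H3K27me3_E12",
               "encode3Ren_neural_tube_H3K27me3_E14",
               "encode3Ren_neural_tube_H3K4me1_E12",
               "encode3Ren_neural_tube_H3K27me3_E11",
               "encode3Ren_neural_tube_H3K4me1_E15",
               "encode3Ren_neural_tube_H3K4me2_E13"),
  a = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Split every name on the literal underscore (fixed = TRUE means no regex).
# as.character() guards against the column being a factor in older R versions.
parts <- strsplit(as.character(try$histones), "_", fixed = TRUE)

# Assumption: the mark is the second-to-last piece of each name.
try$mark <- vapply(parts, function(p) p[length(p) - 1], character(1))

try
```

For the example data this gives the same marks as the regex-based answers. If some names had a different number of trailing fields, the position-based lookup would pick the wrong piece, so a pattern-based approach would be the safer choice there.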