Extract the part of a string after ~ for all rows in a specific column of a data frame

Time:09-19

I have the following data set:

          Category                                             Term Count
1 GOTERM_MF_DIRECT                      GO:0005102~receptor binding     3
2 GOTERM_MF_DIRECT GO:0008139~nuclear localization sequence binding     2
3 GOTERM_CC_DIRECT        GO:0016021~integral component of membrane     9
4 GOTERM_CC_DIRECT                         GO:0071564~npBAF complex     3

I want to keep only the MF/CC part in the first column and, in the second column (Term), extract the text that follows "~" (to exclude GO:001..). I can do it with loops, but is there a more elegant way to achieve this? Thanks in advance!

CodePudding user response:

You could do

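library(dplyr)

df %>%
  # characters 8-9 of "GOTERM_MF_DIRECT" hold the ontology code ("MF" or "CC")
  mutate(Category = substr(Category, 8, 9),
         # lazily drop everything up to and including the first "~"
         Term = stringr::str_remove(Term, "(.*?)~"))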

Output:

  Category Term                                  Count
  <chr>    <chr>                                 <dbl>
1 MF       receptor binding                          3
2 MF       nuclear localization sequence binding     2
3 CC       integral component of membrane            9
4 CC       npBAF complex                             3

Data:

df <- tibble::tribble(
  ~Category, ~Term, ~Count,
  "GOTERM_MF_DIRECT", "GO:0005102~receptor binding", 3,
  "GOTERM_MF_DIRECT", "GO:0008139~nuclear localization sequence binding", 2,
  "GOTERM_CC_DIRECT", "GO:0016021~integral component of membrane", 9,
  "GOTERM_CC_DIRECT", "GO:0071564~npBAF complex", 3
)
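
If you would rather avoid the dplyr/stringr dependency, the same idea can be sketched in base R with sub(). This is only a sketch under two assumptions that go beyond the question: every Category value has the form GOTERM_XX_DIRECT, and every Term contains at least one "~". The data frame is rebuilt with data.frame() so the snippet runs on its own.

```r
# Base R sketch; `df` is rebuilt here so the block is self-contained.
df <- data.frame(
  Category = c("GOTERM_MF_DIRECT", "GOTERM_MF_DIRECT",
               "GOTERM_CC_DIRECT", "GOTERM_CC_DIRECT"),
  Term = c("GO:0005102~receptor binding",
           "GO:0008139~nuclear localization sequence binding",
           "GO:0016021~integral component of membrane",
           "GO:0071564~npBAF complex"),
  Count = c(3, 2, 9, 3)
)

# Assumption: Category always looks like GOTERM_<code>_DIRECT.
# Capture the middle piece by pattern rather than by character position.
df$Category <- sub("^GOTERM_([A-Z]+)_DIRECT$", "\\1", df$Category)

# Remove everything up to and including the first "~".
# Values without a "~" are left unchanged by sub().
df$Term <- sub("^[^~]*~", "", df$Term)

df
```

Matching by pattern instead of substr() positions keeps working if the prefix length ever changes, at the cost of silently leaving non-matching values untouched.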