Editing each row in a column in R

I have a data frame that looks like this:

Twin_Pair           zyg CDsumTwin1 CDsumTwin2
   <chr>             <int>      <dbl>      <dbl>
 1 pair1(2891,2892)      2          0          5
 2 pair2(4000,4001)      1          0          0
 3 pair3(4006,4007)      2          0          3
 4 pair4(4009,4010)      2          1          3
 5 pair5(4012,4013)      2          2          0
 6 pair6(4015,4016)      2          0          9
 7 pair7(4018,4019)      2          0          0
 8 pair8(4021,4022)      1          0          0
 9 pair9(4024,4025)      1          0          0
10 pair10(4027,4028)     2          2         17

How can I remove "pair1", "pair2", etc. from each row in the first column so that I am left with something like (4027,4028)? I know how to remove the first 5 characters, but the problem is that the prefix goes up to pair100, so its length varies. What would be an efficient way to do this?

CodePudding user response：

You can use a regular expression to match the prefix. Here ^pair anchors the match at the start of the string and [0-9]+ matches one or more digits, so pair100 is handled as well.

dat$Twin_Pair <- sub("^pair[0-9]+", "", dat$Twin_Pair)
dat
#      Twin_Pair zyg CDsumTwin1 CDsumTwin2
# 1  (2891,2892)   2          0          5
# 2  (4000,4001)   1          0          0
# 3  (4006,4007)   2          0          3
# 4  (4009,4010)   2          1          3
# 5  (4012,4013)   2          2          0
# 6  (4015,4016)   2          0          9
# 7  (4018,4019)   2          0          0
# 8  (4021,4022)   1          0          0
# 9  (4024,4025)   1          0          0
# 10 (4027,4028)   2          2         17

Data

dat <- read.table(text = "Twin_Pair           zyg CDsumTwin1 CDsumTwin2
 1 'pair1(2891,2892)'      2          0          5
 2 'pair2(4000,4001)'      1          0          0
 3 'pair3(4006,4007)'      2          0          3
 4 'pair4(4009,4010)'      2          1          3
 5 'pair5(4012,4013)'      2          2          0
 6 'pair6(4015,4016)'      2          0          9
 7 'pair7(4018,4019)'      2          0          0
 8 'pair8(4021,4022)'      1          0          0
 9 'pair9(4024,4025)'      1          0          0
10 'pair10(4027,4028)'     2          2         17",
                  header = TRUE)
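
To see why the quantifier matters, here is a minimal sketch on a few made-up labels (the vector `labels` and the pair100 entry are assumptions for illustration, not part of the original data). It compares the pattern with and without `+`:

```r
# Hypothetical labels, including a three-digit pair number
labels <- c("pair1(2891,2892)", "pair10(4027,4028)", "pair100(5000,5001)")

# "^pair" anchors the match at the start; "[0-9]+" consumes every digit that follows
with_plus <- sub("^pair[0-9]+", "", labels)

# Without "+", only one digit is removed and the remaining digits stay in the result
without_plus <- sub("^pair[0-9]", "", labels)

print(with_plus)
print(without_plus)
```

With the `+`, all three labels should reduce to just the parenthesised IDs; without it, the two- and three-digit pair numbers leave stray digits behind.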

CodePudding user response：

An option with trimws, where everything before the first "(" is treated as whitespace and trimmed from the left:

dat$Twin_Pair <- trimws(dat$Twin_Pair, whitespace = "[^(]+", which = 'left')

-output

> dat
     Twin_Pair zyg CDsumTwin1 CDsumTwin2
1  (2891,2892)   2          0          5
2  (4000,4001)   1          0          0
3  (4006,4007)   2          0          3
4  (4009,4010)   2          1          3
5  (4012,4013)   2          2          0
6  (4015,4016)   2          0          9
7  (4018,4019)   2          0          0
8  (4021,4022)   1          0          0
9  (4024,4025)   1          0          0
10 (4027,4028)   2          2         17
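
As a rough illustration of what the trimws call does, the sketch below runs it next to a plain sub() call on made-up labels (the `labels` vector is hypothetical). It assumes an R version whose trimws() accepts the `whitespace` argument (R 3.6.0 or later, as far as I know):

```r
# Hypothetical labels for illustration
labels <- c("pair1(2891,2892)", "pair100(5000,5001)")

# trimws() with a custom "whitespace" class: strip non-"(" characters from the left
via_trimws <- trimws(labels, whitespace = "[^(]+", which = "left")

# The same idea written directly with sub(): drop everything before the first "("
via_sub <- sub("^[^(]+", "", labels)

print(via_trimws)
print(identical(via_trimws, via_sub))
```

Both approaches key on the opening parenthesis rather than on the word "pair", so they would also work if the prefix text changed.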

CodePudding user response：

We could use str_extract with the regex \(.*?\) (written '\\(.*?\\)' in an R string), which extracts the first pair of parentheses and everything between them:

library(stringr)
library(dplyr)

dat %>% 
  mutate(Twin_Pair = str_extract(Twin_Pair, '\\(.*?\\)'))

     Twin_Pair zyg CDsumTwin1 CDsumTwin2
1  (2891,2892)   2          0          5
2  (4000,4001)   1          0          0
3  (4006,4007)   2          0          3
4  (4009,4010)   2          1          3
5  (4012,4013)   2          2          0
6  (4015,4016)   2          0          9
7  (4018,4019)   2          0          0
8  (4021,4022)   1          0          0
9  (4024,4025)   1          0          0
10 (4027,4028)   2          2         17
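
If loading stringr is not desired, a similar extraction can be sketched in base R with regexpr() and regmatches(). The `labels` vector and its "no_parens" entry are made up for illustration. Unmatched entries are filled with NA here to mirror what str_extract returns when nothing matches:

```r
# Hypothetical labels, including one with no parentheses
labels <- c("pair1(2891,2892)", "pair10(4027,4028)", "no_parens")

# Locate the first "(...)" in each string; regexpr() returns -1 where there is no match
m <- regexpr("\\(.*?\\)", labels, perl = TRUE)

# regmatches() returns only the matched pieces, so place them back by position
extracted <- rep(NA_character_, length(labels))
extracted[m != -1] <- regmatches(labels, m)

print(extracted)
```

The explicit NA handling is needed because regmatches() silently drops elements that do not match, which would otherwise misalign the result with the rows of the data frame.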