Creating new columns in R using parts of an existing column

Time:11-18

I am trying to create new columns using the information in an existing column:

For example, the column 'name' contains the value 0112200015-1_R2_001.fastq.gz. From this I would like to generate a column 'sample_id' containing 0112200015 (the first 10 digits), a column 'timepoint' containing 1 (from -1), and a column 'paired_end' containing 2 (from R2).

What would the correct code for this be?

CodePudding user response:

I assume you want to create a new data frame with this information.

I created a vector with values similar to the ones in your column, but you should use your actual column (for example df$name) instead.

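library(stringr)

# Example values standing in for the 'name' column
vector <- c("1234-1_R2_001.fastq.gz", "5678-1_R2_001.fastq.gz", "1928-1_R2_001.fastq.gz")

df <- data.frame(sample_id = str_replace(vector, "-.*$", ""),   # drop everything from the first "-" onwards
                 timepoint = str_extract(vector, "(?<=-)."),    # the character right after "-"
                 paired_end = str_extract(vector, "(?<=R)."))   # the character right after "R"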

All the str_* functions are from the stringr package.
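Since the question is about an existing column rather than a standalone vector, here is a minimal sketch of the same patterns applied directly to a data frame. The data frame and its second row are made up for illustration; only the 'name' column and the three target column names come from the question.

```r
library(stringr)

# Hypothetical data frame standing in for the asker's data
df <- data.frame(name = c("0112200015-1_R2_001.fastq.gz",
                          "0112200016-2_R1_001.fastq.gz"),
                 stringsAsFactors = FALSE)

# Everything before the first "-" is the sample id
df$sample_id  <- str_replace(df$name, "-.*$", "")
# The single character right after the first "-" is the timepoint
df$timepoint  <- str_extract(df$name, "(?<=-).")
# The single character right after the first "R" is the paired-end read
df$paired_end <- str_extract(df$name, "(?<=R).")

df
```

Note that these patterns take exactly one character after "-" and "R", so they only fit if timepoint and paired_end are always a single character.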

CodePudding user response:

This should give you the correct answer using dplyr and stringr in a tidy way. It assumes that timepoint and paired_end always consist of a single digit. If that is not the case, replacing "\\d{1}" with "\\d+" matches one or more digits instead.

library(dplyr)
library(stringr)

df <- 
  tibble(name = "0112200015-1_R2_001.fastq.gz")

df %>% 
         # Extract the 10 digit sample id
  mutate(sample_id = str_extract(name, pattern = "\\d{10}"),
         # Extract the 1 digit timepoint which comes after "-" and before the first "_"
         timepoint = str_extract(name, pattern = "(?<=-)\\d{1}(?=_)"),
         # Extract the 1 digit paired_end which comes after "_R"
         paired_end = str_extract(name, pattern = "(?<=_R)\\d{1}"))

# A tibble: 1 x 4
  name                         sample_id  timepoint paired_end
  <chr>                        <chr>      <chr>     <chr>     
1 0112200015-1_R2_001.fastq.gz 0112200015 1         2    
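If timepoint or paired_end can have more than one digit, the adjustment might look like the sketch below. The second file name, with a two-digit timepoint, is invented for illustration, and df_multi and result_multi are hypothetical names.

```r
library(dplyr)
library(stringr)

# Hypothetical file names; the second one has a two-digit timepoint
df_multi <- tibble(name = c("0112200015-1_R2_001.fastq.gz",
                            "0112200015-12_R1_001.fastq.gz"))

result_multi <- df_multi %>%
  mutate(sample_id  = str_extract(name, "\\d{10}"),
         # One or more digits between "-" and the following "_"
         timepoint  = str_extract(name, "(?<=-)\\d+(?=_)"),
         # One or more digits right after "_R"
         paired_end = str_extract(name, "(?<=_R)\\d+"))

result_multi
```

With "\\d{1}" the second row's timepoint would come back as NA, because the lookahead requires a "_" immediately after the single digit.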

CodePudding user response:

tidyr::extract

You can use extract from the tidyr package.

library(tidyr)

df %>%
  extract(name, c("sample_id", "timepoint", "paired_end"),
          regex = "^(\\d{10})-(\\d)_R(\\d)")

#>    sample_id timepoint paired_end
#> 1 0112200015         1          2

where df is:

df <- data.frame(name = "0112200015-1_R2_001.fastq.gz")

To tailor the solution to your needs, you should provide more examples, so that rare cases and exceptions can be handled.

Several regexes could work for you. This one, for example, extracts the first three numbers it finds, separated by non-numeric characters:

df %>%
  extract(name, c("sample_id", "timepoint", "paired_end"),
          regex = "^(\\d+)\\D+(\\d+)\\D+(\\d+)")

#>    sample_id timepoint paired_end
#> 1 0112200015         1          2
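To spot the rare cases mentioned above, one option is to keep the original column and look for rows where the pattern did not match: extract fills the new columns with NA for those rows. This is only a sketch; the "README.txt" row and the names df_mixed, parsed and unmatched are made up for illustration.

```r
library(tidyr)

# Hypothetical data: one well-formed name and one that does not fit the pattern
df_mixed <- data.frame(name = c("0112200015-1_R2_001.fastq.gz",
                                "README.txt"),
                       stringsAsFactors = FALSE)

# remove = FALSE keeps the original 'name' column next to the new ones
parsed <- tidyr::extract(df_mixed, name,
                         into = c("sample_id", "timepoint", "paired_end"),
                         regex = "^(\\d{10})-(\\d)_R(\\d)",
                         remove = FALSE)

# Rows whose name did not match the regex have NA in the new columns
unmatched <- parsed[is.na(parsed$sample_id), ]
unmatched
```

Inspecting unmatched on your real data should show which file names need a more permissive regex.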