I am trying to create new columns using the information in an existing column:
eg. the column 'name' contains the following value: 0112200015-1_R2_001.fastq.gz
. From this I would like to generate a column 'sample_id' containing 0112200015
(first 10 digits), a column 'timepoint' containing 1
(from -1) and a column 'paired_end' containing 2
(from R2)
What would the correct code for this be?
CodePudding user response:
I assume you want to create a new data frame with this information.
I created a vector with values similar to your column names, but you sould be using the colnames
output
vector <- c("1234-1_R2_001.fastq.gz", "5678-1_R2_001.fastq.gz", "1928-1_R2_001.fastq.gz")
df <- data.frame(sample_id = str_replace(vector, "-.*$", ""),
timepoint = str_extract(vector, "(?<=-)."),
paired_end = str_extract(vector, "(?<=R)."))
all the str
functions are from the stringr
package.
CodePudding user response:
This should give you the correct answer using dplyr
and stringr
in a tidy way. It is based on the assumption that the timepoint and paired_end always consist of one digit. If this is not the case, the small adjustment of replacing "\\d{1}"
by "\\d "
returns one or multiple digits, depending on the actual value.
library(dplyr)
library(stringr)
df <-
tibble(name = "0112200015-1_R2_001.fastq.gz")
df %>%
# Extract the 10 digit sample id
mutate(sample_id = str_extract(name, pattern = "\\d{10}"),
# Extract the 1 digit timepoint which comes after "-" and before the first "_"
timepoint = str_extract(name, pattern = "(?<=-)\\d{1}(?=_)"),
# Extract the 1 digit paired_end which comes after "_R"
paired_end = str_extract(name, pattern = "(?<=_R)\\d{1}"))
# A tibble: 1 x 4
name sample_id timepoint paired_end
<chr> <chr> <chr> <chr>
1 0112200015-1_R2_001.fastq.gz 0112200015 1 2
CodePudding user response:
tidyr::extract
You can use extract
from tidyr
package.
library(tidyr)
df %>%
extract(name, c("sample_id", "timepoint", "paired_end"),
regex = "^(\\d{10})-(\\d)_R(\\d)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2
where df
is:
df <- data.frame(name = "0112200015-1_R2_001.fastq.gz")
To make the solution more tailored to your needs, you should provide more examples, so to handle rare cases and exceptions.
A few regex can work for you. This one for example extracts the first 3 numbers it finds between non-numeric separators:
df %>%
extract(name, c("sample_id", "timepoint", "paired_end"),
regex = "^(\\d )\\D (\\d )\\D (\\d )")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2