I have a df that looks like this:
AF | GT | Sample_name |
---|---|---|
0.001 | 1/1 | path/to/sample/name/ID0001.vcf.gz |
0.005 | 0/1 | path/to/sample/name/ID0002.vcf.gz |
What I want is to only keep the ID name in the Sample_name column:
AF | GT | Sample_name |
---|---|---|
0.001 | 1/1 | ID0001 |
0.005 | 0/1 | ID0002 |
I would very much appreciate any help in achieving this.
CodePudding user response:
There are some built-in file name helpers that you can use here:
basename()
tools::file_path_sans_ext()
So in this example simply do:
library(tools)
df$Sample_name <- file_path_sans_ext(basename(df$Sample_name), compression = TRUE)
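To see what each helper contributes, the call can be broken into steps. This is only a minimal sketch on a single path copied from the question; the variable names `p`, `fname` and `id` are made up for illustration.

```r
library(tools)

# One path copied from the question, used here only for illustration
p <- "path/to/sample/name/ID0001.vcf.gz"

# basename() drops the directory part
fname <- basename(p)                     # "ID0001.vcf.gz"

# Without compression = TRUE only the last extension (.gz) is removed
partial <- file_path_sans_ext(fname)     # "ID0001.vcf"

# With compression = TRUE the .gz suffix is stripped first, then .vcf
id <- file_path_sans_ext(fname, compression = TRUE)   # "ID0001"
```

The `compression = TRUE` argument is what makes a double extension such as `.vcf.gz` disappear in one call.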
CodePudding user response:
You can use a regex pattern with gsub():
gsub(".*(ID\\d*).*", replacement = "\\1", x = "path/to/sample/name/ID0001.vcf.gz")
#> "ID0001"
Across your dataframe:
df$sample_name2 <- gsub(".*(ID\\d*).*", replacement = "\\1", x = df$Sample_name)
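One thing to keep in mind with a gsub() approach is that it returns its input unchanged when the pattern does not match, so a path without an ID would keep its full path. A minimal sketch, assuming a hypothetical second path (`control.vcf.gz`) that is not in the question's data:

```r
# The second path is hypothetical and only illustrates a non-matching value
x <- c("path/to/sample/name/ID0001.vcf.gz",
       "path/to/sample/name/control.vcf.gz")

# First element becomes "ID0001"; the second comes back as the full path
replaced <- gsub(".*(ID\\d*).*", "\\1", x)

# Optionally mark non-matching elements as NA instead of keeping the path
ids <- ifelse(grepl("ID\\d+", x),
              sub(".*(ID\\d+).*", "\\1", x),
              NA_character_)
```

Whether NA or the untouched path is the better fallback depends on how the column is used afterwards.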
CodePudding user response:
Here is a tidyverse solution. Note that this only works if the ID string always consists of ID followed by exactly four digits:
library(dplyr)
library(stringr)
df %>%
  mutate(Sample_name = str_extract(Sample_name, 'ID\\d{4}'))
AF GT Sample_name
1 0.001 1/1 ID0001
2 0.005 0/1 ID0002
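If the number of digits can vary, the pattern can probably be relaxed. A sketch under that assumption, using `ID\\d+` (one or more digits); the five-digit path below is hypothetical and not from the question:

```r
library(stringr)

# The second path is hypothetical: an ID with five digits
x <- c("path/to/sample/name/ID0001.vcf.gz",
       "path/to/sample/name/ID12345.vcf.gz")

# The fixed-width pattern cuts the longer ID after four digits
fixed <- str_extract(x, "ID\\d{4}")   # "ID0001" "ID1234"

# One or more digits keeps the full ID
ids <- str_extract(x, "ID\\d+")       # "ID0001" "ID12345"
```

Note that str_extract() returns NA for elements where the pattern does not match at all.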
CodePudding user response:
Using sub() with basename() to take the sample name:
df$Sample_name <- sub('\\..*$', '', basename(df$Sample_name))
df
Output:
AF GT Sample_name
1 0.001 1/1 ID0001
2 0.005 0/1 ID0002
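Because the pattern `\\..*$` removes everything from the first dot onward, a sample name that itself contains a dot would be cut short. If that can happen, one option is to strip only the known suffix. A sketch assuming every file ends in `.vcf.gz`; the dotted file name below is hypothetical:

```r
# Hypothetical file name whose ID part contains a dot
x <- "path/to/sample/name/ID0001.v2.vcf.gz"

# Removing from the first dot loses the ".v2" part
short <- sub("\\..*$", "", basename(x))        # "ID0001"

# Anchoring on the literal suffix keeps it
full <- sub("\\.vcf\\.gz$", "", basename(x))   # "ID0001.v2"
```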
Data
df <- data.frame(AF = c(0.001, 0.005),
                 GT = c("1/1", "0/1"),
                 Sample_name = c("path/to/sample/name/ID0001.vcf.gz",
                                 "path/to/sample/name/ID0002.vcf.gz"))