I'm very new to the regex world and would like to know how to extract strings using regex from a bunch of file names I've imported to R. My files follow the general format of:
testing1_010000.csv
check3_012000.csv
testing_checking_045880.csv
test_check2_350000.csv
And I'd like to extract everything before the six digits and ".csv", including the "_", to get something like:
testing1_
check3_
testing_checking_
test_check2_
If it helps, the pattern I essentially want to remove will always be 6 numbers immediately followed by .csv.
Any help would be great, thank you!
CodePudding user response:
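There are a few ways you could go about this. For example, match anything before a string of six digits followed by ".csv". For this one you would want to get the first capturing group. Note that the dot before "csv" is escaped so that it matches a literal "." rather than any character.
(.*)\d{6}\.csv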
https://regex101.com/r/MPH6mE/1/
Or match everything up to the last underscore character. For this one you would want the whole match.
.*_
https://regex101.com/r/4GFPIA/1
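Since the question is about R, here is a minimal sketch of how the two patterns above might be applied with base R. The variable names (`files`, `prefix_group`, `prefix_match`) are just illustrative, and backslashes have to be doubled inside R string literals.

```r
files <- c("testing1_010000.csv", "check3_012000.csv",
           "testing_checking_045880.csv", "test_check2_350000.csv")

# Pattern 1: replace the whole file name with the first capturing group,
# i.e. everything before the six digits and ".csv".
prefix_group <- sub("(.*)\\d{6}\\.csv$", "\\1", files)

# Pattern 2: take the whole (greedy) match up to the last underscore.
prefix_match <- regmatches(files, regexpr(".*_", files))

print(prefix_group)
print(prefix_match)
```

Both approaches should give the same prefixes for the sample names. The second one assumes every file name contains at least one underscore; `regmatches()` silently drops elements that do not match.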
CodePudding user response:
Files = c("testing1_010000.csv", "check3_012000.csv",
"testing_checking_045880.csv", "test_check2_350000.csv")
sub("(.*_)[[:digit:]]{6}.*", "\\1", Files)
[1] "testing1_" "check3_" "testing_checking_"
[4] "test_check2_"
CodePudding user response:
Using nchar:
Files = c("testing1_010000.csv", "check3_012000.csv",
"testing_checking_045880.csv", "test_check2_350000.csv")
substr(Files, 1, nchar(Files)-10)
[1] "testing1_" "check3_" "testing_checking_"
[4] "test_check2_"
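The `nchar` approach relies on the suffix always being exactly 10 characters long (six digits plus ".csv"). As a hedged alternative under the same assumption about the file names, the suffix could also be removed by pattern, which would leave names that do not end in six digits and ".csv" untouched rather than truncating them. The names `by_length` and `by_pattern` are only for illustration.

```r
files <- c("testing1_010000.csv", "check3_012000.csv",
           "testing_checking_045880.csv", "test_check2_350000.csv")

# Drop the last 10 characters (6 digits + ".csv"), as in the answer above.
by_length <- substr(files, 1, nchar(files) - 10)

# Remove the same suffix by pattern instead of by position.
by_pattern <- sub("\\d{6}\\.csv$", "", files)

# For these sample names the two approaches agree.
print(identical(by_length, by_pattern))
```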
CodePudding user response:
We can use stringr::str_match(). It will also work when the number of digits is something other than six.
library(tidyverse)
files <- c("testing1_010000.csv", "check3_012000.csv", "testing_checking_045880.csv", "test_check2_350000.csv")
str_match(files, '(.*_)\\d+\\.csv$')[, 2]
#> [1] "testing1_" "check3_" "testing_checking_"
#> [4] "test_check2_"
The regex can be interpreted as: "capture everything up to and including an underscore that is followed by one or more digits and then .csv at the end of the string".
Created on 2021-12-03 by the reprex package (v2.0.1)