Regex for extracting string from csv before numbers

I'm very new to the regex world and would like to know how to extract strings using regex from a bunch of file names I've imported to R. My files follow the general format of:

testing1_010000.csv
check3_012000.csv
testing_checking_045880.csv
test_check2_350000.csv

I'd like to extract everything before the six digits and the .csv extension, including the "_", to get something like:

testing1_
check3_
testing_checking_
test_check2_

If it helps, the pattern I essentially want to remove will always be 6 numbers immediately followed by .csv.

Any help would be great, thank you!

CodePudding user response：

There are a few ways you could go about this. For example, match anything before a string of six digits followed by ".csv". For this one you would want to get the first capturing group.

/(.*)\d{6}\.csv/

https://regex101.com/r/MPH6mE/1/

Or match everything up to the last underscore character. For this one you would want the whole match.

.*_

https://regex101.com/r/4GFPIA/1
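
Since the question is about R, here is a minimal sketch of how those two patterns might be applied there. The variable names (files, prefix_group, prefix_match) are only illustrative, and the choice of sub() and regmatches() is one option among several. Note that backslashes have to be doubled inside R string literals.

```r
# File names taken from the question
files <- c("testing1_010000.csv", "check3_012000.csv",
           "testing_checking_045880.csv", "test_check2_350000.csv")

# Approach 1: replace the whole name with the first capturing group,
# i.e. everything before the six digits and ".csv"
prefix_group <- sub("(.*)\\d{6}\\.csv$", "\\1", files)

# Approach 2: take the whole match of "everything up to the last underscore"
# (.* is greedy, so the match ends at the final "_")
prefix_match <- regmatches(files, regexpr(".*_", files))

print(prefix_group)
print(prefix_match)
```

Both vectors should hold the four prefixes from the question. The second approach relies on every name containing an underscore right before the digits; regmatches() silently drops names with no match.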

CodePudding user response：

Files = c("testing1_010000.csv", "check3_012000.csv",
    "testing_checking_045880.csv", "test_check2_350000.csv")
sub("(.*_)[[:digit:]]{6}.*", "\\1", Files)

 
[1] "testing1_"         "check3_"           "testing_checking_"
[4] "test_check2_"

CodePudding user response：

Using nchar. The part to drop is always 10 characters long (six digits plus ".csv"), so we can cut by position:

Files = c("testing1_010000.csv", "check3_012000.csv",
          "testing_checking_045880.csv", "test_check2_350000.csv")

substr(Files, 1, nchar(Files)-10)

[1] "testing1_"         "check3_"           "testing_checking_"
[4] "test_check2_"

CodePudding user response：

We can use stringr::str_match(). It also works when the number of digits is something other than six.

library(tidyverse)

files <- c("testing1_010000.csv", "check3_012000.csv", "testing_checking_045880.csv", "test_check2_350000.csv")



str_match(files, '(.*_)\\d+\\.csv$')[, 2]
#> [1] "testing1_"         "check3_"           "testing_checking_"
#> [4] "test_check2_"

The regex can be interpreted as: "capture everything up to and including an underscore that is followed by one or more digits and then .csv at the end of the string".

Created on 2021-12-03 by the reprex package (v2.0.1)
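
To illustrate the "one or more digits" behaviour without the stringr dependency, here is a rough base R sketch using regexec() and regmatches(), which together return the full match and the capture groups much like str_match() does. The file names below are hypothetical and were made up only to vary the digit count; they are not from the question.

```r
# Hypothetical file names with varying numbers of trailing digits,
# plus one name that does not fit the pattern at all
varied <- c("testing1_01.csv", "check3_012000.csv",
            "test_check2_3500001.csv", "no_digits.csv")

# regexec() records the positions of the whole match and each capture group;
# regmatches() turns them into a list of character vectors
hits <- regmatches(varied, regexec("(.*_)\\d+\\.csv$", varied))

# Element 1 is the whole match, element 2 is the first capture group.
# A name that does not match gives an empty vector, so [2] yields NA.
prefixes <- sapply(hits, `[`, 2)

print(prefixes)
```

Names that do not end in an underscore, digits and ".csv" come back as NA rather than being dropped, which keeps the result aligned with the input vector.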