What is the fastest method to extract the file name from a path in R

Time:05-24

I am working with a large list of file paths/URLs (~100 million) in R and need to extract the filenames from them. I was using base::basename, as suggested in this Stack Overflow answer, but it was slow on a large vector of paths.

I am looking for a fast way to extract the filename (after the rightmost "/").

CodePudding user response:

I have tried a few different solutions, and sub(".*/", "", files, perl = T) is the fastest, roughly an order of magnitude faster than basename. I imagine basename is much "safer", but for my paths, which are URLs with no special characters, the regex methods seem to do the trick.

I would love to hear if anyone can come up with a faster method.

library(fs)
library(stringr)
library(microbenchmark)

files<-paste0("http://some/ppath/to/som/cool/file/",1:1000,".flac")

head(files)
[1] "http://some/ppath/to/som/cool/file/1.flac" "http://some/ppath/to/som/cool/file/2.flac"
[3] "http://some/ppath/to/som/cool/file/3.flac" "http://some/ppath/to/som/cool/file/4.flac"
[5] "http://some/ppath/to/som/cool/file/5.flac" "http://some/ppath/to/som/cool/file/6.flac"

Compare speeds

microbenchmark(
    basename(files),
    path_file(files), #from fs
    gsub(".*/", "", files),
    str_remove(files,".*/"), #from stringr
    sub(".*/", "", files),
    gsub(".*/", "", files,perl = T),
    sub(".*/", "", files,perl = T)
)
Unit: microseconds
                             expr      min        lq      mean    median        uq      max neval
                  basename(files) 5178.401 5410.5510 5541.2861 5499.3005 5603.6505 6690.302   100
                 path_file(files) 5281.401 5479.3010 5702.2419 5593.0510 5758.8010 8492.601   100
           gsub(".*/", "", files) 1109.701 1154.4515 1207.9010 1180.5010 1225.9010 1648.501   100
         str_remove(files, ".*/")  640.902  687.6010  763.1531  746.9505  816.7015 1192.601   100
            sub(".*/", "", files)  827.902  864.1505  910.9800  877.4015  902.0515 1393.701   100
 gsub(".*/", "", files, perl = T)  474.501  494.5515  528.1700  506.4510  533.6505  915.201   100
  sub(".*/", "", files, perl = T)  426.200  442.1505  466.3699  451.8015  480.0010  617.401   100
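To make the "safer" caveat concrete, here is a small sketch of inputs where the regex shortcut and basename can disagree. The example paths are my own assumptions, not taken from the real data, and the data frame is only there for a side-by-side view.

```r
# Hypothetical inputs chosen to show where the two approaches can diverge.
paths <- c(
  "http://some/ppath/to/som/cool/file/1.flac",  # ordinary URL
  "http://some/ppath/to/som/cool/file/",        # trailing slash
  "1.flac"                                      # no separator at all
)

by_regex    <- sub(".*/", "", paths, perl = TRUE)
by_basename <- basename(paths)

# Rows where `same` is FALSE are cases the regex does not handle like basename().
print(data.frame(path = paths, by_regex, by_basename,
                 same = by_regex == by_basename))
```

basename strips a trailing separator before taking the last component, while the regex removes everything up to the last "/" and leaves an empty string. If the real paths can end in "/", it is worth checking for that before switching.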

CodePudding user response:

Your sub call is not that fast because it does a lot of backtracking. Consider the following, compared with your fastest method:

sub("[a-z:/]+/", "", files, perl = T) # 1.5x
substr(files, regexpr("(?=[^/]+$)", files, perl = T), nchar(files)) # 3x

The second approach is about 3 times faster than your method.
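Before relying on the substr/regexpr approach for 100 million paths, it may be worth confirming it returns the same values as basename on a sample. A minimal sketch, where file_name_fast is a hypothetical helper name of my own and the pattern assumes "/" is the only separator:

```r
# Hypothetical helper: find where the final path component starts using a
# zero-width lookahead, then slice from there to the end of each string.
file_name_fast <- function(paths) {
  start <- regexpr("(?=[^/]+$)", paths, perl = TRUE)
  substr(paths, start, nchar(paths))
}

files <- paste0("http://some/ppath/to/som/cool/file/", 1:1000, ".flac")

# Stops with an error if the two approaches disagree on these inputs.
stopifnot(identical(file_name_fast(files), basename(files)))
```

Note that when nothing matches (for example a path ending in "/"), regexpr returns -1 and substr then returns the whole string unchanged, so such paths would need separate handling.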
