I am working with a large list of file paths/URLs (~100 million) in R and need to extract the filenames from them. I was using base::basename, as suggested in this Stack Overflow answer, but it was slow on a large vector of paths.
I am looking for a fast way to extract the filename (everything after the rightmost "/").
CodePudding user response:
I have tried a few different solutions, and sub(".*/", "", files, perl = T) is the fastest: roughly an order of magnitude faster than basename in the benchmark below. I imagine basename is much "safer", but for my paths, which are URLs with no special characters, these methods seem to do the trick.
I would love to hear if anyone can come up with a faster method.
library(fs)              # path_file()
library(stringr)         # str_remove()
library(microbenchmark)
# 1000 sample URLs that all share the same directory prefix
files <- paste0("http://some/ppath/to/som/cool/file/", 1:1000, ".flac")
head(files)
[1] "http://some/ppath/to/som/cool/file/1.flac" "http://some/ppath/to/som/cool/file/2.flac"
[3] "http://some/ppath/to/som/cool/file/3.flac" "http://some/ppath/to/som/cool/file/4.flac"
[5] "http://some/ppath/to/som/cool/file/5.flac" "http://some/ppath/to/som/cool/file/6.flac"
Compare speeds:
microbenchmark(
  basename(files),
  path_file(files),                 # from fs
  gsub(".*/", "", files),
  str_remove(files, ".*/"),         # from stringr
  sub(".*/", "", files),
  gsub(".*/", "", files, perl = T),
  sub(".*/", "", files, perl = T)
)
Unit: microseconds
                            expr      min        lq      mean    median        uq      max neval
                 basename(files) 5178.401 5410.5510 5541.2861 5499.3005 5603.6505 6690.302   100
                path_file(files) 5281.401 5479.3010 5702.2419 5593.0510 5758.8010 8492.601   100
          gsub(".*/", "", files) 1109.701 1154.4515 1207.9010 1180.5010 1225.9010 1648.501   100
        str_remove(files, ".*/")  640.902  687.6010  763.1531  746.9505  816.7015 1192.601   100
           sub(".*/", "", files)  827.902  864.1505  910.9800  877.4015  902.0515 1393.701   100
gsub(".*/", "", files, perl = T)  474.501  494.5515  528.1700  506.4510  533.6505  915.201   100
 sub(".*/", "", files, perl = T)  426.200  442.1505  466.3699  451.8015  480.0010  617.401   100
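Before relying on the regex for all ~100 million paths, it may be worth confirming that it agrees with basename on a sample, and seeing where the two can differ. The following is a minimal sketch under the assumption that the paths look like the URLs above; the trailing-slash input is a made-up edge case, not something from the real data.

```r
files <- paste0("http://some/ppath/to/som/cool/file/", 1:1000, ".flac")

# On well-formed paths the regex and basename() return the same names.
regex_names <- sub(".*/", "", files, perl = TRUE)
stopifnot(identical(regex_names, basename(files)))

# Hypothetical edge case: a path that ends in "/".
trailing <- "http://some/dir/"
sub(".*/", "", trailing, perl = TRUE)  # "" because nothing follows the last "/"
basename(trailing)                     # "dir" because basename() drops trailing slashes first
```

If inputs like the last one can occur in the real data, the regex result needs separate handling; otherwise the two approaches should be interchangeable for plain URLs.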
CodePudding user response:
Your sub is not that fast, because the pattern does a lot of backtracking. Consider the following, compared to your fastest method:
sub("[a-z:/]+/", "", files, perl = T) # 1.5x
substr(files, regexpr("(?=[^/]+$)", files, perl = T), nchar(files)) # 3x
In my timings, the second version is about 3 times faster than your method.
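The lookahead approach can be wrapped in a small helper so it is easier to reuse and to check against basename. This is only a sketch: the name fast_basename is hypothetical (it does not come from any package), and returning "" for paths that end in "/" is an assumption chosen to match what sub(".*/", "", x) gives.

```r
# Hypothetical helper; the name fast_basename is not from any package.
fast_basename <- function(paths) {
  # Zero-width lookahead: position of the first character after the last "/".
  start <- regexpr("(?=[^/]+$)", paths, perl = TRUE)
  out <- substr(paths, start, nchar(paths))
  # regexpr() returns -1 when nothing follows the last "/" (e.g. "a/b/");
  # return "" in that case (an assumption, mirroring the sub() result).
  out[which(start == -1L)] <- ""
  out
}

files <- paste0("http://some/ppath/to/som/cool/file/", 1:1000, ".flac")
stopifnot(identical(fast_basename(files), basename(files)))
head(fast_basename(files), 3)
```

Speedups depend on the R version and on the shape of the paths, so it is probably worth rerunning microbenchmark on a sample of the real data before committing to one method.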