I need to get the list of ".csv" files in a directory, sorted by creation date.
I use this function:
from os import listdir
from os.path import isfile, join, getctime

def get_sort_files(path, file_extension):
    list_of_files = filter(lambda x: isfile(join(path, x)), listdir(path))
    list_of_files = sorted(list_of_files, key=lambda x: getctime(join(path, x)))
    list_of_files = [file for file in list_of_files if file.endswith(file_extension)]  # keep only csv files
    return list_of_files
It works fine when I use it in directories that contain a small number of csv files (e.g. 500), but it's very slow when I use it in directories that contain 50000 csv files: it takes about 50 seconds to return.
How can I modify it to run faster, or is there a better alternative function I can use?
EDIT1:
The bottleneck is the sorted() function, so I must find an alternative way to sort the files by creation date without using it.
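One way to check this is to time the two steps separately; a rough sketch (the directory path is a placeholder):

import time
from os import listdir
from os.path import isfile, join, getctime

path = 'some_big_directory'  # placeholder

t0 = time.perf_counter()
files = [f for f in listdir(path) if isfile(join(path, f))]
t1 = time.perf_counter()
files = sorted(files, key=lambda x: getctime(join(path, x)))
t2 = time.perf_counter()

print(f'listdir + isfile: {t1 - t0:.2f}s')
print(f'sorted by getctime: {t2 - t1:.2f}s')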
EDIT2:
I only need the oldest file (the first if sorted by creation date), so maybe I don't need to sort all the files. Can I just pick the oldest one?
CodePudding user response:
You should start by only examining the creation time on relevant files. You can do this by using glob() to return the files of interest.
Build a list of 2-tuples, i.e. (creation time, file name).
Sorting that list will implicitly sort on the first item in each tuple (the creation time).
Then you can return the list of file names in the required order.
from glob import glob
from os.path import join, getctime

def get_sort_files(path, extension):
    list_of_files = []
    for file in glob(join(path, f'*{extension}')):
        list_of_files.append((getctime(file), file))
    return [file for _, file in sorted(list_of_files)]

print(get_sort_files('some directory', 'csv'))
Edit:
I created a directory with 50,000 dummy CSV files and timed the code shown in this answer. It took 0.24s
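The benchmark script itself isn't shown; a rough sketch of how such a test directory could be created and timed (file count, names and paths are placeholders, and the 0.24s figure above is the answerer's measurement):

import time
from pathlib import Path

test_dir = Path('csv_test_dir')          # placeholder directory
test_dir.mkdir(exist_ok=True)
for i in range(50_000):                  # create empty dummy CSV files
    (test_dir / f'file_{i}.csv').touch()

start = time.perf_counter()
files = get_sort_files(str(test_dir), '.csv')
print(f'{len(files)} files sorted in {time.perf_counter() - start:.2f}s')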
Edit 2:
The OP only wants the oldest file. In that case:
def get_oldest_file(path, extension):
    ctime = float('inf')
    old_file = None
    for file in glob(join(path, f'*{extension}')):
        if (ctime_ := getctime(file)) < ctime:
            ctime = ctime_
            old_file = file
    return old_file
CodePudding user response:
You could try using os.scandir:
from os import scandir

def get_sort_files(path, file_extension):
    """Return the oldest file in path with correct file extension"""
    list_of_files = [(d.stat().st_ctime, d.path)
                     for d in scandir(path)
                     if d.is_file() and d.path.endswith(file_extension)]
    list_of_files.sort()
    return list_of_files[0][1]
os.scandir seems to use fewer calls to stat. See this post for details. I saw much better performance on a sample folder with 5000 csv files.
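A rough way to compare the two approaches on the same folder (the directory name is a placeholder, and this is a sketch rather than the benchmark used above):

import time
from os import listdir
from os.path import join, getctime

path = 'sample_csv_dir'  # placeholder

start = time.perf_counter()
get_sort_files(path, '.csv')             # scandir-based version above
print(f'scandir: {time.perf_counter() - start:.2f}s')

start = time.perf_counter()
sorted((f for f in listdir(path) if f.endswith('.csv')),
       key=lambda f: getctime(join(path, f)))   # question's listdir approach
print(f'listdir + getctime: {time.perf_counter() - start:.2f}s')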
CodePudding user response:
You could try the following code:
from os import listdir
from os.path import isfile, join, getctime

def get_sort_files(path, file_extension):
    list_of_files = [file for file in listdir(path)
                     if isfile(join(path, file)) and file.endswith(file_extension)]
    list_of_files.sort(key=lambda x: getctime(join(path, x)))
    return list_of_files
This version could perform better, especially on big folders. It uses a list comprehension to discard irrelevant files right at the start, and it sorts the list in place.
This way, the code uses only one list. In your code, you create multiple lists in memory and the data has to be copied each time:
- listdir(path) returns the initial list of filenames
- sorted(...) returns a filtered and sorted copy of the initial list
- The list comprehension before the return statement creates another new list
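A variant of the same idea (not part of the answer above, and whether it helps in practice would need measuring, since listdir() still builds the initial list) is to feed a generator straight into sorted(), so no intermediate filtered list is built by hand:

from os import listdir
from os.path import isfile, join, getctime

def get_sort_files_gen(path, file_extension):
    # Generator: nothing is materialised until sorted() consumes it
    candidates = (f for f in listdir(path)
                  if f.endswith(file_extension) and isfile(join(path, f)))
    return sorted(candidates, key=lambda f: getctime(join(path, f)))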
CodePudding user response:
You can try this method:
from os import listdir
from os.path import join, getctime

def get_sort_files(path, extension):
    # Relative path generator
    sort_paths = (join(path, i)
                  for i in listdir(path) if i.endswith(extension))
    sort_paths = sorted(sort_paths, key=getctime)
    return sort_paths
# Include the . char to be explicit
>>> get_sort_files("dir", ".csv")
['dir/new.csv', 'dir/test.csv']
However, all file names are returned as relative paths, e.g. folder/file.csv. A slightly less efficient work-around would be to use a lambda key again:
def get_sort_files(path, extension):
    # File name generator
    sort_paths = (i for i in listdir(path) if i.endswith(extension))
    sort_paths = sorted(sort_paths, key=lambda x: getctime(join(path, x)))
    return sort_paths
>>> get_sort_files("dir", ".csv")
['new.csv', 'test.csv']
Edit for avoiding sorted():
Using min():
This is the fastest method of all those listed in this answer.
def get_sort_files(path, extension):
    # Relative path generator
    sort_paths = (join(path, i) for i in listdir(path) if i.endswith(extension))
    return min(sort_paths, key=getctime)
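With the same example directory as before, this would return just the oldest path:

>>> get_sort_files("dir", ".csv")
'dir/new.csv'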
Manually:
def get_sort_files(path, extension):
    # Relative path list
    sort_paths = [join(path, i) for i in listdir(path) if i.endswith(extension)]
    oldest = (getctime(sort_paths[0]), sort_paths[0])
    for i in sort_paths[1:]:
        t = getctime(i)
        if t < oldest[0]:
            oldest = (t, i)
    return oldest[1]