How to get the list of csv files in a directory sorted by creation date in Python


I need to get the list of ".csv" files in a directory, sorted by creation date.

I use this function:

from os import listdir
from os.path import isfile, join, getctime

def get_sort_files(path, file_extension):
    list_of_files = filter(lambda x: isfile(join(path, x)),listdir(path)) 
    list_of_files = sorted(list_of_files, key=lambda x: getctime(join(path, x)))
    list_of_files = [file for file in list_of_files if file.endswith(file_extension)] # keep only csv files
    return list_of_files

It works fine when I use it in directories that contain a small number of csv files (e.g. 500), but it's very slow when I use it in directories that contain 50000 csv files: it takes about 50 seconds to return.

How can I modify it? Or can I use a better alternative function?


The bottleneck is the sorted() call, so I need a way to order the files by creation date without using it.


I only need the oldest file (the first if sorted by creation date), so maybe I don't need to sort all the files. Can I just pick the oldest one?

You should start by only examining the creation time on relevant files. You can do this by using glob() to return the files of interest.

Build a list of 2-tuples, i.e. (creation time, file name).

Sorting that list compares the tuples element by element, so it orders them by the first item (the creation time) and falls back to the file name only on ties.

Then you can return a list of files in the required order.

from glob import glob
from os.path import join, getctime

def get_sort_files(path, extension):
    list_of_files = []
    for file in glob(join(path,f'*{extension}')):
        list_of_files.append((getctime(file), file))
    return [file for _, file in sorted(list_of_files)]

print(get_sort_files('some directory', 'csv'))
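To illustrate the tuple ordering that the function relies on, here is a small sketch. The timestamps and file names are made up for the example and do not refer to real files:

```python
# Hypothetical (creation time, file name) pairs; the values are made up.
pairs = [
    (1700000300.0, 'b.csv'),
    (1700000100.0, 'c.csv'),
    (1700000100.0, 'a.csv'),
]

# Tuples compare element by element: first by time, then by name on ties.
ordered = [name for _, name in sorted(pairs)]
print(ordered)  # ['a.csv', 'c.csv', 'b.csv']
```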


I created a directory with 50,000 dummy CSV files and timed the code shown in this answer. It took 0.24 s.
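If you want to run a similar measurement yourself, something like the following should work. It is only a rough sketch: the helper name time_on_dummy_files and the file naming scheme are assumptions for the example, and the timing will depend heavily on your disk and OS.

```python
import tempfile
import time
from glob import glob
from os.path import join, getctime


def get_sort_files(path, extension):
    """Glob the matching files, then sort (ctime, name) tuples."""
    pairs = [(getctime(f), f) for f in glob(join(path, f'*{extension}'))]
    return [f for _, f in sorted(pairs)]


def time_on_dummy_files(count):
    """Create `count` empty CSV files in a temp dir and time one call."""
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(count):
            open(join(tmp, f'file_{i}.csv'), 'w').close()
        start = time.perf_counter()
        result = get_sort_files(tmp, '.csv')
        elapsed = time.perf_counter() - start
    return len(result), elapsed


# Small count here so the sketch runs quickly; raise it to 50_000
# to mirror the directory size from the question.
found, seconds = time_on_dummy_files(1_000)
print(f'{found} files sorted in {seconds:.3f}s')
```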

Edit 2:

OP only wants the oldest file, in which case:

def get_oldest_file(path, extension):
    ctime = float('inf')
    old_file = None
    for file in glob(join(path,f'*{extension}')):
        if (ctime_ := getctime(file)) < ctime:
            ctime = ctime_
            old_file = file
    return old_file

You could try using os.scandir:

from os import scandir

def get_sort_files(path, file_extension):
    """Return the oldest file in path with the given file extension"""
    list_of_files = [(d.stat().st_ctime, d.path) for d in scandir(path) if d.is_file() and d.path.endswith(file_extension)]
    # The list is not sorted, so take the tuple with the smallest ctime
    return min(list_of_files)[1]

os.scandir seems to use fewer calls to stat. See this post for details. I saw much better performance on a sample folder with 5000 csv files.
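If you need the full sorted list rather than just the oldest file, the same idea can be sketched as below. The function name get_sorted_files_scandir is made up for this example. Note that st_ctime is the creation time on Windows but the time of the last metadata change on Unix, so treat the ordering accordingly.

```python
from os import scandir


def get_sorted_files_scandir(path, file_extension):
    """Return matching file paths in path, smallest st_ctime first."""
    pairs = []
    with scandir(path) as entries:
        for entry in entries:
            # Check the name first so non-matching entries are never stat'ed.
            if entry.name.endswith(file_extension) and entry.is_file():
                pairs.append((entry.stat().st_ctime, entry.path))
    # Tuples sort by ctime first, then by path on ties.
    pairs.sort()
    return [file_path for _, file_path in pairs]
```

Called as get_sorted_files_scandir('some directory', '.csv'), it returns paths that include the directory, like the glob version does.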

You could try the following code:

from os import listdir
from os.path import isfile, join, getctime

def get_sort_files(path, file_extension):
    list_of_files = [file for file in listdir(path) if isfile(join(path, file)) and file.endswith(file_extension)]
    list_of_files.sort(key=lambda x: getctime(join(path, x)))
    return list_of_files

This version may perform better, especially on large folders. The list comprehension discards irrelevant files up front, so getctime is only called for matching files, and the sort is done in place.

This way, only one list is built on top of the one returned by listdir. Your code builds more lists, and the data has to be copied each time:

  1. listdir(path) returns the initial list of filenames
  2. sorted(...) returns a filtered and sorted copy of the initial list
  3. The list comprehension before the return statement creates another new list

You can try this method:

from os import listdir
from os.path import join, getctime

def get_sort_files(path, extension):
    # Generator of paths prefixed with the directory
    sort_paths = (join(path, i)
                  for i in listdir(path) if i.endswith(extension))
    sort_paths = sorted(sort_paths, key=getctime)

    return sort_paths

# Include the . char to be explicit
>>> get_sort_files("dir", ".csv")
['dir/new.csv', 'dir/test.csv']

However, every returned name is prefixed with the directory, e.g. folder/file.csv. A slightly less efficient workaround that returns bare file names is to use a lambda key again:

def get_sort_files(path, extension):
    # File name generator
    sort_paths = (i for i in listdir(path) if i.endswith(extension))
    sort_paths = sorted(sort_paths, key=lambda x: getctime(join(path, x)))

    return sort_paths

>>> get_sort_files("dir", ".csv")
['new.csv', 'test.csv']

Edit for avoiding sorted():

Using min():

This is the fastest of the methods listed in this answer. It returns only the oldest file:

def get_sort_files(path, extension):
    # Generator of paths prefixed with the directory
    sort_paths = (join(path, i) for i in listdir(path) if i.endswith(extension))
    return min(sort_paths, key=getctime)


An equivalent version with an explicit loop instead of min():

def get_sort_files(path, extension):
    # List of paths prefixed with the directory
    sort_paths = [join(path, i) for i in listdir(path) if i.endswith(extension)]

    oldest = (getctime(sort_paths[0]), sort_paths[0])
    for i in sort_paths[1:]:
        t = getctime(i)
        if t < oldest[0]:
            oldest = (t, i)

    return oldest[1]
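Both variants raise an exception when no file matches (ValueError from min(), IndexError from sort_paths[0]). If that can happen in your directory, one option is the default argument of min(). This is only a sketch, and the name get_oldest_file is chosen here for the example:

```python
from os import listdir
from os.path import join, getctime


def get_oldest_file(path, extension):
    """Return the path with the smallest ctime, or None if nothing matches."""
    candidates = (join(path, name)
                  for name in listdir(path) if name.endswith(extension))
    # default=None is returned when the generator yields no paths.
    return min(candidates, key=getctime, default=None)
```

The caller then has to check for None before using the result.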
