How to get the list of csv files in a directory sorted by creation date in Python


I need to get the list of ".csv" files in a directory, sorted by creation date.

I use this function:

from os import listdir
from os.path import isfile, join, getctime

def get_sort_files(path, file_extension):
    list_of_files = filter(lambda x: isfile(join(path, x)), listdir(path))
    list_of_files = sorted(list_of_files, key=lambda x: getctime(join(path, x)))
    list_of_files = [file for file in list_of_files if file.endswith(file_extension)]  # keep only files matching the extension
    return list_of_files

It works fine in directories that contain a small number of csv files (e.g. 500), but it is very slow in directories that contain 50,000 csv files: it takes about 50 seconds to return.

How can I modify it? Or can I use a better alternative function?

EDIT1:

The bottleneck is the sorted call, so I need an alternative way to order the files by creation date without it.

EDIT2:

I only need the oldest file (the first if sorted by creation date), so maybe I don't need to sort all the files. Can I just pick the oldest one?

CodePudding user response:

You should start by examining the creation time of only the relevant files. You can do that with glob(), which returns just the files of interest.

Build a list of 2-tuples, i.e. (creation time, file name).

Sorting that list is implicitly driven by the first item in each tuple (the creation time).

Then you can return the file names in the required order.

from glob import glob
from os.path import join, getctime

def get_sort_files(path, extension):
    list_of_files = []
    for file in glob(join(path, f'*{extension}')):
        list_of_files.append((getctime(file), file))
    return [file for _, file in sorted(list_of_files)]

print(get_sort_files('some directory', 'csv'))

Edit:

I created a directory with 50,000 dummy CSV files and timed the code shown in this answer. It took 0.24s
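That figure is easy to reproduce with a rough harness like the one below (a sketch: the directory is a throwaway tempdir, and the file count is dialed down to 1,000 so it runs quickly; raise it to 50,000 to match the measurement above):

```python
import tempfile
import time
from glob import glob
from os.path import join, getctime

def get_sort_files(path, extension):
    # Same approach as the answer: collect (ctime, name) tuples, sort once
    files = [(getctime(f), f) for f in glob(join(path, f'*{extension}'))]
    return [f for _, f in sorted(files)]

with tempfile.TemporaryDirectory() as tmp:
    # Create dummy CSV files; use 50_000 to match the timing above
    for i in range(1000):
        open(join(tmp, f'file{i:05d}.csv'), 'w').close()
    start = time.perf_counter()
    result = get_sort_files(tmp, '.csv')
    elapsed = time.perf_counter() - start
    print(f'{len(result)} files sorted in {elapsed:.3f}s')
```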

Edit 2:

The OP only wants the oldest file, in which case:

def get_oldest_file(path, extension):
    ctime = float('inf')
    old_file = None
    for file in glob(join(path,f'*{extension}')):
        if (ctime_ := getctime(file)) < ctime:
            ctime = ctime_
            old_file = file
    return old_file

CodePudding user response:

You could try using os.scandir:

from os import scandir

def get_sort_files(path, file_extension):
    """Return the oldest file in path with the correct file extension"""
    list_of_files = [(d.stat().st_ctime, d.path)
                     for d in scandir(path)
                     if d.is_file() and d.path.endswith(file_extension)]
    list_of_files.sort()
    return list_of_files[0][1]

os.scandir seems to use fewer calls to stat; see this post for details. I saw much better performance on a sample folder with 5,000 csv files.
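Since the OP ultimately only needs the oldest file, the same scandir approach can skip sorting entirely by feeding the entries to min() (a sketch along the same lines as the code above, still keyed on st_ctime; the function name is my own):

```python
from os import scandir

def get_oldest_file(path, file_extension):
    """Return the path of the oldest matching file, or None if there is none."""
    entries = (d for d in scandir(path)
               if d.is_file() and d.path.endswith(file_extension))
    # min() scans once in O(n); default=None avoids a ValueError on no match
    oldest = min(entries, key=lambda d: d.stat().st_ctime, default=None)
    return oldest.path if oldest is not None else None
```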

CodePudding user response:

You could try the following code:

from os import listdir
from os.path import isfile, join, getctime

def get_sort_files(path, file_extension):
    list_of_files = [file for file in listdir(path)
                     if isfile(join(path, file)) and file.endswith(file_extension)]
    list_of_files.sort(key=lambda x: getctime(join(path, x)))
    return list_of_files

This version can perform better, especially on big folders. It uses a list comprehension to discard irrelevant files right at the start, and it sorts in place.

This way, the code uses only one list. Your version creates multiple lists in memory, and the data has to be copied each time:

  1. listdir(path) returns the initial list of filenames
  2. sorted(...) returns a sorted copy of the filtered list
  3. The list comprehension before the return statement creates yet another new list

CodePudding user response:

You can try this method:

from os import listdir
from os.path import join, getctime

def get_sort_files(path, extension):
    # Relative path generator
    sort_paths = (join(path, i)
                  for i in listdir(path) if i.endswith(extension))
    sort_paths = sorted(sort_paths, key=getctime)

    return sort_paths
# Include the . char to be explicit
>>> get_sort_files("dir", ".csv")
['dir/new.csv', 'dir/test.csv']

However, all the file names come back as relative paths, e.g. folder/file.csv. A slightly less efficient work-around is to use a lambda key again:

def get_sort_files(path, extension):
    # File name generator
    sort_paths = (i for i in listdir(path) if i.endswith(extension))
    sort_paths = sorted(sort_paths, key=lambda x: getctime(join(path, x)))

    return sort_paths
>>> get_sort_files("dir", ".csv")
['new.csv', 'test.csv']

Edit for avoiding sorted():

Using min():

This is the fastest method of all listed in this answer

def get_sort_files(path, extension):
    # Relative path generator
    sort_paths = (join(path, i) for i in listdir(path) if i.endswith(extension))
    return min(sort_paths, key=getctime)

Manually:

def get_sort_files(path, extension):
    # Relative path list
    sort_paths = [join(path, i) for i in listdir(path) if i.endswith(extension)]

    oldest = (getctime(sort_paths[0]), sort_paths[0])
    for i in sort_paths[1:]:
        t = getctime(i)
        if t < oldest[0]:
            oldest = (t, i)

    return oldest[1]
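If the requirement ever grows from the single oldest file to the k oldest, heapq.nsmallest gives a partial sort without ordering the whole listing (a sketch building on the same generator pattern; the function name and k parameter are mine):

```python
from heapq import nsmallest
from os import listdir
from os.path import join, getctime

def get_oldest_k_files(path, extension, k):
    # Partial sort: only the k smallest ctimes are fully ordered,
    # which beats sorting all n entries when k << n
    paths = (join(path, i) for i in listdir(path) if i.endswith(extension))
    return nsmallest(k, paths, key=getctime)
```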