How to load only the most recent file from a directory where the filenames start with the date?

I have files in one directory/folder named:

2022-07-31_DATA_GVAX_ARPA_COMBINED.csv
2022-08-31_DATA_GVAX_ARPA_COMBINED.csv
2022-09-30_DATA_GVAX_ARPA_COMBINED.csv

The folder will be updated with each month's file in the same format as above, e.g.:

2022-10-31_DATA_GVAX_ARPA_COMBINED.csv
2022-11-30_DATA_GVAX_ARPA_COMBINED.csv

I want to only load the most recent month's .csv into a pandas dataframe, not all the files. How can I do this (maybe using glob)?

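I have seen this done for files whose names start with a fixed prefix and end with the date:

from pathlib import Path

dir_files = r'/path/to/folder'  # the directory itself; the wildcard goes in .glob()

dico = {}

# Map the trailing date part of each file name to its path
for file in Path(dir_files).glob('DATA_GVAX_COMBINED_*.csv'):
    dico[file.stem.split('_')[-1]] = file

max_date = max(dico)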

CodePudding user response：

You could try something like this:


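import pandas as pd
from pathlib import Path


# Point at the directory itself; the wildcard belongs in the glob() pattern
dir_files = r'/path/to/folder'

dico = {}

for file in Path(dir_files).glob('*DATA_GVAX_ARPA_COMBINED*.csv'):
    # The date is the part of the file name before the first underscore
    date_value = pd.to_datetime(file.name.split('_')[0], errors="coerce")
    if pd.notna(date_value):  # skip files whose prefix is not a valid date
        dico[date_value] = file

max_date = max(dico.keys())
filepath = dico[max_date]
print(f'{max_date} -> {filepath.name}')
# Prints, once the October file is in the folder:
#
# 2022-10-31 00:00:00 -> 2022-10-31_DATA_GVAX_ARPA_COMBINED.csv

df = pd.read_csv(filepath)  # load only the most recent file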

CodePudding user response：

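Glob the directory with a pattern that matches only the files of interest, then sort on the basename. A plain lexicographic sort is enough here because ISO dates (YYYY-MM-DD) sort chronologically as strings.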

from glob import glob
from os.path import basename, join

def get_latest(directory):
    # glob() already returns a list; an empty list means nothing matched
    all_files = glob(join(directory, '*_DATA_GVAX_ARPA_COMBINED.csv'))
    if all_files:
        # Compare on the file name only, so the directory part is ignored
        return max(all_files, key=basename)
    return None

print(get_latest('/Users/Cobra'))
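
If the folder might also contain files that match the pattern but do not start with a valid date, the two ideas above can be combined. The following is only a sketch under that assumption: the helper name latest_dated_file and its suffix parameter are made up for illustration, and it relies on the date being everything in the file name before the fixed suffix.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def latest_dated_file(
    folder: str, suffix: str = "_DATA_GVAX_ARPA_COMBINED.csv"
) -> Optional[Path]:
    """Return the matching file with the latest YYYY-MM-DD prefix, or None.

    Hypothetical helper: assumes the name is "<date><suffix>".
    """
    dated = []
    for path in Path(folder).glob(f"*{suffix}"):
        prefix = path.name[: -len(suffix)]  # the part before the fixed suffix
        try:
            dated.append((date.fromisoformat(prefix), path))
        except ValueError:
            continue  # prefix is not a valid date, so skip this file
    if not dated:
        return None  # nothing usable in the folder
    # Compare on the parsed date only, then return the matching path
    return max(dated, key=lambda pair: pair[0])[1]
```

The returned path can be passed straight to pd.read_csv(...) to load that month's data. Check for None first, since an empty folder yields no path.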