How to load only the most recent file from a directory where the filenames end with the date?

I have files in one directory/folder named:

sacoronavirus_total_number_vaccinated_type_2022-04-30.csv
sacoronavirus_total_number_vaccinated_type_2022-05-31.csv
sacoronavirus_total_number_vaccinated_type_2022-06-30.csv
sacoronavirus_total_number_vaccinated_type_2022-07-31.csv
sacoronavirus_total_number_vaccinated_type_2022-08-31.csv

The folder will be updated with each month's file, named in the same format as above, e.g.:

sacoronavirus_total_number_vaccinated_type_2022-09-30.csv
sacoronavirus_total_number_vaccinated_type_2022-10-31.csv

I want to only load the most recent month's .csv into a pandas dataframe, not all the files. How can I do this (maybe using glob)?

The code below gets the most recent file by its metadata timestamp (os.path.getctime), not by the date in the filename:

import glob
import os

list_of_files = glob.glob('/path/to/folder/*')  # * matches every file; use *.csv to match only CSV files
latest_file = max(list_of_files, key=os.path.getctime)
print(latest_file)

Note that there are other files in the same directory with different prefixes.

CodePudding user response:

If all the files you match share the same prefix, you only need the last name in the sorted list, because ISO 8601 date strings (YYYY-MM-DD) sort lexicographically in chronological order:

import glob

# Match only this prefix, so other files in the folder are ignored
list_of_files = sorted(glob.glob('/path/to/folder/sacoronavirus_total_number_vaccinated_type_*.csv'))
latest_file = list_of_files[-1]
print(latest_file)

In fact, just

latest_file = max(glob.glob('/path/to/folder/sacoronavirus_total_number_vaccinated_type_*.csv'))

works too, if you don't need the full list for anything.
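
The same idea can be wrapped in a small helper that also copes with a folder containing no matching files. This is only a sketch: the function name latest_by_name and its prefix parameter are made up for illustration, and it assumes every matching filename ends in a zero-padded YYYY-MM-DD date.

```python
import glob
import os


def latest_by_name(folder, prefix):
    """Return the '<prefix>*.csv' path that sorts last, or None if nothing matches."""
    # glob.escape protects any special characters in the folder name itself.
    pattern = os.path.join(glob.escape(folder), prefix + "*.csv")
    matches = glob.glob(pattern)
    # All matches share the same folder and prefix, so comparing the full
    # path strings is equivalent to comparing the trailing ISO dates.
    return max(matches, default=None)
```

The returned path can be passed straight to pandas.read_csv. Files with other prefixes are never considered, because they do not match the pattern.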

CodePudding user response:

Here is one way to select the most recent file by the date in its filename, using pathlib.

from pathlib import Path
from datetime import datetime
import pandas as pd

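dir_files = r'/path/to/folder'  # the directory itself, without a trailing wildcard

# Map each date string (e.g. '2022-08-31') to its file path
dico = {}

for file in Path(dir_files).glob('sacoronavirus_total_number_vaccinated_*.csv'):
    dico[file.stem.split('_')[-1]] = file

# Pick the key that parses to the latest date
max_date = max(dico, key=lambda x: datetime.strptime(x, '%Y-%m-%d'))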

Then, you can use pandas.read_csv and pass the file path to create a dataframe.

df = pd.read_csv(dico[max_date])
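
If some filenames might not end in a valid date, a slightly more defensive variant can skip them instead of raising an error. The following is a rough sketch under that assumption. The name latest_dated_csv and its parameters are hypothetical, and date.fromisoformat (Python 3.7+) stands in for the strptime call above.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def latest_dated_csv(folder, prefix) -> Optional[Path]:
    """Return the '<prefix><YYYY-MM-DD>.csv' file with the latest date, or None."""
    dated = []
    for path in Path(folder).glob(prefix + "*.csv"):
        # Whatever follows the prefix in the stem should be the date.
        suffix = path.stem[len(prefix):]
        try:
            dated.append((date.fromisoformat(suffix), path))
        except ValueError:
            # Not a valid ISO date: ignore this file.
            continue
    if not dated:
        return None
    # Compare on the parsed date only, then return the matching path.
    return max(dated, key=lambda pair: pair[0])[1]
```

Here too the result can be handed to pandas.read_csv once you have checked that it is not None.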