I have files in one directory/folder named:
sacoronavirus_total_number_vaccinated_type_2022-04-30.csv
sacoronavirus_total_number_vaccinated_type_2022-05-31.csv
sacoronavirus_total_number_vaccinated_type_2022-06-30.csv
sacoronavirus_total_number_vaccinated_type_2022-07-31.csv
sacoronavirus_total_number_vaccinated_type_2022-08-31.csv
The folder will be updated with each month's file in the same format as above. e.g.
sacoronavirus_total_number_vaccinated_type_2022-09-30.csv
sacoronavirus_total_number_vaccinated_type_2022-10-31.csv
I want to only load the most recent month's .csv into a pandas dataframe, not all the files. How can I do this (maybe using glob)?
The code below gets the most recent file by its metadata (creation time), not by the date string in the filename:
import glob
import os
list_of_files = glob.glob('/path/to/folder/*') # '*' matches every file; use '*.csv' to match only CSV files
latest_file = max(list_of_files, key=os.path.getctime)
print(latest_file)
Note that there are other files in the same directory with different prefixes.
CodePudding user response:
If you restrict the glob pattern to the shared prefix, then all you need to do is take the last file in the sorted list of names, since ISO 8601 date strings are lexicographically comparable:
import glob
list_of_files = sorted(glob.glob('/path/to/folder/sacoronavirus_total_number_vaccinated_type_*.csv'))
latest_file = list_of_files[-1]
print(latest_file)
In fact, just
latest_file = max(glob.glob('/path/to/folder/sacoronavirus_total_number_vaccinated_type_*.csv'))
works too, if you don't need the full list for anything.
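As a quick sanity check of the claim above, here is a minimal sketch comparing plain string ordering with ordering by parsed date, using filenames from the question. The helper date_of is hypothetical and only exists for this comparison. The shortcut relies on the dates being zero-padded YYYY-MM-DD; with a format such as 2022-9-30 the string ordering would no longer be reliable.

```python
from datetime import datetime
from pathlib import Path

# Filenames from the question, deliberately out of order.
names = [
    "sacoronavirus_total_number_vaccinated_type_2022-10-31.csv",
    "sacoronavirus_total_number_vaccinated_type_2022-04-30.csv",
    "sacoronavirus_total_number_vaccinated_type_2022-09-30.csv",
    "sacoronavirus_total_number_vaccinated_type_2022-08-31.csv",
]


def date_of(name):
    # Hypothetical helper: parse the trailing YYYY-MM-DD field of the filename.
    return datetime.strptime(Path(name).stem.rsplit("_", 1)[-1], "%Y-%m-%d")


latest_by_name = max(names)               # plain string comparison
latest_by_date = max(names, key=date_of)  # comparison of parsed dates

print(latest_by_name == latest_by_date)
```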
CodePudding user response:
Here is a suggestion for selecting the most recent file by its filename, using pathlib.
from pathlib import Path
from datetime import datetime
import pandas as pd
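dir_files = r'/path/to/folder'  # the directory itself, without a trailing wildcard
dico = {}
# Map each file's date string (the last '_'-separated part of the stem) to its path
for file in Path(dir_files).glob('sacoronavirus_total_number_vaccinated_*.csv'):
    dico[file.stem.split('_')[-1]] = file
# Pick the key that parses to the latest date
max_date = max(dico, key=lambda x: datetime.strptime(x, '%Y-%m-%d'))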
Then you can use pandas.read_csv and pass it the file path to create a dataframe.
df = pd.read_csv(dico[max_date])
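If the folder might contain no matching file, or a file whose name does not end in a valid date, the same idea can be wrapped in a small function. This is only a sketch under assumptions: find_latest_csv, its parameters, and the other_prefix_ file are made up for illustration, and the demo runs in a temporary directory rather than the real folder.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def find_latest_csv(folder, prefix):
    """Return the path whose filename ends in the latest YYYY-MM-DD date.

    Hypothetical helper: only files named '<prefix><date>.csv' are considered.
    """
    dated = {}
    for path in Path(folder).glob(f"{prefix}*.csv"):
        try:
            # Whatever follows the prefix in the stem should be the date.
            dated[datetime.strptime(path.stem[len(prefix):], "%Y-%m-%d")] = path
        except ValueError:
            continue  # skip names that do not end in a valid date
    if not dated:
        raise FileNotFoundError(f"no dated files matching {prefix}*.csv in {folder}")
    return dated[max(dated)]


# Demo in a throwaway directory, so nothing real is touched.
PREFIX = "sacoronavirus_total_number_vaccinated_type_"
with tempfile.TemporaryDirectory() as tmp:
    for name in (PREFIX + "2022-08-31.csv", PREFIX + "2022-10-31.csv",
                 PREFIX + "2022-09-30.csv", "other_prefix_2023-01-31.csv"):
        (Path(tmp) / name).touch()
    latest_name = find_latest_csv(tmp, PREFIX).name

print(latest_name)
```

The returned path can be passed straight to pd.read_csv, as in the line above.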