I have files in one directory/folder named:
2022-07-31_DATA_GVAX_ARPA_COMBINED.csv
2022-08-31_DATA_GVAX_ARPA_COMBINED.csv
2022-09-30_DATA_GVAX_ARPA_COMBINED.csv
A new file in the same format will be added to the folder each month, e.g.:
2022-10-31_DATA_GVAX_ARPA_COMBINED.csv
2022-11-30_DATA_GVAX_ARPA_COMBINED.csv
I want to only load the most recent month's .csv into a pandas dataframe, not all the files. How can I do this (maybe using glob)?
I have seen this done for names where the date comes last, after a fixed prefix:
from pathlib import Path
dir_files = r'/path/to/folder'
dico = {}
for file in Path(dir_files).glob('DATA_GVAX_COMBINED_*.csv'):
    dico[file.stem.split('_')[-1]] = file
max_date = max(dico)
CodePudding user response:
You could try something like this:
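import pandas as pd
from pathlib import Path
# Point at the folder itself; with a trailing '/*', glob() would search inside a directory literally named '*'.
dir_files = r'/path/to/folder'
dico = {}
for file in Path(dir_files).glob('*DATA_GVAX_ARPA_COMBINED*.csv'):
    # The date is the part of the filename before the first underscore.
    date_value = pd.to_datetime(file.name.split('_')[0], errors="coerce")
    if pd.notna(date_value):
        dico[date_value] = file
max_date = max(dico.keys())
filepath = dico[max_date]
print(f'{max_date} -> {filepath.name}')
# Prints:
#
# 2022-10-31 00:00:00 -> 2022-10-31_DATA_GVAX_ARPA_COMBINED.csv
# Load only the most recent file.
df = pd.read_csv(filepath)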
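If you would rather not use pandas for the date parsing, the same idea can be sketched with the standard library only. This is a minimal sketch, not a definitive implementation: the helper name latest_monthly_csv is made up for illustration, and it assumes every file of interest is named exactly YYYY-MM-DD_DATA_GVAX_ARPA_COMBINED.csv. Anything else in the folder is skipped.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional

# Assumed naming scheme, taken from the filenames in the question.
_NAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_DATA_GVAX_ARPA_COMBINED\.csv$")


def latest_monthly_csv(folder) -> Optional[Path]:
    """Return the file with the newest date prefix, or None if nothing matches.

    Hypothetical helper, named here for illustration only.
    """
    dated = []
    for path in Path(folder).iterdir():
        match = _NAME_RE.match(path.name)
        if match is None:
            continue  # not one of the monthly files
        try:
            file_date = date.fromisoformat(match.group(1))
        except ValueError:
            continue  # looks like a date but is not valid, e.g. 2022-13-45
        dated.append((file_date, path))
    if not dated:
        return None
    # Compare on the parsed date only, then hand back the matching path.
    return max(dated, key=lambda pair: pair[0])[1]
```

The returned path can be passed straight to pd.read_csv once you have checked that it is not None.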
CodePudding user response:
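Glob the directory for files matching the known name pattern, then sort on the basename. Because every name starts with a zero-padded YYYY-MM-DD date, a plain string sort puts the files in chronological order, so the last item is the most recent one.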
from glob import glob as GLOB
from os.path import join as JOIN, basename as BASENAME
def get_latest(directory):
    # Returns None when no file matches the pattern.
    if all_files := list(GLOB(JOIN(directory, '*_DATA_GVAX_ARPA_COMBINED.csv'))):
        return sorted(all_files, key=BASENAME)[-1]
print(get_latest('/Users/Cobra'))
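Sorting on the name works only because of how the dates are written. The short check below is a sketch of that reasoning: it compares a string sort with a sort on the parsed date for the filenames from the question, and shows a case where the shortcut fails. The date_prefix helper and the unpadded names are made up for illustration.

```python
from datetime import date

names = [
    "2022-09-30_DATA_GVAX_ARPA_COMBINED.csv",
    "2022-11-30_DATA_GVAX_ARPA_COMBINED.csv",
    "2022-07-31_DATA_GVAX_ARPA_COMBINED.csv",
    "2022-10-31_DATA_GVAX_ARPA_COMBINED.csv",
]


def date_prefix(name):
    # Hypothetical helper: parse the part before the first underscore as a date.
    return date.fromisoformat(name.split("_")[0])


by_name = sorted(names)                   # plain string order
by_date = sorted(names, key=date_prefix)  # true chronological order
latest = max(names)                       # no full sort needed for one item

# Hypothetical counterexample: without zero padding, string order breaks,
# because the character '9' compares greater than '1'.
unpadded = ["2022-9-30_x.csv", "2022-10-31_x.csv"]
wrong_latest = max(unpadded)
```

Here by_name and by_date come out identical, while wrong_latest is the September file, not the October one. If the naming scheme could ever drift from zero-padded dates, parsing the date is the safer option.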