How to select specific CSV files for a specified date range from a folder in Python?

I have a folder (in the same directory as the Python script) with a lot of CSV files, one per day from 1st Jan to 31st Dec. I want to read only the CSV files within a certain date range from that folder into Python and then append them to a list.

The files are named as below and there are files for each day of multiple months:

BANK_NIFTY_5MINs_2020-02-01.csv, BANK_NIFTY_5MINs_2020-02-02.csv, ... BANK_NIFTY_5MINs_2020-02-28.csv, BANK_NIFTY_5MINs_2020-03-01.csv, ... BANK_NIFTY_5MINs_2020-03-31.csv and so on.

Currently, I have code that fetches the CSV files for the whole month of March using 'startswith' and 'endswith'. However, this only lets me target the files of one month at a time. I want to be able to read the CSV files for multiple months within a specified date range, for example Oct, Nov and Dec, or Feb and March (basically, start and end at any month).

The following code gets only the files for March.

import os

# Collect the matching CSV files from the current directory tree
all_files = []
path = os.getcwd()
for root, dirs, files in os.walk(path):
    for file in files:
        # str.startswith does not expand wildcards, so match the literal prefix
        if file.startswith("BANK_NIFTY_5MINs_2020-03") and file.endswith(".csv"):
            all_files.append(os.path.join(root, file))

CodePudding user response：

If you want to do it with regex, here it is:

# replace `file.startswith(...) and file.endswith(...)`
re.match('BANK_NIFTY_5MINs_2020-(02|03|10|11|12)-[0-9]+', file)
###                              ^^^^^^^^^^^^^^ Feb, Mar, Oct-Dec

It's the most basic one to get you going; it could be improved.
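
As a rough sketch of how that pattern could be wired into a filter, the snippet below builds the month alternation from a list of month numbers. The helper name `filter_by_months` and the sample file names are made up for illustration; `re.fullmatch` is used instead of `re.match` so that the `.csv` ending is checked as well.

```python
import re

PREFIX = "BANK_NIFTY_5MINs_"  # file-name prefix taken from the question


def filter_by_months(filenames, year, months):
    """Return the names of the form PREFIX + YYYY-MM-DD.csv for the given months.

    Hypothetical helper for illustration; `months` is an iterable of ints 1-12.
    """
    month_alt = "|".join(f"{m:02d}" for m in months)  # e.g. "02|03|10"
    pattern = re.compile(
        re.escape(PREFIX) + rf"{year:04d}-({month_alt})-[0-9]{{2}}\.csv"
    )
    return [name for name in filenames if pattern.fullmatch(name)]


# Made-up sample names, only to show which ones pass the filter
names = [
    "BANK_NIFTY_5MINs_2020-02-01.csv",
    "BANK_NIFTY_5MINs_2020-03-31.csv",
    "BANK_NIFTY_5MINs_2020-04-01.csv",
    "BANK_NIFTY_5MINs_2020-10-15.csv",
    "notes.txt",
]
selected = filter_by_months(names, 2020, [2, 3, 10, 11, 12])
print(selected)
```

Note that the pattern only checks for two digits in the day position, so it does not validate that the date actually exists.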

But in your case I'd go with a simple glob:

import glob

# 0[2-3] matches months 02 and 03 (Feb and Mar)
all_files = glob.glob('./BANK_NIFTY_5MINs_2020-0[2-3]-*.csv')
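
A character class such as `0[2-3]` only works while the months share a leading digit. For a range like September to November, one option is to run one glob per month and concatenate the results. This is a minimal sketch, assuming the files sit directly in `folder`; `glob_months` is a hypothetical helper name, not part of the `glob` module.

```python
import glob
import os


def glob_months(folder, year, months):
    """Collect CSV paths by running one glob pattern per month.

    Hypothetical helper; the file-name prefix is the one from the question.
    """
    paths = []
    for month in months:
        pattern = os.path.join(
            glob.escape(folder),  # protect special characters in the folder name
            f"BANK_NIFTY_5MINs_{year:04d}-{month:02d}-*.csv",
        )
        # glob.glob returns matches in arbitrary order, so sort each month
        paths.extend(sorted(glob.glob(pattern)))
    return paths


# Example: September through November 2020 in the current directory
all_files = glob_months(".", 2020, range(9, 12))
```

This still selects whole months only; a range that starts or ends mid-month needs the dates to be parsed from the file names.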

CodePudding user response：

I would take a different approach for more flexibility:

import os
from datetime import datetime
from pprint import pprint
from typing import Union


def quick_str_to_date(s: str) -> datetime:
    # Parse a date string such as "2020-02-01"
    return datetime.strptime(s, "%Y-%m-%d")


def get_file_by_date_range(path: str, startdate: Union[datetime, str], enddate: Union[datetime, str]) -> list:
    # Accept either datetime objects or "YYYY-MM-DD" strings for the bounds
    if isinstance(startdate, str):
        startdate = quick_str_to_date(startdate)
    if isinstance(enddate, str):
        enddate = quick_str_to_date(enddate)
    result = []
    for root, dirs, files in os.walk(path):
        for filename in files:
            if filename.startswith("BANK_NIFTY_5MINs_") and filename.lower().endswith(".csv"):
                # The date is embedded in the file name, so strptime can parse it directly
                file_date = datetime.strptime(filename, "BANK_NIFTY_5MINs_%Y-%m-%d.csv")
                if startdate <= file_date <= enddate:
                    result.append(filename)
    return result


print("all")
pprint(get_file_by_date_range("path/to/files", "2000-01-01", "2100-12-31"))

print("\nFebruary")
pprint(get_file_by_date_range("path/to/files", "2020-02-01", "2020-02-28"))

print("\none day")
pprint(get_file_by_date_range("path/to/files", "2020-02-01", "2020-02-01"))

Output:

all
['BANK_NIFTY_5MINs_2020-02-01.csv',
 'BANK_NIFTY_5MINs_2020-02-02.csv',
 'BANK_NIFTY_5MINs_2020-02-28.csv',
 'BANK_NIFTY_5MINs_2020-03-01.csv',
 'BANK_NIFTY_5MINs_2020-03-31.csv']

February
['BANK_NIFTY_5MINs_2020-02-01.csv',
 'BANK_NIFTY_5MINs_2020-02-02.csv',
 'BANK_NIFTY_5MINs_2020-02-28.csv']

one day
['BANK_NIFTY_5MINs_2020-02-01.csv']
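
The function above returns bare file names, and it raises `ValueError` if a file has the right prefix but no valid date in its name. A possible variant, sketched here with `pathlib`, returns full paths and skips such files. The name `select_csv_paths` is hypothetical, and the sketch assumes the `BANK_NIFTY_5MINs_YYYY-MM-DD.csv` naming from the question.

```python
from datetime import date
from pathlib import Path

PREFIX = "BANK_NIFTY_5MINs_"  # naming scheme taken from the question


def select_csv_paths(folder, start, end):
    """Return sorted paths of PREFIX + YYYY-MM-DD.csv files with start <= date <= end.

    `start` and `end` are ISO date strings such as "2020-02-01".
    """
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    selected = []
    # rglob also descends into subfolders, similar to os.walk
    for path in Path(folder).rglob(PREFIX + "*.csv"):
        try:
            file_date = date.fromisoformat(path.stem[len(PREFIX):])
        except ValueError:
            continue  # prefix matched, but the rest is not a valid date
        if first <= file_date <= last:
            selected.append(path)
    return sorted(selected)
```

For October to December, this could be called as `select_csv_paths(".", "2020-10-01", "2020-12-31")`, and the returned paths can then be opened one by one.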