Extract all files with dated names that belong to a specific time period using Python

Time:02-19

Suppose we have 30 files in a folder, named as follows:

1234567_val_2022-02-01.csv 1234567_train_2022-02-01.csv 1234567_test_2022-02-01.csv
1234567_val_2022-02-02.csv 1234567_train_2022-02-02.csv 1234567_test_2022-02-02.csv
1234567_val_2022-02-03.csv 1234567_train_2022-02-03.csv 1234567_test_2022-02-03.csv
1234567_val_2022-02-04.csv 1234567_train_2022-02-04.csv 1234567_test_2022-02-04.csv
1234567_val_2022-02-05.csv 1234567_train_2022-02-05.csv 1234567_test_2022-02-05.csv
1234568_val_2022-02-01.csv 1234568_train_2022-02-01.csv 1234568_test_2022-02-01.csv
1234568_val_2022-02-02.csv 1234568_train_2022-02-02.csv 1234568_test_2022-02-02.csv
1234568_val_2022-02-03.csv 1234568_train_2022-02-03.csv 1234568_test_2022-02-03.csv
1234568_val_2022-02-04.csv 1234568_train_2022-02-04.csv 1234568_test_2022-02-04.csv
1234568_val_2022-02-05.csv 1234568_train_2022-02-05.csv 1234568_test_2022-02-05.csv

where the first seven characters (1234567, 1234568, ...) are a unique ID and 2022-02-01, 2022-02-02, ... are dates in the format %Y-%m-%d. How can we list all train, test and val .csv files between 2022-02-01 and 2022-02-03 (inclusive) in Python?

Expected output:

train files between 2022-02-01 and 2022-02-03:

1234567_train_2022-02-01.csv 1234567_train_2022-02-02.csv 1234567_train_2022-02-03.csv 1234568_train_2022-02-01.csv 1234568_train_2022-02-02.csv 1234568_train_2022-02-03.csv

test files between 2022-02-01 and 2022-02-03:

1234567_test_2022-02-01.csv 1234567_test_2022-02-02.csv 1234567_test_2022-02-03.csv 1234568_test_2022-02-01.csv 1234568_test_2022-02-02.csv 1234568_test_2022-02-03.csv

val files between 2022-02-01 and 2022-02-03:

1234567_val_2022-02-01.csv 1234567_val_2022-02-02.csv 1234567_val_2022-02-03.csv 1234568_val_2022-02-01.csv 1234568_val_2022-02-02.csv 1234568_val_2022-02-03.csv

CodePudding user response:

I would suggest using the parse library (https://pypi.org/project/parse/), which extracts fields from a string using a format-style pattern:

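import parse

name = "1234567_train_2022-02-01.csv"
# Fields, in order: id, type, year, month, day (all returned as strings)
result = parse.parse("{}_{}_{}-{}-{}.csv", name)
print(result)
# parse.parse() returns None when the name does not match the pattern.
# Zero-padded day strings compare correctly as text, so '01' <= day <= '03' works.
if result is not None and result[2] == '2022' and result[3] == '02' and '01' <= result[4] <= '03':
    print(name)  # do something with the matching file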

CodePudding user response:

Another approach is to extract the date from each filename and convert it to a datetime.datetime. We can then iterate over the files and check whether each date falls between the start and end of the period.

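import datetime
import os

# Pair each filename with the date parsed from the last 10 characters of its stem
files = [[datetime.datetime.strptime(os.path.splitext(file)[0][-10:], '%Y-%m-%d'), file] for file in os.listdir()]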

start = datetime.datetime(2022, 2, 1)
end = datetime.datetime(2022, 2, 3)

for file in files:
    if start <= file[0] <= end:
        print(file[1])

To retrieve the date, we strip the extension with os.path.splitext() and keep only the last 10 characters of what remains, which hold the date in YYYY-MM-DD form.
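One caveat: os.listdir() returns every entry in the directory, so any name that does not end in a date makes strptime() raise ValueError. Below is a minimal sketch of a more defensive variant. The helper name files_in_range is hypothetical, and it assumes the filenames are passed in as a list rather than read from disk:

```python
import datetime
import os


def files_in_range(names, start, end):
    """Return the names whose trailing YYYY-MM-DD date lies in [start, end]."""
    selected = []
    for name in names:
        stem = os.path.splitext(name)[0]  # drop the ".csv" extension
        try:
            day = datetime.datetime.strptime(stem[-10:], "%Y-%m-%d").date()
        except ValueError:
            continue  # name does not end in a valid date, so skip it
        if start <= day <= end:
            selected.append(name)
    return selected


# Hypothetical sample names; in practice this could be os.listdir(some_folder)
names = ["1234567_train_2022-02-01.csv", "1234567_train_2022-02-04.csv", "notes.txt"]
print(files_in_range(names, datetime.date(2022, 2, 1), datetime.date(2022, 2, 3)))
# prints ['1234567_train_2022-02-01.csv']
```

Both ends of the range are inclusive, matching the start <= date <= end comparison used above.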


To group the matching files by their type (train, test or val):

files = [os.path.splitext(file)[0].split('_') for file in os.listdir()]

start = datetime.datetime(2022, 2, 1)
end = datetime.datetime(2022, 2, 3)

valid_files = {}
for file in files:
    if start <= datetime.datetime.strptime(file[-1], '%Y-%m-%d') <= end:
        filename = "_".join(file) + ".csv"
        if file[1] in valid_files:
            valid_files[file[1]].append(filename)
        else:
            valid_files[file[1]] = [filename]
print(valid_files)

Output:

{
    'val': ['1234567_val_2022-02-01.csv', '1234567_val_2022-02-02.csv', '1234567_val_2022-02-03.csv', '1234568_val_2022-02-01.csv', '1234568_val_2022-02-02.csv', '1234568_val_2022-02-03.csv'],
    'train': ['1234567_train_2022-02-01.csv', '1234567_train_2022-02-02.csv', '1234567_train_2022-02-03.csv', '1234568_train_2022-02-01.csv', '1234568_train_2022-02-02.csv', '1234568_train_2022-02-03.csv'],
    'test': ['1234567_test_2022-02-01.csv', '1234567_test_2022-02-02.csv', '1234567_test_2022-02-03.csv', '1234568_test_2022-02-01.csv', '1234568_test_2022-02-02.csv', '1234568_test_2022-02-03.csv']
 }
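Since the filenames follow a fixed id_type_date.csv pattern, the same grouping could also be sketched with a regular expression and collections.defaultdict. This is an alternative to the answer above, not a replacement for it. The function name group_by_type and the pattern are assumptions based on the example filenames in the question:

```python
import re
from collections import defaultdict
from datetime import date

# Assumed layout: <digits>_<train|test|val>_<YYYY-MM-DD>.csv
PATTERN = re.compile(r"^(\d+)_(train|test|val)_(\d{4}-\d{2}-\d{2})\.csv$")


def group_by_type(names, start, end):
    """Group filenames dated within [start, end] by their type field."""
    groups = defaultdict(list)
    for name in sorted(names):  # sort so the result does not depend on input order
        match = PATTERN.match(name)
        if match is None:
            continue  # unrelated file
        try:
            day = date.fromisoformat(match.group(3))
        except ValueError:
            continue  # looks like a date but is not a real one, e.g. 2022-02-31
        if start <= day <= end:
            groups[match.group(2)].append(name)
    return dict(groups)


# Hypothetical sample names; in practice this could be os.listdir(some_folder)
names = [
    "1234567_train_2022-02-01.csv",
    "1234568_test_2022-02-03.csv",
    "1234567_val_2022-02-05.csv",
    "README.md",
]
print(group_by_type(names, date(2022, 2, 1), date(2022, 2, 3)))
```

Matching the whole name up front means unrelated files in the folder are skipped instead of raising an error, and sorting the names keeps each list in a predictable order.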