Suppose we have 30 files in a folder, named as follows:
1234567_val_2022-02-01.csv 1234567_train_2022-02-01.csv 1234567_test_2022-02-01.csv
1234567_val_2022-02-02.csv 1234567_train_2022-02-02.csv 1234567_test_2022-02-02.csv
1234567_val_2022-02-03.csv 1234567_train_2022-02-03.csv 1234567_test_2022-02-03.csv
1234567_val_2022-02-04.csv 1234567_train_2022-02-04.csv 1234567_test_2022-02-04.csv
1234567_val_2022-02-05.csv 1234567_train_2022-02-05.csv 1234567_test_2022-02-05.csv
1234568_val_2022-02-01.csv 1234568_train_2022-02-01.csv 1234568_test_2022-02-01.csv
1234568_val_2022-02-02.csv 1234568_train_2022-02-02.csv 1234568_test_2022-02-02.csv
1234568_val_2022-02-03.csv 1234568_train_2022-02-03.csv 1234568_test_2022-02-03.csv
1234568_val_2022-02-04.csv 1234568_train_2022-02-04.csv 1234568_test_2022-02-04.csv
1234568_val_2022-02-05.csv 1234568_train_2022-02-05.csv 1234568_test_2022-02-05.csv
where the first seven characters (1234567, 1234568, ...) are a unique ID and 2022-02-01, 2022-02-02, ... are dates in the format %Y-%m-%d. How can we list all train, test and val .csv files between 2022-02-01 and 2022-02-03 in Python?
Expected output:
train files between 2022-02-01 and 2022-02-03:
1234567_train_2022-02-01.csv 1234567_train_2022-02-02.csv 1234567_train_2022-02-03.csv 1234568_train_2022-02-01.csv 1234568_train_2022-02-02.csv 1234568_train_2022-02-03.csv
test files between 2022-02-01 and 2022-02-03:
1234567_test_2022-02-01.csv 1234567_test_2022-02-02.csv 1234567_test_2022-02-03.csv 1234568_test_2022-02-01.csv 1234568_test_2022-02-02.csv 1234568_test_2022-02-03.csv
val files between 2022-02-01 and 2022-02-03:
1234567_val_2022-02-01.csv 1234567_val_2022-02-02.csv 1234567_val_2022-02-03.csv 1234568_val_2022-02-01.csv 1234568_val_2022-02-02.csv 1234568_val_2022-02-03.csv
CodePudding user response:
I would suggest using the parse library (https://pypi.org/project/parse/), which extracts the fields of a string from a format-style pattern:
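import parse

name = "1234567_train_2022-02-01.csv"
# Fields in order: id, type, year, month, day (all returned as strings)
result = parse.parse("{}_{}_{}-{}-{}.csv", name)
print(result)
# Zero-padded strings compare in the same order as the numbers they represent
if result[2] == '2022' and result[3] == '02' and '01' <= result[4] <= '03':
    print(name)  # do something with the matching file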
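If installing a third-party package is not an option, the same field extraction can be sketched with the standard-library re module. This is a swapped-in technique, not the parse API. The pattern and the helper name split_name are assumptions made for illustration. Because the dates are zero-padded, they can be compared as plain strings.

```python
import re

# Hypothetical pattern for names like "<id>_<type>_<YYYY-MM-DD>.csv".
PATTERN = re.compile(r"(\d+)_(train|test|val)_(\d{4}-\d{2}-\d{2})\.csv")


def split_name(name):
    """Return (file_id, file_type, date_text), or None if the name does not match."""
    match = PATTERN.fullmatch(name)
    return match.groups() if match else None


name = "1234567_train_2022-02-01.csv"
fields = split_name(name)
# Zero-padded YYYY-MM-DD strings sort in the same order as the dates themselves.
if fields and "2022-02-01" <= fields[2] <= "2022-02-03":
    print(fields)
```

Names that do not follow the pattern yield None, so unrelated files in the folder are skipped rather than raising an error.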
CodePudding user response:
Another approach is to extract the date from the filename and convert it to a datetime.datetime object. We can then iterate over the files and compare each date with the start and end of the range.
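import datetime
import os

# Pair each filename with the date taken from the last 10 characters of its stem.
# Assumes the current directory contains only the dated .csv files.
files = [[datetime.datetime.strptime(os.path.splitext(file)[0][-10:], '%Y-%m-%d'), file] for file in os.listdir()]
start = datetime.datetime(2022, 2, 1)
end = datetime.datetime(2022, 2, 3)
for file in files:
    # Both bounds are inclusive
    if start <= file[0] <= end:
        print(file[1])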
To retrieve the date, we used os.path.splitext() to strip the extension and then kept only the last 10 characters of the remaining name.
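The snippet above fails with a ValueError if the folder contains anything that does not end in a date. A minimal sketch of a more tolerant variant follows. It takes a list of names instead of calling os.listdir() so that it can be tried without touching the filesystem. The function name files_in_range and the sample names are assumptions made for illustration.

```python
import datetime
import os


def files_in_range(names, start, end):
    """Return the names whose trailing YYYY-MM-DD date lies in [start, end]."""
    selected = []
    for name in names:
        stem = os.path.splitext(name)[0]
        try:
            date = datetime.datetime.strptime(stem[-10:], "%Y-%m-%d")
        except ValueError:
            continue  # not a dated file, e.g. a README or a subfolder
        if start <= date <= end:
            selected.append(name)
    return selected


# Hypothetical sample; in practice this could be os.listdir().
names = ["1234567_val_2022-02-01.csv", "1234567_val_2022-02-04.csv", "notes.txt"]
start = datetime.datetime(2022, 2, 1)
end = datetime.datetime(2022, 2, 3)
print(files_in_range(names, start, end))
```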
To group the files by their type (train, test or val):
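import datetime
import os

# Split each stem into [id, type, date], e.g. ['1234567', 'train', '2022-02-01']
files = [os.path.splitext(file)[0].split('_') for file in os.listdir()]
start = datetime.datetime(2022, 2, 1)
end = datetime.datetime(2022, 2, 3)
valid_files = {}
for file in files:
    if start <= datetime.datetime.strptime(file[-1], '%Y-%m-%d') <= end:
        # Rebuild the original filename from its parts
        filename = "_".join(file) + ".csv"
        # file[1] is the type; use it as the dictionary key
        if file[1] in valid_files:
            valid_files[file[1]].append(filename)
        else:
            valid_files[file[1]] = [filename]
print(valid_files)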
Output:
{
'val': ['1234567_val_2022-02-01.csv', '1234567_val_2022-02-02.csv', '1234567_val_2022-02-03.csv', '1234568_val_2022-02-01.csv', '1234568_val_2022-02-02.csv', '1234568_val_2022-02-03.csv'],
'train': ['1234567_train_2022-02-01.csv', '1234567_train_2022-02-02.csv', '1234567_train_2022-02-03.csv', '1234568_train_2022-02-01.csv', '1234568_train_2022-02-02.csv', '1234568_train_2022-02-03.csv'],
'test': ['1234567_test_2022-02-01.csv', '1234567_test_2022-02-02.csv', '1234567_test_2022-02-03.csv', '1234568_test_2022-02-01.csv', '1234568_test_2022-02-02.csv', '1234568_test_2022-02-03.csv']
}
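To print the result in the layout the question asks for, the grouping can be wrapped in a small function and the dictionary walked afterwards. This is only a sketch: the name group_by_type and the shortened sample list are assumptions, and every name is assumed to follow the id_type_date.csv pattern.

```python
import datetime
from collections import defaultdict


def group_by_type(names, start, end):
    """Group in-range filenames by their type field (train, test or val)."""
    groups = defaultdict(list)
    for name in names:
        # Assumes well-formed names: "<id>_<type>_<YYYY-MM-DD>.csv"
        file_id, file_type, date_text = name[:-len(".csv")].split("_")
        date = datetime.datetime.strptime(date_text, "%Y-%m-%d")
        if start <= date <= end:
            groups[file_type].append(name)
    return dict(groups)


# Hypothetical shortened sample; in practice this could be os.listdir().
names = [
    "1234567_train_2022-02-01.csv",
    "1234567_test_2022-02-03.csv",
    "1234567_train_2022-02-05.csv",
    "1234568_train_2022-02-02.csv",
]
start = datetime.datetime(2022, 2, 1)
end = datetime.datetime(2022, 2, 3)
for file_type, matched in group_by_type(names, start, end).items():
    print(f"{file_type} files between {start:%Y-%m-%d} and {end:%Y-%m-%d}:")
    print(" ".join(matched))
```

Using defaultdict(list) removes the need for the explicit "key already present" check in the loop.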