How to identify data gaps based on filenames in Python?

Time:02-17

I have a folder located at

C:\Users\StoreX\Downloads\Binance futures data\AliceUSDT-Mark_Prices_Klines_1h_Timeframe

which contains only 253 CSV files with the following filenames:

1. ALICEUSDT-1h-2021-06-01.csv
2. ALICEUSDT-1h-2021-06-02.csv
3. ALICEUSDT-1h-2021-06-03.csv
4. ALICEUSDT-1h-2021-06-06.csv
5. ALICEUSDT-1h-2021-06-09.csv
6. ALICEUSDT-1h-2021-06-11.csv
7. ALICEUSDT-1h-2021-06-12.csv
.
.
.
253. ALICEUSDT-1h-2022-02-13.csv

Each of those files contains the hourly price action of a particular asset, 24 rows in total (no column names), so it can be assumed that each filename corresponds to the price action data recorded for that asset on a particular date.

However, if you look closely at the example above, there are some files missing at the very beginning, which are:

ALICEUSDT-1h-2021-06-04.csv
ALICEUSDT-1h-2021-06-05.csv
ALICEUSDT-1h-2021-06-07.csv
ALICEUSDT-1h-2021-06-08.csv
ALICEUSDT-1h-2021-06-10.csv

This means I cannot use the files dated before the last missing file when developing a trading strategy, because the series would not be continuous.

So I first need to detect which files are missing based on their names, and then decide where to start plotting the price action so that all of the gaps are avoided.

Update: Here's what I have done so far:

import os
import datetime

def check_path(infile):
    # os.listdir() below needs a directory, so check for one explicitly
    return os.path.isdir(infile)

first_entry = input('Tell me the path where your csv files are located at: ')

# Keep asking until the user enters a path that exists
while not check_path(first_entry):
    print('\n')
    print('This PATH is invalid!')
    first_entry = input('Tell me the RIGHT PATH in which your csv files are located: ')

print('\n')

for name in os.listdir(first_entry):
    if name.endswith(".csv"):
        print((name.partition('-')[-1]).partition('-')[-1].removesuffix(".csv"))

Output:

2021-06-01
2021-06-02
2021-06-03
2021-06-06
2021-06-09
.
.
.
2022-02-13

Any ideas?

CodePudding user response:

If I understand correctly, you have a list of dates and want to find out which dates are missing when you compare that list against the full date range between its minimum and maximum date. Sets can help, for example:

import re
from datetime import datetime, timedelta

l = ["ALICEUSDT-1h-2021-06-01.csv",
     "ALICEUSDT-1h-2021-06-02.csv",
     "ALICEUSDT-1h-2021-06-03.csv",
     "ALICEUSDT-1h-2021-06-06.csv",
     "ALICEUSDT-1h-2021-06-09.csv",
     "ALICEUSDT-1h-2021-06-11.csv",
     "ALICEUSDT-1h-2021-06-12.csv"]

# extract the dates, you don't have to use a regex here, it's more for convenience
d = [re.search(r"[0-9]{4}\-[0-9]{2}\-[0-9]{2}", s).group() for s in l]

# to datetime
d = [datetime.fromisoformat(s) for s in d]

# now make a date range based on min and max dates in d
r = [min(d) + timedelta(n) for n in range((max(d) - min(d)).days + 1)]

# ...so we can do a membership test with sets to find out what is missing...
missing = set(r) - set(d)

sorted(missing)
[datetime.datetime(2021, 6, 4, 0, 0),
 datetime.datetime(2021, 6, 5, 0, 0),
 datetime.datetime(2021, 6, 7, 0, 0),
 datetime.datetime(2021, 6, 8, 0, 0),
 datetime.datetime(2021, 6, 10, 0, 0)]
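
To connect this back to the original goal (deciding where to start plotting so that no gap is included), the same set-difference idea can be wrapped in small functions. This is only a sketch: the names `dates_from_names`, `missing_dates` and `first_gap_free_date` are hypothetical helpers, not part of any library, and it assumes every filename contains exactly one `YYYY-MM-DD` date, as in the question.

```python
import re
from datetime import date, timedelta

# Assumption: each filename contains one ISO date such as 2021-06-01
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def dates_from_names(names):
    """Return the sorted, de-duplicated dates found in the .csv filenames."""
    found = set()
    for name in names:
        match = DATE_RE.search(name)
        if name.endswith(".csv") and match:
            found.add(date.fromisoformat(match.group()))
    return sorted(found)

def missing_dates(dates):
    """Return the days between the earliest and latest date that are absent."""
    if not dates:
        return []
    first, last = min(dates), max(dates)
    full_range = {first + timedelta(days=n) for n in range((last - first).days + 1)}
    return sorted(full_range - set(dates))

def first_gap_free_date(dates):
    """Return the earliest date from which every following day is present."""
    if not dates:
        return None
    gaps = missing_dates(dates)
    if not gaps:
        return min(dates)
    # The day after the last missing day always exists, because the
    # latest date in the list is present by construction.
    return gaps[-1] + timedelta(days=1)

# The seven filenames listed in the question
sample = [
    "ALICEUSDT-1h-2021-06-01.csv",
    "ALICEUSDT-1h-2021-06-02.csv",
    "ALICEUSDT-1h-2021-06-03.csv",
    "ALICEUSDT-1h-2021-06-06.csv",
    "ALICEUSDT-1h-2021-06-09.csv",
    "ALICEUSDT-1h-2021-06-11.csv",
    "ALICEUSDT-1h-2021-06-12.csv",
]

sample_dates = dates_from_names(sample)
print(missing_dates(sample_dates))
print(first_gap_free_date(sample_dates))
```

For a real folder, `sample` could be replaced with `os.listdir(first_entry)` from the question's code. For the seven sample names, the last gap is 2021-06-10, so the gap-free stretch starts on 2021-06-11.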