Home > Software design >  Find missing dates in multiple date ranges for multi groups
Find missing dates in multiple date ranges for multi groups

Time:11-04

Im trying to extract missing date list for a date range in columns DATE FROM and DATE TO for multiple groups in column CURRENCY, the ranges is splited in multiple rows for each group column CURRENCY :

For example : Currency EUR have three date ranges in row 0, 1 and 2, and the missing range in the entire group is 2021-10-06 to 2021-10-10.

Missing ranges is highlited below using * symbol, This missing date range is what i desire in expected output.

df = pd.DataFrame({"CURRENCY":{"0":"EUR","1":"EUR","2":"EUR","3":"GBP","4":"GBP","5":"GBP","6":"USD","7":"USD","8":"SAR","9":"SAR"},
                   "DATE FROM":{"0":"2021-10-01","1":"2021-10-11","2":"2021-10-19","3":"2021-10-01","4":"2021-10-05",
                                "5":"2021-10-11","6":"2021-10-01","7":"2021-10-05","8":"2021-10-01","9":"2021-10-05"},
                   "DATE TO":{"0":"2021-10-05","1":"2021-10-18","2":"2021-10-23","3":"2021-10-04","4":"2021-10-07",
                              "5":"2021-10-18","6":"2021-10-02","7":"2021-10-10","8":"2021-10-01","9":"2021-10-10"}})

    CURRENCY    DATE FROM   DATE TO
0   EUR         2021-10-01  2021-10-05*
1   EUR         2021-10-11* 2021-10-18
2   EUR         2021-10-19  2021-10-23
3   GBP         2021-10-01  2021-10-04
4   GBP         2021-10-05  2021-10-07*
5   GBP         2021-10-11* 2021-10-18
6   USD         2021-10-01  2021-10-02*
7   USD         2021-10-05* 2021-10-10
8   SAR         2021-10-01  2021-10-01*
9   SAR         2021-10-05* 2021-10-10

Expected output:

    CURRENCY    MISSING
0   EUR         2021-10-06
1   EUR         2021-10-07
2   EUR         2021-10-08
3   EUR         2021-10-09
4   EUR         2021-10-10
5   GBP         2021-10-08
6   GBP         2021-10-09
7   GBP         2021-10-10
8   USD         2021-10-03
9   USD         2021-10-04
10  SAR         2021-10-02
11  SAR         2021-10-03
12  SAR         2021-10-04

Below is what i have tried, but it seems this is not efficient way to resolve, also im not sure how to group those output ranges to find the missing dates for each group (EUR,GBP...etc) :

date_from_list = df['DATE FROM'].to_list()
date_to_list = df['DATE TO'].to_list()
curr_list = df['CURRENCY'].to_list()

for date_from, date_to, curr in zip(date_from_list, date_to_list, curr_list):
    print(curr_list, pd.date_range(date_from, date_to))

CodePudding user response:

Both pd.date_range and pd.period_range can do. I have used date range which only has close either right or left, so you may have to do some filter. Code below

df= df.assign(end=df['DATE FROM'].shift(-1),start=df['DATE TO']).iloc[:-1 , :]#Define the start and end for date range

df=df.assign(Missing=df.apply(lambda x: pd.date_range(start=x['start'],  end=x['end'], closed='right').tolist(), axis = 1)).explode('Missing').drop_duplicates('Missing').drop(['start','end'],axis=1)

  CURRENCY   DATE FROM     DATE TO    Missing
0      EUR  2021-10-01  2021-10-05 2021-10-06
0      EUR  2021-10-01  2021-10-05 2021-10-07
0      EUR  2021-10-01  2021-10-05 2021-10-08
0      EUR  2021-10-01  2021-10-05 2021-10-09
0      EUR  2021-10-01  2021-10-05 2021-10-10
0      EUR  2021-10-01  2021-10-05 2021-10-11
1      EUR  2021-10-11  2021-10-18 2021-10-19
2      EUR  2021-10-19  2021-10-23        NaT
3      GBP  2021-10-01  2021-10-04 2021-10-05
6      USD  2021-10-01  2021-10-02 2021-10-03
6      USD  2021-10-01  2021-10-02 2021-10-04
8      SAR  2021-10-01  2021-10-01 2021-10-02

CodePudding user response:

You can use the following code to solve your problem, which is comparing lists of the total ranges for each currency with list of the individual ranges of each row. The filtering with the nested loops can likely be optimized.

import numpy as np
df['dateranges'] = df.apply(lambda x: pd.date_range(x['DATE FROM'],x['DATE TO']), axis=1)
total_ranges = (df.groupby("CURRENCY")
                .apply(lambda x: pd.date_range(x['DATE FROM'].min(),x['DATE TO'].max())))
individual_ranges = df.groupby("CURRENCY")['dateranges'].apply(list).apply(np.concatenate)

currency = []
missing = []
for curr in total_ranges.index:
    for date in total_ranges[curr]:
        if date not in individual_ranges[curr]:
            currency.append(curr)
            missing.append(date)

result = pd.DataFrame({'CURRENCY': currency,
                       'MISSING':missing})

CodePudding user response:

This solution uses a package called piso, (pandas interval set operations) and will be fast.

setup

df = pd.DataFrame({"CURRENCY":{"0":"EUR","1":"EUR","2":"EUR","3":"GBP","4":"GBP","5":"GBP","6":"USD","7":"USD","8":"SAR","9":"SAR"},
               "DATE FROM":{"0":"2021-10-01","1":"2021-10-11","2":"2021-10-19","3":"2021-10-01","4":"2021-10-05",
                            "5":"2021-10-11","6":"2021-10-01","7":"2021-10-05","8":"2021-10-01","9":"2021-10-05"},
               "DATE TO":{"0":"2021-10-05","1":"2021-10-18","2":"2021-10-23","3":"2021-10-04","4":"2021-10-07",
                          "5":"2021-10-18","6":"2021-10-02","7":"2021-10-10","8":"2021-10-01","9":"2021-10-10"}})

df["DATE FROM"] = pd.to_datetime(df["DATE FROM"])
df["DATE TO"] = pd.to_datetime(df["DATE TO"])   pd.Timedelta("1d")

Note I am changing data to be pandas.Timestamp and adding a day to the end dates. This is because you seem to be interpreting dates as periods, i.e. an entire day, where we will be working with instants in time.

solution

Groupby currency and create an interval index for each currency

interval_arrays = df.groupby("CURRENCY").apply(lambda d: pd.IntervalIndex.from_arrays(d["DATE FROM"], d["DATE TO"]))

interval_arrays looks like this

CURRENCY
EUR    IntervalIndex([(2021-10-01, 2021-10-06], (2021...
GBP    IntervalIndex([(2021-10-01, 2021-10-05], (2021...
SAR    IntervalIndex([(2021-10-01, 2021-10-02], (2021...
USD    IntervalIndex([(2021-10-01, 2021-10-03], (2021...
dtype: object

Create the missing intervals, as pandas.IntervalIndex using piso.complement.

import piso
missing = interval_arrays.apply(piso.complement)

missing looks like

CURRENCY
EUR    IntervalIndex([(2021-10-06, 2021-10-11]], ...
GBP    IntervalIndex([(2021-10-08, 2021-10-11]], ... 
SAR    IntervalIndex([(2021-10-02, 2021-10-05]], ...
USD    IntervalIndex([(2021-10-03, 2021-10-05]], ...
dtype: object

Next we need to convert the intervals in each IntervalIndex into date ranges and combine the date ranges into a single pandas.DatetimeIndex which gets wrapped up in a dataframe

def calc(currency, interval_index):
    date_ranges = [pd.date_range(i.left, i.right, closed="left") for i in interval_index]
    combined = date_ranges[0].union_many(date_ranges[1:])
    return pd.DataFrame({"MISSING":combined}).assign(CURRENCY=currency)

pd.concat([calc(currency, interval_index) for currency, interval_index in missing.iteritems() if len(interval_index) > 0])

The result is the following dataframe

    MISSING CURRENCY
0 2021-10-06      EUR
1 2021-10-07      EUR
2 2021-10-08      EUR
3 2021-10-09      EUR
4 2021-10-10      EUR
0 2021-10-08      GBP
1 2021-10-09      GBP
2 2021-10-10      GBP
0 2021-10-02      SAR
1 2021-10-03      SAR
2 2021-10-04      SAR
0 2021-10-03      USD
1 2021-10-04      USD
  • Related