Im trying to extract missing date list for a date range in columns DATE FROM
and DATE TO
for multiple groups in column CURRENCY
, the ranges is splited in multiple rows for each group column CURRENCY
:
For example : Currency EUR have three date ranges in row 0, 1 and 2, and the missing range in the entire group is 2021-10-06 to 2021-10-10.
Missing ranges is highlited below using *
symbol, This missing date range is what i desire in expected output.
df = pd.DataFrame({"CURRENCY":{"0":"EUR","1":"EUR","2":"EUR","3":"GBP","4":"GBP","5":"GBP","6":"USD","7":"USD","8":"SAR","9":"SAR"},
"DATE FROM":{"0":"2021-10-01","1":"2021-10-11","2":"2021-10-19","3":"2021-10-01","4":"2021-10-05",
"5":"2021-10-11","6":"2021-10-01","7":"2021-10-05","8":"2021-10-01","9":"2021-10-05"},
"DATE TO":{"0":"2021-10-05","1":"2021-10-18","2":"2021-10-23","3":"2021-10-04","4":"2021-10-07",
"5":"2021-10-18","6":"2021-10-02","7":"2021-10-10","8":"2021-10-01","9":"2021-10-10"}})
CURRENCY DATE FROM DATE TO
0 EUR 2021-10-01 2021-10-05*
1 EUR 2021-10-11* 2021-10-18
2 EUR 2021-10-19 2021-10-23
3 GBP 2021-10-01 2021-10-04
4 GBP 2021-10-05 2021-10-07*
5 GBP 2021-10-11* 2021-10-18
6 USD 2021-10-01 2021-10-02*
7 USD 2021-10-05* 2021-10-10
8 SAR 2021-10-01 2021-10-01*
9 SAR 2021-10-05* 2021-10-10
Expected output:
CURRENCY MISSING
0 EUR 2021-10-06
1 EUR 2021-10-07
2 EUR 2021-10-08
3 EUR 2021-10-09
4 EUR 2021-10-10
5 GBP 2021-10-08
6 GBP 2021-10-09
7 GBP 2021-10-10
8 USD 2021-10-03
9 USD 2021-10-04
10 SAR 2021-10-02
11 SAR 2021-10-03
12 SAR 2021-10-04
Below is what i have tried, but it seems this is not efficient way to resolve, also im not sure how to group those output ranges to find the missing dates for each group (EUR,GBP...etc) :
date_from_list = df['DATE FROM'].to_list()
date_to_list = df['DATE TO'].to_list()
curr_list = df['CURRENCY'].to_list()
for date_from, date_to, curr in zip(date_from_list, date_to_list, curr_list):
print(curr_list, pd.date_range(date_from, date_to))
CodePudding user response:
Both pd.date_range and pd.period_range can do. I have used date range which only has close either right or left, so you may have to do some filter. Code below
df= df.assign(end=df['DATE FROM'].shift(-1),start=df['DATE TO']).iloc[:-1 , :]#Define the start and end for date range
df=df.assign(Missing=df.apply(lambda x: pd.date_range(start=x['start'], end=x['end'], closed='right').tolist(), axis = 1)).explode('Missing').drop_duplicates('Missing').drop(['start','end'],axis=1)
CURRENCY DATE FROM DATE TO Missing
0 EUR 2021-10-01 2021-10-05 2021-10-06
0 EUR 2021-10-01 2021-10-05 2021-10-07
0 EUR 2021-10-01 2021-10-05 2021-10-08
0 EUR 2021-10-01 2021-10-05 2021-10-09
0 EUR 2021-10-01 2021-10-05 2021-10-10
0 EUR 2021-10-01 2021-10-05 2021-10-11
1 EUR 2021-10-11 2021-10-18 2021-10-19
2 EUR 2021-10-19 2021-10-23 NaT
3 GBP 2021-10-01 2021-10-04 2021-10-05
6 USD 2021-10-01 2021-10-02 2021-10-03
6 USD 2021-10-01 2021-10-02 2021-10-04
8 SAR 2021-10-01 2021-10-01 2021-10-02
CodePudding user response:
You can use the following code to solve your problem, which is comparing lists of the total ranges for each currency with list of the individual ranges of each row. The filtering with the nested loops can likely be optimized.
import numpy as np
df['dateranges'] = df.apply(lambda x: pd.date_range(x['DATE FROM'],x['DATE TO']), axis=1)
total_ranges = (df.groupby("CURRENCY")
.apply(lambda x: pd.date_range(x['DATE FROM'].min(),x['DATE TO'].max())))
individual_ranges = df.groupby("CURRENCY")['dateranges'].apply(list).apply(np.concatenate)
currency = []
missing = []
for curr in total_ranges.index:
for date in total_ranges[curr]:
if date not in individual_ranges[curr]:
currency.append(curr)
missing.append(date)
result = pd.DataFrame({'CURRENCY': currency,
'MISSING':missing})
CodePudding user response:
This solution uses a package called piso
, (pandas interval set operations) and will be fast.
setup
df = pd.DataFrame({"CURRENCY":{"0":"EUR","1":"EUR","2":"EUR","3":"GBP","4":"GBP","5":"GBP","6":"USD","7":"USD","8":"SAR","9":"SAR"},
"DATE FROM":{"0":"2021-10-01","1":"2021-10-11","2":"2021-10-19","3":"2021-10-01","4":"2021-10-05",
"5":"2021-10-11","6":"2021-10-01","7":"2021-10-05","8":"2021-10-01","9":"2021-10-05"},
"DATE TO":{"0":"2021-10-05","1":"2021-10-18","2":"2021-10-23","3":"2021-10-04","4":"2021-10-07",
"5":"2021-10-18","6":"2021-10-02","7":"2021-10-10","8":"2021-10-01","9":"2021-10-10"}})
df["DATE FROM"] = pd.to_datetime(df["DATE FROM"])
df["DATE TO"] = pd.to_datetime(df["DATE TO"]) pd.Timedelta("1d")
Note I am changing data to be pandas.Timestamp
and adding a day to the end dates. This is because you seem to be interpreting dates as periods, i.e. an entire day, where we will be working with instants in time.
solution
Groupby currency and create an interval index for each currency
interval_arrays = df.groupby("CURRENCY").apply(lambda d: pd.IntervalIndex.from_arrays(d["DATE FROM"], d["DATE TO"]))
interval_arrays
looks like this
CURRENCY
EUR IntervalIndex([(2021-10-01, 2021-10-06], (2021...
GBP IntervalIndex([(2021-10-01, 2021-10-05], (2021...
SAR IntervalIndex([(2021-10-01, 2021-10-02], (2021...
USD IntervalIndex([(2021-10-01, 2021-10-03], (2021...
dtype: object
Create the missing intervals, as pandas.IntervalIndex
using piso.complement
.
import piso
missing = interval_arrays.apply(piso.complement)
missing
looks like
CURRENCY
EUR IntervalIndex([(2021-10-06, 2021-10-11]], ...
GBP IntervalIndex([(2021-10-08, 2021-10-11]], ...
SAR IntervalIndex([(2021-10-02, 2021-10-05]], ...
USD IntervalIndex([(2021-10-03, 2021-10-05]], ...
dtype: object
Next we need to convert the intervals in each IntervalIndex
into date ranges and combine the date ranges into a single pandas.DatetimeIndex
which gets wrapped up in a dataframe
def calc(currency, interval_index):
date_ranges = [pd.date_range(i.left, i.right, closed="left") for i in interval_index]
combined = date_ranges[0].union_many(date_ranges[1:])
return pd.DataFrame({"MISSING":combined}).assign(CURRENCY=currency)
pd.concat([calc(currency, interval_index) for currency, interval_index in missing.iteritems() if len(interval_index) > 0])
The result is the following dataframe
MISSING CURRENCY
0 2021-10-06 EUR
1 2021-10-07 EUR
2 2021-10-08 EUR
3 2021-10-09 EUR
4 2021-10-10 EUR
0 2021-10-08 GBP
1 2021-10-09 GBP
2 2021-10-10 GBP
0 2021-10-02 SAR
1 2021-10-03 SAR
2 2021-10-04 SAR
0 2021-10-03 USD
1 2021-10-04 USD