I currently have df1 and df2 below:
df1:
df1 = [('2021-04-29', 'MN02') ,
('2021-04-30', 'MN02'),
('2021-05-01', 'MN02'),
('2021-05-02', 'MN02'),
('2021-05-03', 'MN02'),
('2021-05-04', 'MN02'),
('2021-05-05', 'MN02'),
('2021-05-06', 'MN02'),
('2021-05-07', 'MN02'),
('2021-05-08', 'MN02'),
('2021-05-09', 'MN02')
]
df1 = pd.DataFrame(df1, columns=['Date', 'Group'])
and:
df2:
df2 = [(100, '2021-04-29', 'NA', '2021-04-29', 'MN02'),
(206, '2021-04-30', '2021-04-30', '2021-04-30' , 'MN02') ,
(629, '2021-04-30', '2021-04-30', '2021-04-30', 'MN02'),
(4000, '2021-04-30', '2021-04-30', '2021-04-30', 'MN02'),
(4001, '2021-05-01', '2021-05-01', 'NA' , 'MN02'),
(4002, '2021-05-01', '2021-05-01', '2021-05-01', 'MN02'),
(4003, '2021-05-03', '2021-05-03', '2021-05-03', 'MN02'),
(4004, '2021-05-03', '2021-05-03', '2021-05-03', 'MN02'),
(4005, '2021-05-03', '2021-05-03', '2021-05-06', 'MN02'),
(4006, '2021-05-03', '2021-05-03', '2021-05-06', 'MN02'),
(4007, '2021-05-04', '2021-05-04', '2021-05-04', 'MN02'),
(4008, '2021-05-04', '2021-05-04', '2021-05-04', 'MN02')
]
df2 = pd.DataFrame(df2, columns=['ClaimID', 'New', 'Approved', 'Processed', 'Group'])
Desired output:
df3 = [('2021-04-29', 1, 0, 1, 'MN02'),
('2021-04-30', 3, 3, 3, 'MN02'),
('2021-05-01', 2, 2, 1, 'MN02'),
('2021-05-02', 0, 1, 0, 'MN02'),
('2021-05-03', 4, 5, 2, 'MN02'),
('2021-05-04', 2, 5, 2, 'MN02'),
('2021-05-05', 0, 3, 0, 'MN02'),
('2021-05-06', 0, 3, 2, 'MN02'),
('2021-05-07', 0, 1, 0, 'MN02'),
('2021-05-08', 0, 1, 0, 'MN02'),
('2021-05-09', 0, 1, 0, 'MN02')
]
df3 = pd.DataFrame(df3, columns=['Date', 'New', 'Approved', 'Processed', 'Group'])
A few conditions that this follows:
If a claim is approved and processed on the same day then both, approved and processed, in df3 should show the number of approved and processed on that day and then reset to 0 or the number of app/proc on the next day. For example, on 2021-04-30, there were 3 approved and processed on the same day in df2 so df3 shows 3 and 3, respectively and then they reset to the number app/proc on the next day.
If more are approved than processed on a certain day, the remaining approved should carry over to the next day. For example, on 2021-05-01 2 were approved but only 1 was processed. In df2, we can see that the claim was in fact never processed (NA) so that remainder of 1 should stay for the rest of df3.
If there are multiple approved claims and they are processed on different days then the claims processed later than the approved date should carry over until they reach the processed date. For example, 2 approved claims on 2021-05-03 are processed on the same day and 2 are process on 2021-05-06. Those 2 processed on 2021-05-06 should carry over until Approved reaches 2021-05-06.
Any help would be appreciated and I can clarify as needed.
CodePudding user response:
def setter(series):
res = series.value_counts().reset_index().replace('NA', np.nan).dropna()
res['Date'] = pd.to_datetime(res['index'])
return res.drop(columns='index')
approved = setter(df.Approved)
processed = setter(df.Processed)
data = pd.merge(approved, processed, on='Date', how='outer').fillna(0).sort_values('Date')
data = pd.merge(pd.to_datetime(df1.Date), data, on='Date', how='outer').fillna(0)# remove [1:]
datacum = data[['Approved','Processed']].cumsum()
# datacum = datacum[datacum.Approved != 0]
data['Approved'] = (datacum.Approved - datacum.Processed).shift().fillna(0)
# data.fillna(0)
output:
Date Approved Processed
0 2021-04-29 0.0 1.0
1 2021-04-30 3.0 3.0
2 2021-05-01 2.0 1.0
3 2021-05-02 1.0 0.0
4 2021-05-03 5.0 2.0
5 2021-05-04 5.0 2.0
6 2021-05-05 3.0 0.0
7 2021-05-06 3.0 2.0
8 2021-05-07 1.0 0.0
9 2021-05-08 1.0 0.0
10 2021-05-09 1.0 0.0