Tracking claims using date/timestamp columns and creating a final count using pandas


I have an issue where I need to track the progression of patients' insurance claim statuses based on the dates of those statuses. I also need to create a count of statuses based on certain conditions.

DF:

ClaimID New Accepted Denied Pending Expired Group
001 2021-01-01T09:58:35:335Z 2021-01-01T10:05:43:000Z A
002 2021-01-01T06:30:30:000Z 2021-03-01T04:11:45:000Z 2021-03-01T04:11:53:000Z A
003 2021-02-14T14:23:54:154Z 2021-02-15T11:11:56:000Z 2021-02-15T11:15:00:000Z A
004 2021-02-14T15:36:05:335Z 2021-02-14T17:15:30:000Z A
005 2021-02-14T15:56:59:009Z 2021-03-01T10:05:43:000Z A

In the above dataset, we have 7 columns. ClaimID simply identifies the claim. New, Accepted, Denied, Pending, and Expired indicate the status of the claim and the day/time each status was set, and Group identifies the patient group.

What I need to do is get a count of how many claims are New on each day and how many move out of New into a new status. For example, there are 2 new claims on 2021-01-01. On that same day, 1 moved to Accepted about 7 minutes later. Thus on 2021-01-01 the table of counts would read:

DF_Count:

Date New Accepted Denied Pending Expired
2021-01-01 2 1 0 0 0
2021-01-02 1 0 0 0 0
2021-01-03 1 0 0 0 0
2021-01-04 1 0 0 0 0
2021-01-05 1 0 0 0 0
.... .... .... .... .... ....
2021-02-14 4 2 0 0 0
2021-02-15 2 3 0 0 1
2021-02-16 2 2 0 0 0

Few Conditions:

  1. If a claim moves from one status to another on the same day (even if they are minutes or hours apart), it is not subtracted from the original status until the next day. This can be seen on 2021-01-01, where claim 001 moves from New to Accepted on the same day, but the claim is not subtracted from New until 2021-01-02.
  2. Until something happens to a claim, it should remain in its original status. Claim 002 will remain in New until 2021-03-01, when it is Accepted.
  3. If a claim changes status on a later date than its original status, it will be subtracted on that later date. For this, see claim 003: it is New on 2/14 but Accepted on 2/15. This is why New goes down by 2 on 2/15 (the other claim is 004, which is New and Accepted on the same day).
  4. For certain statuses, I do not need to look at all columns. For example, for New I only look at the dates inside Accepted and Denied, not Pending and Expired. When I do these same steps for Accepted, I no longer need to look at New, just the other columns. How would I do that?
  5. In the final DF_Count table, the dates should start from the earliest date in 'New' and end on today's date.
  6. The code needs to be grouped by the Group Column as well. For example, patients in group B (not pictured) will have to have the same start and end date but for their own claims.
  7. I need to do this separately for all of the statuses. Not just new.
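
To make conditions 1–3 concrete, here is one possible way to compute the day a claim leaves New: take the earliest Accepted/Denied timestamp and push it back one day when the change happened on the same day the claim became New. This is only an illustrative sketch with made-up sample rows, not the asker's actual code:

```python
import pandas as pd

# Two toy claims: one accepted the same day it became New (claim 001's
# pattern), one accepted the next day (claim 003's pattern).
df = pd.DataFrame({
    "New":      pd.to_datetime(["2021-01-01 09:58", "2021-02-14 14:23"]),
    "Accepted": pd.to_datetime(["2021-01-01 10:05", "2021-02-15 11:11"]),
    "Denied":   pd.to_datetime([None, None]),
})

exit_ts = df[["Accepted", "Denied"]].min(axis=1)        # first status change
exit_day = exit_ts.dt.normalize()                       # calendar day of change
same_day = exit_day == df["New"].dt.normalize()         # condition 1
# Same-day changes count against New only from the following day.
df["LeavesNew"] = exit_day + pd.to_timedelta(same_day.astype(int), unit="D")
```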

Current Solution:

My current solution has been to create a dataset with just the dates from the minimum New date to today's date. Then for each column, I use the .loc method to find dates that are greater than New in each of the other columns. For example, in the code below I look for all cases where New is equal to Accepted.

df1 = df.loc[(df['New'] == df['Accepted']) &
             ((df['Expired'].isnull()) | (df['Expired'] >= df['Accepted'])) &
             ((df['Pending'].isnull()) | (df['Pending'] >= df['Accepted'])) &
             ((df['Denied'].isnull()) | (df['Denied'] >= df['Accepted']))]

newtoaccsday = df1.loc[:, ('Group', 'Accepted')]
newtoaccsday['Date'] = newtoaccsday['Accepted']
newtoaccsday = newtoaccsday.reset_index(drop=True)
newtoaccsday = newtoaccsday.groupby(['Date', 'Group'], as_index=False)['Accepted'].value_counts()
newtoaccsday.drop(columns=['Accepted'], inplace=True)
newtoaccsday.rename(columns={'count': 'NewAccSDay'}, inplace=True)
newtoaccsday['Date'] = newtoaccsday['Date'] + timedelta(1)
df_count = df_count.merge(newtoaccsday, how='left', on=['Date', 'Group']).fillna(0)


After doing the above steps for all conditions (where New goes to Accepted on a later date, etc.), I will do the final calculation for New:

df_count['New'] = df_count.eval('New = New - (NewAccSDay + NewAccLater + NewDenSDay + NewDenLater + NewExpLater + NewPendSDay + NewPendLater)').groupby(['Tier2_ID', 'ClaimType'])['New'].cumsum()

Any and all help would be greatly appreciated. My method above is extremely inefficient and is leading to some errors. Do I need to write a for loop for this? What is the best way to go about this?

CodePudding user response:

I think this is what you want, or it can easily be modified to fit your needs:

import pandas as pd
import numpy as np
from datetime import timedelta
from datetime import date

def dateRange(d1, d2):
    return [d1 + timedelta(days=x) for x in range((d2 - d1).days)]
    
def addCount(dic,group,dat,cat):
    if group not in dic:
        dic[group]={}
    if dat not in dic[group]:
        dic[group][dat]={}
    if cat not in dic[group][dat]:
        dic[group][dat][cat]=0
    dic[group][dat][cat] += 1
    
df = pd.read_csv("testdf.csv",
                 parse_dates=["New", "Accepted", "Denied", "Pending", "Expired"])

cdic={}
for i,row in df.iterrows():
    cid=row["ClaimID"]
    dnew=row["New"].date()
    dacc=row["Accepted"].date()
    dden=row["Denied"].date()
    dpen=row["Pending"].date()
    dexp=row["Expired"].date()
    group=row["Group"]
    
    if not pd.isna(dacc): #Claim has been accepted
        if(dnew == dacc):
            dacc += timedelta(days=1)
        nend=dacc
        addCount(cdic,group,dacc,"acc")
    if not pd.isna(dden): # Claim has been denied
        if(dnew == dden):
            dden += timedelta(days=1)
        if pd.isna(dacc):
            nend=dden
        addCount(cdic,group,dden,"den")
    if not pd.isna(dpen):
        addCount(cdic,group,dpen,"pen") # Claim is pending
    if not pd.isna(dexp):
        addCount(cdic,group,dexp,"exp") # Claim is expired
    if pd.isna(dacc) and pd.isna(dden):
        nend = date.today() + timedelta(days=1)
    for d in dateRange(dnew,nend):  # Fill new status until first change
        addCount(cdic,group,d,"new")
ndfl=[]            
for group in cdic:
    for dat in sorted(cdic[group].keys()):
        r=cdic[group][dat]
        ndfl.append([group,dat,r.get("new",0),r.get("acc",0),
              r.get("den",0),r.get("pen",0),r.get("exp",0)])
ndf=pd.DataFrame(ndfl,columns=["Group", "Date","New","Accepted","Denied","Pending","Expired"])
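
One caveat: `ndf` only contains days on which some count changed, while condition 5 asks for every calendar day from the earliest New date through today, per group. A possible post-processing step (sketched here on a toy stand-in for `ndf`, with a fixed end date so the output is deterministic; in practice you would use `date.today()`):

```python
import pandas as pd
from datetime import date

# Toy stand-in for the ndf built above: only "event" days are present.
ndf = pd.DataFrame({
    "Group": ["A", "A"],
    "Date": [date(2021, 1, 1), date(2021, 1, 4)],
    "New": [2, 1], "Accepted": [1, 0], "Denied": [0, 0],
    "Pending": [0, 0], "Expired": [0, 0],
})

frames = []
for g, sub in ndf.groupby("Group"):
    # Full calendar per group; substitute date.today() for the end date.
    days = pd.date_range(sub["Date"].min(), date(2021, 1, 6)).date
    sub = (sub.set_index("Date").reindex(days, fill_value=0)
              .rename_axis("Date").reset_index())
    sub["Group"] = g  # restore the group label on the filled rows
    frames.append(sub)
ndf_full = pd.concat(frames, ignore_index=True)
```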

CodePudding user response:

First convert the date columns with something like

for i in ['New', 'Accepted', 'Denied', 'Pending', 'Expired']:
    df[i] = pd.to_datetime(df[i], format="%Y-%m-%dT%H:%M:%S:%f%z")
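
As a quick sanity check, that format string does match the question's timestamps, including the unusual `:` separator before the milliseconds:

```python
import pandas as pd

# Parse one of the question's timestamps with the proposed format string.
ts = pd.to_datetime("2021-01-01T09:58:35:335Z",
                    format="%Y-%m-%dT%H:%M:%S:%f%z")
# ts is a tz-aware Timestamp with millisecond precision preserved.
```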

Then develop the applicable date range based on your column conditions. In this logic, if Denied is present the range is New --> Denied; otherwise, if Accepted is present, New --> Accepted; otherwise New --> today. Code like the following (alter as per your rules):

from datetime import datetime

df['new_range'] = df[['New', 'Accepted', 'Denied']].apply(
    lambda x: pd.date_range(x['New'], x['Denied']).date.tolist() if pd.notnull(x['Denied'])
    else pd.date_range(x['New'], x['Accepted']).date.tolist() if pd.notnull(x['Accepted'])
    else pd.date_range(x['New'], datetime.today()).date.tolist(),
    axis=1)

You should be able to filter on a group and see the date ranges in your df like:

    df[df['Group']=='A']['new_range']
0                                         [2021-01-01]
1    [2021-01-01, 2021-01-02, 2021-01-03, 2021-01-0...
2                                         [2021-02-14]
3                                         [2021-02-14]
4    [2021-02-14, 2021-02-15, 2021-02-16, 2021-02-1..

Then you can explode the date ranges and group on counts to get the new counts for each day with code like:

new = pd.to_datetime(df[df['Group'] == 'A']['new_range'].explode()).reset_index()

newc = new.groupby('new_range').count()
newc

new_range
2021-01-01    2
2021-01-02    1
2021-01-03    1
2021-01-04    1
2021-01-05    1
2021-01-06    1...

Similarly, get counts for Accepted and Denied, then left join on date to arrive at the final table, filling NA with 0.

By creating rules to expand your date range, then exploding over the date range and grouping to get your counts, you should be able to avoid much of the expensive row-wise work.
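
The final join described above could look like this sketch, using toy per-status daily counts (the `newc`/`accc` values here are illustrative, not computed from the question's data):

```python
import pandas as pd

# Toy per-status daily counts, as Series indexed by date.
idx = pd.date_range("2021-01-01", "2021-01-04", name="Date")
newc = pd.Series({pd.Timestamp("2021-01-01"): 2,
                  pd.Timestamp("2021-01-02"): 1}, name="New")
accc = pd.Series({pd.Timestamp("2021-01-01"): 1}, name="Accepted")

# Align the status counts, stretch to the full calendar, fill gaps with 0.
final = (pd.concat([newc, accc], axis=1)
           .reindex(idx)
           .fillna(0)
           .astype(int)
           .reset_index())
```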
