Counting consecutive datetimes-CodePudding

I'm having trouble counting the total Record IDs where Start and End for each row are consecutive. Consecutive means when a row starts before the previous row ends and Name == Name. Record IDs 1-3 are consecutive because they overlap and have consecutive start/end datetimes.

I only want to display TRUE where total consecutive conflicts > = 3, else FALSE.

Sample Data:

Record ID	Record Name	Record Start	Record End
1	SMITH, JOHN	10/20/20 8:00 AM	10/20/20 9:30 AM
2	SMITH, JOHN	10/20/20 9:20 AM	10/20/20 10:30 AM
3	SMITH, JOHN	10/20/20 10:20 AM	10/20/20 11:00 AM
4	COOPER, ALLEN	10/20/20 1:00 PM	10/20/20 2:15 PM
5	PEREZ, HILL	10/20/20 3:15 PM	10/20/20 4:00 PM

Current Output:

Record ID	Record Name	Record Start	Record End	Overlap?	Total Records Consecutive >=3
1	SMITH, JOHN	10/20/20 8:00 AM	10/20/20 9:30 AM	TRUE	FALSE
2	SMITH, JOHN	10/20/20 9:20 AM	10/20/20 10:30 AM	TRUE	FALSE
3	SMITH, JOHN	10/20/20 10:20 AM	10/20/20 11:00 AM	TRUE	FALSE
4	SMITH, JOHN	10/20/20 1:00 PM	10/20/20 2:15 PM	TRUE	FALSE
5	SMITH, JOHN	10/20/20 2:00 PM	10/20/20 4:00 PM	TRUE	FALSE

Code Below:

import pandas as pd
import io

#SAMPLE DATA 1 DF
df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/20/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/20/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/20/20 10:20 AM;10/20/20 11:00 AM
4;SMITH, JOHN ;10/20/20 1:00 AM;10/20/20 2:15 PM
5;SMITH, JOHN;10/20/20 2:00 PM;10/20/20 4:00 PM
"""),sep=';')

# SAMPLE DATA 2 DF
df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/4/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/4/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/4/20 11:20 AM;10/20/20 12:00 PM
4;SMITH, JOHN ;10/4/20 1:00 PM;10/20/20 2:15 PM
5;SMITH, JOHN;10/4/20 3:15 PM;10/20/20 4:00 PM
"""),sep=';')

df['Start'] = df['Record Start']
df['End'] = df['Record End']

df['Start'] = pd.to_datetime(df['Start'], errors='coerce')
df['End'] = pd.to_datetime(df['End'], errors='coerce')

df['overlap?'] =  False

for i, row in df.iterrows():
    start, end = "Start", "End"
    df.loc[(df['Record Name'] == row['Record Name'])
            & (((df[start] > row[start]) & (df[start] < row[end])) |
               ((df[end] > row[start]) & (df[end] < row[end])) |
               ((row[end] > df[start]) & (row[end] < df[end])) |
               ((row[start] > df[start]) & (row[start] < df[end]))), 'overlap?'] = True

    df['consecutive overlap?'] = df.groupby(['Record Name', 'Record Start'])['overlap?'].transform('count')

print(df)

Expected Output for Sample Data 1:
   Record ID     Record Name  ... overlap? total records consec >=3?
0          1     SMITH, JOHN  ...     True                    True
1          2     SMITH, JOHN  ...     True                    True
2          3     SMITH, JOHN  ...     True                    True
3          4    SMITH, JOHN   ...     True                    False
4          5     SMITH, JOHN  ...     True                    False



Expected Output for Sample Data 2:

   Record ID   Record Name  ... overlap? total records consec >=3?
0          1   SMITH, JOHN  ...     True                    False
1          2   SMITH, JOHN  ...     True                    False
2          3   SMITH, JOHN  ...     True                    False
3          4  SMITH, JOHN   ...    False                    False
4          5   SMITH, JOHN  ...     True                    False

This gives a false positive. Its just grouping by Name and Date and counting the number of overlaps. But doesn't look at if those overlaps are consecutive.

CodePudding user response：

use dataframe.loc to get the current row and previous row then add one to the previous row count column if the date is equal else if not equal then set count to 1. filter all rows in the dataframe with count greater than 3. You can build a running total based on the name and date also.

I used timedelta frequently in my solution. I created a start end of day variable, a end start of day variable to find the minute date times for the start and end day. Next I find the days between the start and end day and create their minute datetimes.

df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/20/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/20/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/20/20 10:20 AM;10/20/20 11:00 AM
4;COOPER, ALLEN;10/20/20 1:00 PM;10/20/20 2:15 PM
5;PEREZ, HILL;10/20/20 3:15 PM;10/20/20 4:00 PM
6;SMITH, JOHN;10/4/21 8:00 AM;10/20/21 9:30 AM
7;SMITH, JOHN;10/4/21 9:20 AM;10/20/21 10:30 AM
8;SMITH, JOHN;10/4/21 11:20 AM;10/20/21 12:00 PM
9;SMITH, JOHN ;10/4/21 1:00 PM;10/20/21 2:15 PM
10;SMITH, JOHN;10/4/21 3:15 PM;10/20/21 4:00 PM
"""),sep=';')

df['Record Start']=pd.to_datetime(df['Record Start'])
df['Record End']=pd.to_datetime(df['Record End'])
def create_datetime(date,hour,minute,second):
    month=date.month
    day=date.day
    year=date.year
    #month = int(txt.split("/")[0])
    #day = int(txt.split("/")[1])
    #year = int("20" txt.split("/")[2].split(" ")[0])
    # datetime(year, month, day, hour, minute, second, microsecond)
    return datetime(year=year,month=month,day=day,hour=hour,minute=minute,second=second,microsecond=0)
def get_minutes(row):
    #start = datetime.strptime(row['Record Start'], '%m/%d/%y %H:%M %p')
    start=row['Record Start']
    start_end_of_day=create_datetime(row['Record Start'],23,59,59)

    #end = datetime.strptime(row['Record End'], '%m/%d/%y %H:%M %p')
    end = row['Record End']
    end_beginning_of_day=create_datetime(row['Record End'], 0,0,0)        
    num_of_days=(end-start).days
    if num_of_days==1:
        results=[start   timedelta(minutes=x) for x in range(0, round((end-start).total_seconds()//60) 1)]
    elif num_of_days==2:
        results=[start   timedelta(minutes=x) for x in range(0, round((start_end_of_day-start).total_seconds()//60) 1)]
        results =[end_beginning_of_day  timedelta(minutes=x) for x in range(0, round((end_beginning_of_day-end).total_seconds()//60.0))]
    else:
        results=[start   timedelta(minutes=x) for x in range(0, round((start_end_of_day-start).total_seconds()//60) 1)]
        #find the minutes between the start and end not include start and end in the range
        dates=pd.date_range(start timedelta(days=1),end-timedelta(days=1))
        for date in dates:
            date=date.date()
            date=pd.to_datetime(date) timedelta(hours=0,minutes=0,seconds=0)
            results =[date timedelta(minutes=x) for x in range(0, 59 1)]
        results =[end_beginning_of_day  timedelta(minutes=x) for x in range(0, round((end-end_beginning_of_day).total_seconds()//60) 1)]
    
    #for item in results:
    #    print(item)
    #sys.exit()
    return results

df['minutes'] = df.apply(get_minutes, axis=1)

def intersection(lst1, lst2):
    return list(set(lst1) & set(lst2))

prev_row=None
for index,row in df.iterrows():
    if index==0:
        df.loc[index,'Count']=1
    else:
        prev_row=df.iloc[index-1]
        
    if not prev_row is None:
        if prev_row['Record Name']==row['Record Name']:
            count=prev_row['Count']
            lst1=row['minutes']
            lst2=prev_row['minutes']
            if len(intersection(lst1,lst2))>0:
                df.loc[index,'Count']=count 1
            else:
                df.loc[index,'Count']=1
        else:
            df.loc[index,'Count']=1
        
    #print(df[df['Count']>=3])   
    print(df)

output:

 Record ID    Record Name       Record Start         Record End  Count
 0          1    SMITH, JOHN   10/20/20 8:00 AM   10/20/20 9:30 AM         1.0
 1          2    SMITH, JOHN   10/20/20 9:20 AM  10/20/20 10:30 AM    2.0
 2          3    SMITH, JOHN  10/20/20 10:20 AM  10/20/20 11:00 AM    3.0
 3          4  COOPER, ALLEN   10/20/20 1:00 PM   10/20/20 2:15 PM    1.0
 4          5    PEREZ, HILL   10/20/20 3:15 PM   10/20/20 4:00 PM    1.0
 5          6    SMITH, JOHN    10/4/21 8:00 AM   10/20/21 9:30 AM    1.0
 6          7    SMITH, JOHN    10/4/21 9:20 AM  10/20/21 10:30 AM    2.0
 7          8    SMITH, JOHN   10/4/21 11:20 AM  10/20/21 12:00 PM    3.0
 8          9   SMITH, JOHN     10/4/21 1:00 PM   10/20/21 2:15 PM    1.0
 9         10    SMITH, JOHN    10/4/21 3:15 PM   10/20/21 4:00 PM    1.0