I'm having trouble counting the total Record IDs where Start and End for each row are consecutive. Consecutive means when a row starts before the previous row ends and Name == Name. Record IDs 1-3 are consecutive because they overlap and have consecutive start/end datetimes.
I only want to display TRUE where total consecutive conflicts > = 3, else FALSE.
Sample Data:
Record ID | Record Name | Record Start | Record End |
---|---|---|---|
1 | SMITH, JOHN | 10/20/20 8:00 AM | 10/20/20 9:30 AM |
2 | SMITH, JOHN | 10/20/20 9:20 AM | 10/20/20 10:30 AM |
3 | SMITH, JOHN | 10/20/20 10:20 AM | 10/20/20 11:00 AM |
4 | COOPER, ALLEN | 10/20/20 1:00 PM | 10/20/20 2:15 PM |
5 | PEREZ, HILL | 10/20/20 3:15 PM | 10/20/20 4:00 PM |
Current Output:
Record ID | Record Name | Record Start | Record End | Overlap? | Total Records Consecutive >=3 |
---|---|---|---|---|---|
1 | SMITH, JOHN | 10/20/20 8:00 AM | 10/20/20 9:30 AM | TRUE | FALSE |
2 | SMITH, JOHN | 10/20/20 9:20 AM | 10/20/20 10:30 AM | TRUE | FALSE |
3 | SMITH, JOHN | 10/20/20 10:20 AM | 10/20/20 11:00 AM | TRUE | FALSE |
4 | SMITH, JOHN | 10/20/20 1:00 PM | 10/20/20 2:15 PM | TRUE | FALSE |
5 | SMITH, JOHN | 10/20/20 2:00 PM | 10/20/20 4:00 PM | TRUE | FALSE |
Code Below:
import pandas as pd
import io
#SAMPLE DATA 1 DF
df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/20/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/20/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/20/20 10:20 AM;10/20/20 11:00 AM
4;SMITH, JOHN ;10/20/20 1:00 AM;10/20/20 2:15 PM
5;SMITH, JOHN;10/20/20 2:00 PM;10/20/20 4:00 PM
"""),sep=';')
# SAMPLE DATA 2 DF
df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/4/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/4/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/4/20 11:20 AM;10/20/20 12:00 PM
4;SMITH, JOHN ;10/4/20 1:00 PM;10/20/20 2:15 PM
5;SMITH, JOHN;10/4/20 3:15 PM;10/20/20 4:00 PM
"""),sep=';')
df['Start'] = df['Record Start']
df['End'] = df['Record End']
df['Start'] = pd.to_datetime(df['Start'], errors='coerce')
df['End'] = pd.to_datetime(df['End'], errors='coerce')
df['overlap?'] = False
for i, row in df.iterrows():
start, end = "Start", "End"
df.loc[(df['Record Name'] == row['Record Name'])
& (((df[start] > row[start]) & (df[start] < row[end])) |
((df[end] > row[start]) & (df[end] < row[end])) |
((row[end] > df[start]) & (row[end] < df[end])) |
((row[start] > df[start]) & (row[start] < df[end]))), 'overlap?'] = True
df['consecutive overlap?'] = df.groupby(['Record Name', 'Record Start'])['overlap?'].transform('count')
print(df)
Expected Output for Sample Data 1:
Record ID Record Name ... overlap? total records consec >=3?
0 1 SMITH, JOHN ... True True
1 2 SMITH, JOHN ... True True
2 3 SMITH, JOHN ... True True
3 4 SMITH, JOHN ... True False
4 5 SMITH, JOHN ... True False
Expected Output for Sample Data 2:
Record ID Record Name ... overlap? total records consec >=3?
0 1 SMITH, JOHN ... True False
1 2 SMITH, JOHN ... True False
2 3 SMITH, JOHN ... True False
3 4 SMITH, JOHN ... False False
4 5 SMITH, JOHN ... True False
This gives a false positive. Its just grouping by Name and Date and counting the number of overlaps. But doesn't look at if those overlaps are consecutive.
CodePudding user response:
use dataframe.loc to get the current row and previous row then add one to the previous row count column if the date is equal else if not equal then set count to 1. filter all rows in the dataframe with count greater than 3. You can build a running total based on the name and date also.
I used timedelta frequently in my solution. I created a start end of day variable, a end start of day variable to find the minute date times for the start and end day. Next I find the days between the start and end day and create their minute datetimes.
df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/20/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/20/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/20/20 10:20 AM;10/20/20 11:00 AM
4;COOPER, ALLEN;10/20/20 1:00 PM;10/20/20 2:15 PM
5;PEREZ, HILL;10/20/20 3:15 PM;10/20/20 4:00 PM
6;SMITH, JOHN;10/4/21 8:00 AM;10/20/21 9:30 AM
7;SMITH, JOHN;10/4/21 9:20 AM;10/20/21 10:30 AM
8;SMITH, JOHN;10/4/21 11:20 AM;10/20/21 12:00 PM
9;SMITH, JOHN ;10/4/21 1:00 PM;10/20/21 2:15 PM
10;SMITH, JOHN;10/4/21 3:15 PM;10/20/21 4:00 PM
"""),sep=';')
df['Record Start']=pd.to_datetime(df['Record Start'])
df['Record End']=pd.to_datetime(df['Record End'])
def create_datetime(date,hour,minute,second):
month=date.month
day=date.day
year=date.year
#month = int(txt.split("/")[0])
#day = int(txt.split("/")[1])
#year = int("20" txt.split("/")[2].split(" ")[0])
# datetime(year, month, day, hour, minute, second, microsecond)
return datetime(year=year,month=month,day=day,hour=hour,minute=minute,second=second,microsecond=0)
def get_minutes(row):
#start = datetime.strptime(row['Record Start'], '%m/%d/%y %H:%M %p')
start=row['Record Start']
start_end_of_day=create_datetime(row['Record Start'],23,59,59)
#end = datetime.strptime(row['Record End'], '%m/%d/%y %H:%M %p')
end = row['Record End']
end_beginning_of_day=create_datetime(row['Record End'], 0,0,0)
num_of_days=(end-start).days
if num_of_days==1:
results=[start timedelta(minutes=x) for x in range(0, round((end-start).total_seconds()//60) 1)]
elif num_of_days==2:
results=[start timedelta(minutes=x) for x in range(0, round((start_end_of_day-start).total_seconds()//60) 1)]
results =[end_beginning_of_day timedelta(minutes=x) for x in range(0, round((end_beginning_of_day-end).total_seconds()//60.0))]
else:
results=[start timedelta(minutes=x) for x in range(0, round((start_end_of_day-start).total_seconds()//60) 1)]
#find the minutes between the start and end not include start and end in the range
dates=pd.date_range(start timedelta(days=1),end-timedelta(days=1))
for date in dates:
date=date.date()
date=pd.to_datetime(date) timedelta(hours=0,minutes=0,seconds=0)
results =[date timedelta(minutes=x) for x in range(0, 59 1)]
results =[end_beginning_of_day timedelta(minutes=x) for x in range(0, round((end-end_beginning_of_day).total_seconds()//60) 1)]
#for item in results:
# print(item)
#sys.exit()
return results
df['minutes'] = df.apply(get_minutes, axis=1)
def intersection(lst1, lst2):
return list(set(lst1) & set(lst2))
prev_row=None
for index,row in df.iterrows():
if index==0:
df.loc[index,'Count']=1
else:
prev_row=df.iloc[index-1]
if not prev_row is None:
if prev_row['Record Name']==row['Record Name']:
count=prev_row['Count']
lst1=row['minutes']
lst2=prev_row['minutes']
if len(intersection(lst1,lst2))>0:
df.loc[index,'Count']=count 1
else:
df.loc[index,'Count']=1
else:
df.loc[index,'Count']=1
#print(df[df['Count']>=3])
print(df)
output:
Record ID Record Name Record Start Record End Count
0 1 SMITH, JOHN 10/20/20 8:00 AM 10/20/20 9:30 AM 1.0
1 2 SMITH, JOHN 10/20/20 9:20 AM 10/20/20 10:30 AM 2.0
2 3 SMITH, JOHN 10/20/20 10:20 AM 10/20/20 11:00 AM 3.0
3 4 COOPER, ALLEN 10/20/20 1:00 PM 10/20/20 2:15 PM 1.0
4 5 PEREZ, HILL 10/20/20 3:15 PM 10/20/20 4:00 PM 1.0
5 6 SMITH, JOHN 10/4/21 8:00 AM 10/20/21 9:30 AM 1.0
6 7 SMITH, JOHN 10/4/21 9:20 AM 10/20/21 10:30 AM 2.0
7 8 SMITH, JOHN 10/4/21 11:20 AM 10/20/21 12:00 PM 3.0
8 9 SMITH, JOHN 10/4/21 1:00 PM 10/20/21 2:15 PM 1.0
9 10 SMITH, JOHN 10/4/21 3:15 PM 10/20/21 4:00 PM 1.0