I have these two initial tables:
Table1:
CustID | StartTime | EndTime | Area |
---|---|---|---|
1 | 12/1/2022 4:00:00 PM | 12/1/2022 4:05:00 PM | ABC |
2 | 12/1/2022 4:02:00 PM | 12/1/2022 4:03:00 PM | ABC |
Table2:
Area | StartTime | EndTime |
---|---|---|
ABC | 12/1/2022 4:01:26 PM | 12/1/2022 4:02:00 PM |
ABC | 12/1/2022 4:02:05 PM | 12/1/2022 4:02:55 PM |
ABC | 12/1/2022 4:04:10 PM | 12/1/2022 4:05:00 PM |
I need to end up with this:
Table3:
CustID | StartTime | EndTime | Area | #ofRecords |
---|---|---|---|---|
1 | 12/1/2022 4:00:00 PM | 12/1/2022 4:05:00 PM | ABC | 3 |
2 | 12/1/2022 4:02:00 PM | 12/1/2022 4:03:00 PM | ABC | 1 |
Would need an efficient way to merge the tables as the volume of rows is huge.
Using the StartTime, EndTime and Area from Table1, to determine how many records of those conditions are in Table2. Then adding a column to show this number of records.
I've tried left joining tables but its taking too long.
CodePudding user response:
This might not be the most efficient way, but is one of the possible ways to achieve your goal.
I defined a function to calculate the number of records #ofRecords
and used apply()
for it, as follows:
import pandas as pd
df1 = pd.DataFrame({
'CustID': [1, 2],
'StartTime': ['12/1/2022 4:00:00 PM', '12/1/2022 4:02:00 PM'],
'EndTime': ['12/1/2022 4:05:00 PM', '12/1/2022 4:03:00 PM'],
'Area': ['ABC', 'ABC'],
})
df2 = pd.DataFrame({
'Area': ['ABC', 'ABC', 'ABC'],
'StartTime': ['12/1/2022 4:01:26 PM', '12/1/2022 4:02:05 PM', '12/1/2022 4:04:10 PM'],
'EndTime': ['12/1/2022 4:02:00 PM', '12/1/2022 4:02:55 PM', '12/1/2022 4:05:00 PM'],
})
print(df1)
print(df2)
"""
CustID StartTime EndTime Area
0 1 12/1/2022 4:00:00 PM 12/1/2022 4:05:00 PM ABC
1 2 12/1/2022 4:02:00 PM 12/1/2022 4:03:00 PM ABC
Area StartTime EndTime
0 ABC 12/1/2022 4:01:26 PM 12/1/2022 4:02:00 PM
1 ABC 12/1/2022 4:02:05 PM 12/1/2022 4:02:55 PM
2 ABC 12/1/2022 4:04:10 PM 12/1/2022 4:05:00 PM
"""
# Convert a type of the StartTime and EndTime columns to datetime type
df1['StartTime'] = pd.to_datetime(df1['StartTime'])
df1['EndTime'] = pd.to_datetime(df1['EndTime'])
df2['StartTime'] = pd.to_datetime(df2['StartTime'])
df2['EndTime'] = pd.to_datetime(df2['EndTime'])
def row_counter(st, et):
# Calculate the number of records
counter = 0
for i, row in df2.iterrows():
if (st <= row['StartTime']) and (et >= row['EndTime']):
counter = 1
return counter
df3 = df1.copy() # create a new dataframe by coping df1
df3['#ofRecords'] = df3.apply(lambda x: row_counter(x['StartTime'], x['EndTime']), axis=1)
print(df3)
"""
CustID StartTime EndTime Area #ofRecords
0 1 2022-12-01 16:00:00 2022-12-01 16:05:00 ABC 3
1 2 2022-12-01 16:02:00 2022-12-01 16:03:00 ABC 1
"""
CodePudding user response:
assuming the first table is loaded in df1 and the second one in df2, and StartDate and EndDate are of datetime type you can do as follow:
df1['#ofRecords'] = None
for i, rows in df1.iterrows():
count = df2[(df2['StartTime']>= rows.StartTime)&(df2['EndTime']<= rows.EndTime)].shape[0]
df1['#ofRecords'].loc[i] = count