Home > Blockchain >  Merge number of records to dataframe based on end date and start date of individual record
Merge number of records to dataframe based on end date and start date of individual record

Time:12-29

I have these two initial tables:

Table1:

CustID StartTime EndTime Area
1 12/1/2022 4:00:00 PM 12/1/2022 4:05:00 PM ABC
2 12/1/2022 4:02:00 PM 12/1/2022 4:03:00 PM ABC

Table2:

Area StartTime EndTime
ABC 12/1/2022 4:01:26 PM 12/1/2022 4:02:00 PM
ABC 12/1/2022 4:02:05 PM 12/1/2022 4:02:55 PM
ABC 12/1/2022 4:04:10 PM 12/1/2022 4:05:00 PM

I need to end up with this:

Table3:

CustID StartTime EndTime Area #ofRecords
1 12/1/2022 4:00:00 PM 12/1/2022 4:05:00 PM ABC 3
2 12/1/2022 4:02:00 PM 12/1/2022 4:03:00 PM ABC 1

Would need an efficient way to merge the tables as the volume of rows is huge.

Using the StartTime, EndTime and Area from Table1, to determine how many records of those conditions are in Table2. Then adding a column to show this number of records.

I've tried left joining tables but its taking too long.

CodePudding user response:

This might not be the most efficient way, but is one of the possible ways to achieve your goal.

I defined a function to calculate the number of records #ofRecords and used apply() for it, as follows:

import pandas as pd

df1 = pd.DataFrame({
    'CustID': [1, 2],
    'StartTime': ['12/1/2022 4:00:00 PM', '12/1/2022 4:02:00 PM'],
    'EndTime': ['12/1/2022 4:05:00 PM', '12/1/2022 4:03:00 PM'],
    'Area': ['ABC', 'ABC'],
})

df2 = pd.DataFrame({
    'Area': ['ABC', 'ABC', 'ABC'],
    'StartTime': ['12/1/2022 4:01:26 PM', '12/1/2022 4:02:05 PM', '12/1/2022 4:04:10 PM'],
    'EndTime': ['12/1/2022 4:02:00 PM', '12/1/2022 4:02:55 PM', '12/1/2022 4:05:00 PM'],
})

print(df1)
print(df2)

"""
   CustID             StartTime               EndTime Area
0       1  12/1/2022 4:00:00 PM  12/1/2022 4:05:00 PM  ABC
1       2  12/1/2022 4:02:00 PM  12/1/2022 4:03:00 PM  ABC

  Area             StartTime               EndTime
0  ABC  12/1/2022 4:01:26 PM  12/1/2022 4:02:00 PM
1  ABC  12/1/2022 4:02:05 PM  12/1/2022 4:02:55 PM
2  ABC  12/1/2022 4:04:10 PM  12/1/2022 4:05:00 PM
"""

# Convert a type of the StartTime and EndTime columns to datetime type
df1['StartTime'] = pd.to_datetime(df1['StartTime'])
df1['EndTime'] = pd.to_datetime(df1['EndTime'])
df2['StartTime'] = pd.to_datetime(df2['StartTime'])
df2['EndTime'] = pd.to_datetime(df2['EndTime'])


def row_counter(st, et):
    # Calculate the number of records
    counter = 0
    for i, row in df2.iterrows():
        if (st <= row['StartTime']) and (et >= row['EndTime']):
            counter  = 1
    return counter


df3 = df1.copy()  # create a new dataframe by coping df1
df3['#ofRecords'] = df3.apply(lambda x: row_counter(x['StartTime'], x['EndTime']), axis=1)
print(df3)

"""
   CustID           StartTime             EndTime Area  #ofRecords
0       1 2022-12-01 16:00:00 2022-12-01 16:05:00  ABC           3
1       2 2022-12-01 16:02:00 2022-12-01 16:03:00  ABC           1
"""

CodePudding user response:

assuming the first table is loaded in df1 and the second one in df2, and StartDate and EndDate are of datetime type you can do as follow:

df1['#ofRecords'] = None
for i, rows in df1.iterrows():
    count = df2[(df2['StartTime']>= rows.StartTime)&(df2['EndTime']<= rows.EndTime)].shape[0]
    df1['#ofRecords'].loc[i] = count
  • Related