comparing two dataframes on the same date range python-CodePudding

I have two dataframes and I want to compare df2 with df1 in order to show the "id_number missing" that are not in df1 but are in df2.

I have a script where I end up with more missing values than expected because I am not comparing the two dataframes on the same date range.

I want to compare the two dataframes on the same date range. For example I don't wan't the id_number "BXhshhsd589" to be shown in the result because the shared date range between the two tables starts from 5/09/22 and end on 10/09/22 (and not beyond). I was looking for an automatic way to find these "shared date range" without having to manually write them.

How can I do this?

Here is my code:

import pandas as pd

data1 = {'date':  ['5/09/22', '7/09/22', '7/09/22','10/09/22'],
        'second_column': ['first_value', 'second_value', 'third_value','fourth_value'],
         'id_number':['AA576bdk89', 'GG6jabkhd589', 'BXV6jabd589','BXzadzd589'],
        'fourth_column':['first_value', 'second_value', 'third_value','fourth_value'],
        }

data2 = {'date':  ['5/09/22', '7/09/22', '7/09/22', '7/09/22', '7/09/22', '11/09/22'],
        'second_column': ['first_value', 'second_value', 'third_value','fourth_value', 'fifth_value','sixth_value'],
         'id_number':['AA576bdk89', 'GG6jabkhd589', 'BXV6jabd589','BXV6mkjdd589','GGdbkz589', 'BXhshhsd589'],
        'fourth_column':['first_value', 'second_value', 'third_value','fourth_value', 'fifth_value','sixth_value'],
        }

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)


m = df2.merge(df1, on=['date','id_number'], how='left', indicator=True)['_merge'].ne('both')
df = df2[m]
print (df)

CodePudding user response：

Convert your date columns to a datetime object and then compare the .min() and .max() values of the columns in the dataframes:

df1.loc[:,'date'], df2.loc[:,'date'] = (pd.to_datetime(col) for col in [df1.date, df2.date])
df1_cut = df1[df1.date.ge(df2.date.min()) & df1.date.le(df2.date.max())]
df2_cut = df2[df2.date.ge(df1.date.min()) & df2.date.le(df1.date.max())]
merged = df2_cut.merge(df1_cut, on=['date','id_number'], how='left', indicator=True)['_merge'].ne('both')
df = df2_cut[merged]
print(df)