I have two dataframes and I want to compare df2 with df1 in order to show the "id_number missing" that are not in df1 but are in df2.
I have a script where I end up with more missing values than expected because I am not comparing the two dataframes on the same date range.
I want to compare the two dataframes on the same date range. For example I don't wan't the id_number "BXhshhsd589" to be shown in the result because the shared date range between the two tables starts from 5/09/22 and end on 10/09/22 (and not beyond). I was looking for an automatic way to find these "shared date range" without having to manually write them.
How can I do this?
Here is my code:
import pandas as pd
data1 = {'date': ['5/09/22', '7/09/22', '7/09/22','10/09/22'],
'second_column': ['first_value', 'second_value', 'third_value','fourth_value'],
'id_number':['AA576bdk89', 'GG6jabkhd589', 'BXV6jabd589','BXzadzd589'],
'fourth_column':['first_value', 'second_value', 'third_value','fourth_value'],
}
data2 = {'date': ['5/09/22', '7/09/22', '7/09/22', '7/09/22', '7/09/22', '11/09/22'],
'second_column': ['first_value', 'second_value', 'third_value','fourth_value', 'fifth_value','sixth_value'],
'id_number':['AA576bdk89', 'GG6jabkhd589', 'BXV6jabd589','BXV6mkjdd589','GGdbkz589', 'BXhshhsd589'],
'fourth_column':['first_value', 'second_value', 'third_value','fourth_value', 'fifth_value','sixth_value'],
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
m = df2.merge(df1, on=['date','id_number'], how='left', indicator=True)['_merge'].ne('both')
df = df2[m]
print (df)
CodePudding user response:
Convert your date
columns to a datetime
object and then compare the .min()
and .max()
values of the columns in the dataframes:
df1.loc[:,'date'], df2.loc[:,'date'] = (pd.to_datetime(col) for col in [df1.date, df2.date])
df1_cut = df1[df1.date.ge(df2.date.min()) & df1.date.le(df2.date.max())]
df2_cut = df2[df2.date.ge(df1.date.min()) & df2.date.le(df1.date.max())]
merged = df2_cut.merge(df1_cut, on=['date','id_number'], how='left', indicator=True)['_merge'].ne('both')
df = df2_cut[merged]
print(df)