This is a very general question. Is it possible for an inner join in pandas to produce a merged dataset with more observations than the larger of the two input datasets? In other words, if I have one dataset with 30181537 observations and another with 23483111 observations, how can the result have 112039626 observations after an inner merge on a variable v1? Variable v1 contains duplicates in both datasets.
Thanks
CodePudding user response:
Because v1 contains duplicates in both datasets, each v1 value produces i * j rows in the merged result, where i is the number of rows with that value in dataframe A and j is the number of rows with that value in dataframe B. Summed over all shared values, the total can easily exceed the size of either input.
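A small example makes the multiplication concrete. This is only a sketch with made-up data: the columns a and b and all the values are invented for illustration, and only v1 comes from the question.

```python
import pandas as pd

# Toy frames (hypothetical data); "x" is duplicated in both.
df_A = pd.DataFrame({"v1": ["x", "x", "x", "y"], "a": [1, 2, 3, 4]})
df_B = pd.DataFrame({"v1": ["x", "x", "y", "z"], "b": [10, 20, 30, 40]})

merged = df_A.merge(df_B, on="v1", how="inner")

# "x": 3 rows in A * 2 rows in B = 6 rows
# "y": 1 row in A * 1 row in B = 1 row
# "z": only in B, so it is dropped by the inner join
print(len(df_A), len(df_B), len(merged))  # 4 4 7
```

Both inputs have 4 rows, yet the inner merge returns 7, which is the same effect you are seeing at a much larger scale.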
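If you don't want this row multiplication, you can drop the duplicate keys before merging (note that this keeps only the first row for each v1 value and discards the others):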
df_A = df_A.drop_duplicates(subset=['v1'])
df_B = df_B.drop_duplicates(subset=['v1'])
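Before dropping anything, it may help to check how many rows the merge should produce and whether the key really is unique. The sketch below assumes v1 has no missing values (value_counts ignores NaN by default, while merge matches NaN keys to each other). The helper name expected_inner_rows and the toy data are hypothetical; validate is an existing parameter of DataFrame.merge.

```python
import pandas as pd
from pandas.errors import MergeError

def expected_inner_rows(left, right, key):
    """Predict the row count of an inner merge on `key` (assumes no NaN keys)."""
    left_counts = left[key].value_counts()
    right_counts = right[key].value_counts()
    # Per-value product i * j; values present on one side only contribute 0.
    return int(left_counts.mul(right_counts, fill_value=0).sum())

# Toy frames (hypothetical data)
df_A = pd.DataFrame({"v1": [1, 1, 2, 3]})
df_B = pd.DataFrame({"v1": [1, 1, 1, 2, 4]})

predicted = expected_inner_rows(df_A, df_B, "v1")
actual = len(df_A.merge(df_B, on="v1", how="inner"))
print(predicted, actual)  # 7 7

# validate makes pandas raise instead of silently multiplying rows
try:
    df_A.merge(df_B, on="v1", how="inner", validate="one_to_one")
    raised = False
except MergeError:
    raised = True
print(raised)  # True
```

If the predicted count matches your 112039626 rows, the merge is behaving as designed and the real question is which duplicate rows you want to keep.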