Inner join in pandas-CodePudding

I have two dataframes:

The first one was extracted from the manifest database.The data explains about the value, route (origin and destination), and also the actual SLA

awb_number   route    value       sla_actual (days)
01           A - B    24,000          2
02           A - C    25,000          3
03           C - B    29,000          5
04           B - D    35,000          6

The second dataframe explains about the route (origin and destination) and also internal SLA (3PL SLA).

route       sla_partner (days)
A - B           4
B - A           3
A - C           3
B - D           5

I would like to investigate the gap between the SLA actual and 3PL SLA, so what I do is to join these two dataframes based on the routes.

I supposed the result would be like this:

awb_number    route    value      sla_actual     sla_partner
01            A - B    24,000         2             4
02            A - C    25,000         3             3
03            C - B    29,000         5             NaN
04            B - D    35,000         6             5

What I have done is:

df_sla_check = pd.merge(df_actual, df_sla_partner, on = ['route_city_lazada'], how = 'inner')

The first dataframe has 36,000 rows while the second dataframe has 20,000 rows, but the result returns over 700,000 rows. Is there something wrong with my logic? Isn't it supposed to return around 20,000 rows - 36,000 rows?

Can somebody help me how to do this correctly?

Thank you in advance

CodePudding user response：

Apply Left outer Join. I think it will solve the problem.

CodePudding user response：

According to the points raised by @boi-doingthings and @Peddi Santhoshkumar, I would also suggest to use left joiner, such as the following for your datasets:

df_sla_check = pd.merge(df_actual, df_sla_partner, on=['route'], how='left')

For what you are showing, 'route' may be the appropriate name for your column.

CodePudding user response：

Please confirm the joining field passed in on argument. Further, you should check the number of unique keys on which the join is happening. The most natural cause of the spike in the joined dataframe is that one record of df1 gets mapped to multiple records of df2 and vice-versa.

df1.route.value_counts()
df2.route.value_counts()

The alternate way is to change how parameter to 'left'.