I have two dataframes:
- The first one was extracted from the manifest database. It contains the value, the route (origin and destination), and the actual SLA:
awb_number route value sla_actual (days)
01 A - B 24,000 2
02 A - C 25,000 3
03 C - B 29,000 5
04 B - D 35,000 6
- The second dataframe contains the route (origin and destination) and the internal SLA (3PL SLA):
route sla_partner (days)
A - B 4
B - A 3
A - C 3
B - D 5
I would like to investigate the gap between the actual SLA and the 3PL SLA, so I join these two dataframes on the route.
I expected the result to look like this:
awb_number route value sla_actual sla_partner
01 A - B 24,000 2 4
02 A - C 25,000 3 3
03 C - B 29,000 5 NaN
04 B - D 35,000 6 5
What I have done is:
df_sla_check = pd.merge(df_actual, df_sla_partner, on = ['route_city_lazada'], how = 'inner')
The first dataframe has 36,000 rows while the second has 20,000, but the result has over 700,000 rows. Is there something wrong with my logic? Shouldn't it return somewhere between 20,000 and 36,000 rows?
Can somebody help me how to do this correctly?
Thank you in advance
CodePudding user response:
Apply a left outer join. I think it will solve the problem.
CodePudding user response:
According to the points raised by @boi-doingthings and @Peddi Santhoshkumar, I would also suggest using a left join, such as the following for your datasets:
df_sla_check = pd.merge(df_actual, df_sla_partner, on=['route'], how='left')
For what you are showing, 'route' may be the appropriate name for your column.
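As a sketch of that suggestion, using small frames mirroring the sample data from the question (column names assumed to match the tables shown): `how='left'` keeps every row of the first dataframe and fills `sla_partner` with NaN for routes that have no partner SLA, such as "C - B".

```python
import pandas as pd

# Toy frames mirroring the question's sample tables
df_actual = pd.DataFrame({
    "awb_number": ["01", "02", "03", "04"],
    "route": ["A - B", "A - C", "C - B", "B - D"],
    "value": [24000, 25000, 29000, 35000],
    "sla_actual": [2, 3, 5, 6],
})
df_sla_partner = pd.DataFrame({
    "route": ["A - B", "B - A", "A - C", "B - D"],
    "sla_partner": [4, 3, 3, 5],
})

# Left join: all 4 rows of df_actual survive; "C - B" gets NaN
df_sla_check = pd.merge(df_actual, df_sla_partner, on="route", how="left")
print(df_sla_check)
```

This only controls which rows are kept, though; it does not by itself prevent row multiplication if a route appears more than once in either frame.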
CodePudding user response:
Please confirm the joining field passed in the on argument.
Further, you should check the number of unique keys on which the join is happening.
The most natural cause of the spike in the joined dataframe is that one record of df1 gets mapped to multiple records of df2, and vice-versa.
df1.route.value_counts()
df2.route.value_counts()
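To illustrate the multiplication effect with made-up duplicates (the frame names here are hypothetical): a key that occurs m times on one side and n times on the other contributes m × n rows to the join, which is how 36,000 × 20,000-row inputs can balloon to 700,000+ rows.

```python
import pandas as pd

# "A - B" appears 3 times on the left and 2 times on the right,
# so the join produces 3 * 2 = 6 rows for that single key
left = pd.DataFrame({"route": ["A - B"] * 3, "sla_actual": [2, 3, 4]})
right = pd.DataFrame({"route": ["A - B"] * 2, "sla_partner": [4, 5]})

merged = pd.merge(left, right, on="route", how="inner")
print(len(merged))  # 6

# value_counts() on the key column reveals such duplicates
print(left["route"].value_counts())

# If each route should have exactly one partner SLA, deduplicate the
# lookup side first; validate="many_to_one" makes pandas raise if the
# right side still has duplicate keys
right_unique = right.drop_duplicates(subset="route")
merged_ok = pd.merge(left, right_unique, on="route",
                     how="left", validate="many_to_one")
print(len(merged_ok))  # 3
```

Deduplicating (or aggregating) the partner-SLA table before merging is usually what restores the expected row count.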
An alternative is to change the how parameter to 'left'.