I have the following two dataframes with different size.
df1: (with 40,000 data)
Date Latitude Longitude Car_Accident
2/10/22 43.69 -79.4 40
2/10/22 43.69 -79.4 34
2/20/22 43.68 -79.0 30
2/20/22 43.50 -79.1 35
............................
..............................
df2: (with 20,000 data)
Date Latitude Longitude Route
2/20/22 43.68 -79.0 111
2/20/22 43.50 -79.1 123
2/10/22 43.69 -79.4 124
2/10/22 43.69 -79.4 124
............................
.............................
I want to merge them applying condition on three different columns.
if the value of df1['Date']==df2['Date'] & df1[Latitude]==df2[Latitude] & df1[Longitude]==df2[Longitude]
then we will add the column with having data who satisfied this condition
Example
final df:
Date Latitude Longitude Car_Accident Route
2/20/22 43.50 -79.1 35 123
2/20/22 43.68 -79.0 30 111
2/10/22 43.69 -79.4 40 124
2/10/22 43.69 -79.4 34 124
Method I tried:
I tried inner join as well as indexing. Sometimes I had memory error, sometimes it gives null table or left table if I use left join.pandas.merge(df1, df2, on=[Date,Lat,Long]) or dataframe.merge
anything is not giving proper answer.
Problem I faced:
In inner join cartesian product is consuming huge computation process and also I don't is this the right method.
CodePudding user response:
Using pandas.DataFrame.drop_duplicates
right after the left merge should give you the result you're looking for:
final_df= (
df1.merge(df2, on=["Date", "Latitude", "Longitude"], how="left")
.drop_duplicates(ignore_index=True)
)
# Output :
print(final_df)
Date Latitude Longitude Car_Accident Route
0 2/10/22 43.69 -79.4 40 124
1 2/10/22 43.69 -79.4 34 124
2 2/20/22 43.68 -79.0 30 111
3 2/20/22 43.50 -79.1 35 123