Hi I have two dataframes
df1
>>> print(df1)
Unnamed: 0 Latitude Longitude Altitude
0 0 -89.9 -180.0 0.0
1 1 -89.9 -179.0 0.0
2 2 -89.9 -178.0 0.0
3 3 -89.9 -177.0 0.0
4 4 -89.9 -176.0 0.0
... ... ... ... ...
64795 64795 89.1 175.0 267.0
64796 64796 89.1 176.0 275.0
64797 64797 89.1 177.0 283.0
64798 64798 89.1 178.0 288.0
64799 64799 89.1 179.0 292.0
[64800 rows x 4 columns]
df2
>>> print(df2)
Unnamed: 0 Latitude Longitude Value
0 0 -89.9 -180.0 -7.414481
1 1 -89.9 -179.0 -7.413804
2 2 -89.9 -178.0 -7.413334
3 3 -89.9 -177.0 -7.413073
4 4 -89.9 -176.0 -7.413023
... ... ... ... ...
64795 64795 89.1 175.0 -7.790705
64796 64796 89.1 176.0 -7.783264
64797 64797 89.1 177.0 -7.776208
64798 64798 89.1 178.0 -7.769540
64799 64799 89.1 179.0 -7.763257
[64800 rows x 4 columns]
I want to merge the two dataframes and change the values from df2 as a function of the height from df1. The cooridinates are the same for both dfs, so why is there an increase in data
def recomb_data(df1, df2):
dfdiff = pd.merge(df1, df2,
on=['Latitude', 'Longitude'],
how='outer')
dfdiff["a_diffO"] = (dfdiff["Altitude"] *-0.002)
sum_column = (dfdiff["Value"] dfdiff["a_diffO"])
dfdiff['ValueHeight'] = sum_column
return (dfdiff)
However, the dataframe has significantly moredata.
>>> print(dfdiff)
Unnamed: 0_x Latitude Longitude Altitude Unnamed: 0_y Value a_diffO ValueHeight
0 0 -89.9 -180.0 0.0 0 -7.414481 -0.000 -7.414481
1 1 -89.9 -179.0 0.0 1 -7.413804 -0.000 -7.413804
2 2 -89.9 -178.0 0.0 2 -7.413334 -0.000 -7.413334
3 3 -89.9 -177.0 0.0 3 -7.413073 -0.000 -7.413073
4 4 -89.9 -176.0 0.0 4 -7.413023 -0.000 -7.413023
... ... ... ... ... ... ... ... ...
67551 64795 89.1 175.0 267.0 64795 -7.790705 -0.534 -8.324705
67552 64796 89.1 176.0 275.0 64796 -7.783264 -0.550 -8.333264
67553 64797 89.1 177.0 283.0 64797 -7.776208 -0.566 -8.342208
67554 64798 89.1 178.0 288.0 64798 -7.769540 -0.576 -8.345540
67555 64799 89.1 179.0 292.0 64799 -7.763257 -0.584 -8.347257
[67556 rows x 8 columns]
What am I doing wrong? Am I using the wrong merging method or what could be the cause of the error?
my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# from Subfunctions.Extract_elevation import dfl
# from Subfunctions.Difference_calculation import dfdiff, dfdiffe
from mpl_toolkits.basemap import Basemap
def load_elev():
el = pd.read_csv(r'Df1')
el.drop_duplicates(subset=['Longitude', 'Latitude'])
return el
def Load_grid():
grd = pd.read_csv(r'Df2')
grd.drop_duplicates(subset=['Longitude', 'Latitude'])
return grd
def recomb_data(el, grd):
dfdiff = pd.merge(el, grd,
on=['Latitude', 'Longitude'],
how='inner')
dfdiff["a_diffO"] = (dfdiff["Altitude"] *-0.002)
sum_column = (dfdiff["Value"] dfdiff["a_diffO"])
dfdiff['ValueHeight'] = sum_column
return (dfdiff)
el = load_elev()
grd = Load_grid()
print(el)
print(grd)
(el.sort_values(['Latitude', 'Longitude'])[['Latitude', 'Longitude']].values == grd.sort_values(['Latitude', 'Longitude'])[['Latitude', 'Longitude']].values).all()
dfdiff = recomb_data(el, grd)
print(dfdiff)
CodePudding user response:
If you are sure the latitude and longitude columns are identical between the two datasets, then the only explanation for a larger result after the merge is that some rows contain duplicate coordinates. Try using drop_duplicates() on both inputs before the merge to see if that changes the result.