Home > Software design >  Erroneous merging of two data frames
Erroneous merging of two data frames

Time:06-25

Hi I have two dataframes

df1

>>> print(df1)
       Unnamed: 0  Latitude  Longitude  Altitude
0               0     -89.9     -180.0       0.0
1               1     -89.9     -179.0       0.0
2               2     -89.9     -178.0       0.0
3               3     -89.9     -177.0       0.0
4               4     -89.9     -176.0       0.0
...           ...       ...        ...       ...
64795       64795      89.1      175.0     267.0
64796       64796      89.1      176.0     275.0
64797       64797      89.1      177.0     283.0
64798       64798      89.1      178.0     288.0
64799       64799      89.1      179.0     292.0

[64800 rows x 4 columns]

df2

>>> print(df2)
       Unnamed: 0  Latitude  Longitude     Value
0               0     -89.9     -180.0 -7.414481
1               1     -89.9     -179.0 -7.413804
2               2     -89.9     -178.0 -7.413334
3               3     -89.9     -177.0 -7.413073
4               4     -89.9     -176.0 -7.413023
...           ...       ...        ...       ...
64795       64795      89.1      175.0 -7.790705
64796       64796      89.1      176.0 -7.783264
64797       64797      89.1      177.0 -7.776208
64798       64798      89.1      178.0 -7.769540
64799       64799      89.1      179.0 -7.763257

[64800 rows x 4 columns]

I want to merge the two dataframes and change the values from df2 as a function of the height from df1. The cooridinates are the same for both dfs, so why is there an increase in data

def recomb_data(df1, df2):
    dfdiff = pd.merge(df1, df2, 
                  on=['Latitude', 'Longitude'], 
                  how='outer')
    dfdiff["a_diffO"] = (dfdiff["Altitude"] *-0.002) 
    sum_column = (dfdiff["Value"]  dfdiff["a_diffO"])
    dfdiff['ValueHeight'] = sum_column
    return (dfdiff)

However, the dataframe has significantly moredata.

>>> print(dfdiff)
       Unnamed: 0_x  Latitude  Longitude  Altitude  Unnamed: 0_y     Value  a_diffO  ValueHeight
0                 0     -89.9     -180.0       0.0             0 -7.414481   -0.000    -7.414481
1                 1     -89.9     -179.0       0.0             1 -7.413804   -0.000    -7.413804
2                 2     -89.9     -178.0       0.0             2 -7.413334   -0.000    -7.413334
3                 3     -89.9     -177.0       0.0             3 -7.413073   -0.000    -7.413073
4                 4     -89.9     -176.0       0.0             4 -7.413023   -0.000    -7.413023
...             ...       ...        ...       ...           ...       ...      ...          ...
67551         64795      89.1      175.0     267.0         64795 -7.790705   -0.534    -8.324705
67552         64796      89.1      176.0     275.0         64796 -7.783264   -0.550    -8.333264
67553         64797      89.1      177.0     283.0         64797 -7.776208   -0.566    -8.342208
67554         64798      89.1      178.0     288.0         64798 -7.769540   -0.576    -8.345540
67555         64799      89.1      179.0     292.0         64799 -7.763257   -0.584    -8.347257

[67556 rows x 8 columns]

What am I doing wrong? Am I using the wrong merging method or what could be the cause of the error?

my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# from Subfunctions.Extract_elevation import dfl
# from Subfunctions.Difference_calculation import dfdiff, dfdiffe
from mpl_toolkits.basemap import Basemap

def load_elev():
    el = pd.read_csv(r'Df1')
    el.drop_duplicates(subset=['Longitude', 'Latitude'])
    return el

def Load_grid():
    grd = pd.read_csv(r'Df2')
    grd.drop_duplicates(subset=['Longitude', 'Latitude'])
    return grd

def recomb_data(el, grd):
    dfdiff = pd.merge(el, grd, 
                  on=['Latitude', 'Longitude'], 
                  how='inner')
    dfdiff["a_diffO"] = (dfdiff["Altitude"] *-0.002) 
    sum_column = (dfdiff["Value"]  dfdiff["a_diffO"])
    dfdiff['ValueHeight'] = sum_column
    return (dfdiff)


el = load_elev()
grd = Load_grid()
print(el)
print(grd)
(el.sort_values(['Latitude', 'Longitude'])[['Latitude', 'Longitude']].values == grd.sort_values(['Latitude', 'Longitude'])[['Latitude', 'Longitude']].values).all()

dfdiff = recomb_data(el, grd)
print(dfdiff)

CodePudding user response:

If you are sure the latitude and longitude columns are identical between the two datasets, then the only explanation for a larger result after the merge is that some rows contain duplicate coordinates. Try using drop_duplicates() on both inputs before the merge to see if that changes the result.

  • Related