Home > Blockchain >  Conditionally merge dataframes based on multiple columns in Pandas
Conditionally merge dataframes based on multiple columns in Pandas

Time:10-01

I have 2 dataframes, with df1 containing the coordinates of points and df2 containing the start and end coordinates of lines. Such as:

df1

point lon lat
P01 3 48
P02 9 44

df2

line lon.start lat.start lon.end lat.end
L01 3 48 3 49

I would like to replace the line variable in df2 with point.start and point.end, conditionally creating the column point.start based on lon.start and lat.start, and the same for point.end. Such that it becomes:

point.start point.end lon.start lat.start lon.end lat.end
P01 P05 3 48 3 49

How could I do this?

CodePudding user response:

Using merge may help you here.


# generate sample data
df1 = pd.DataFrame([['P01', 3, 48], ['P02', 9, 44], ['P03', 3, 49]], columns=('point', 'lon', 'lat'))
df2 = pd.DataFrame([['L01', 3, 48, 3, 49], ['L02', 4, 48, 4, 49]], columns=('line', 'lon.start', 'lat.start', 'lon.end', 'lat.end'))

# handle start points
dfx = pd.merge(df1.rename(columns={'lon': 'lon.start', 'lat': 'lat.start', 'point': 'point.start'}),
               df2[['line', 'lon.start', 'lat.start']],
               on=['lon.start', 'lat.start'])

# handle end points
dfy = pd.merge(df1.rename(columns={'lon': 'lon.end', 'lat': 'lat.end', 'point': 'point.end'}),
               df2[['line', 'lon.end', 'lat.end']],
               on=['lon.end', 'lat.end'])

# merge both based on lines
df = pd.merge(dfx, dfy, on='line')

The provided answer will discard lines which have missing points. If you'd rather keep them and not change the shape of df2, replace the last merge by

df = pd.merge(dfx, dfy, on='line', how='outer')

CodePudding user response:

You could do:

df3 = df2.drop(columns="line")
for suffix in ("start", "end"):
    df3 = df3.merge(
        df1.rename(columns={col: f"{col}.{suffix}" for col in df1.columns}),
        on=[f"{col}.{suffix}" for col in ("lon", "lat")],
        how="left"
    )

If the order of the columns is important then you have reorder them afterwards:

df3 = df3[[f"{col}.{suffix}" for col in df1.columns for suffix in ("start", "end")]]
  • Related