Conditionally merge dataframes based on multiple columns in Pandas-CodePudding

I have 2 dataframes, with df1 containing the coordinates of points and df2 containing the start and end coordinates of lines. Such as:

df1

point	lon	lat
P01	3	48
P02	9	44

df2

line	lon.start	lat.start	lon.end	lat.end
L01	3	48	3	49

I would like to replace the line variable in df2 with point.start and point.end, conditionally creating the column point.start based on lon.start and lat.start, and the same for point.end. Such that it becomes:

point.start	point.end	lon.start	lat.start	lon.end	lat.end
P01	P05	3	48	3	49

How could I do this?

CodePudding user response：

Using merge may help you here.


# generate sample data
df1 = pd.DataFrame([['P01', 3, 48], ['P02', 9, 44], ['P03', 3, 49]], columns=('point', 'lon', 'lat'))
df2 = pd.DataFrame([['L01', 3, 48, 3, 49], ['L02', 4, 48, 4, 49]], columns=('line', 'lon.start', 'lat.start', 'lon.end', 'lat.end'))

# handle start points
dfx = pd.merge(df1.rename(columns={'lon': 'lon.start', 'lat': 'lat.start', 'point': 'point.start'}),
               df2[['line', 'lon.start', 'lat.start']],
               on=['lon.start', 'lat.start'])

# handle end points
dfy = pd.merge(df1.rename(columns={'lon': 'lon.end', 'lat': 'lat.end', 'point': 'point.end'}),
               df2[['line', 'lon.end', 'lat.end']],
               on=['lon.end', 'lat.end'])

# merge both based on lines
df = pd.merge(dfx, dfy, on='line')

The provided answer will discard lines which have missing points. If you'd rather keep them and not change the shape of df2, replace the last merge by

df = pd.merge(dfx, dfy, on='line', how='outer')

CodePudding user response：

You could do:

df3 = df2.drop(columns="line")
for suffix in ("start", "end"):
    df3 = df3.merge(
        df1.rename(columns={col: f"{col}.{suffix}" for col in df1.columns}),
        on=[f"{col}.{suffix}" for col in ("lon", "lat")],
        how="left"
    )

If the order of the columns is important then you have reorder them afterwards:

df3 = df3[[f"{col}.{suffix}" for col in df1.columns for suffix in ("start", "end")]]