Home > Back-end >  Lookup and fill values in one data frame by matching closest value in another
Lookup and fill values in one data frame by matching closest value in another

Time:08-31

Hi I have 2 dataframes:

df 1:

Year City     Address  Total_Area Basement_Area
1989 Ottawa   empty    280         130

Second data frame is a bigger one where I need to match closest neighbour by Total Area and basement Area

df2:

Year Address(with city in it) Total_Area Basement_Area
1989 123 Test Ottawa          279        120
1989 146 Test Ottawa          276        140
1989 156 Test Ottawa          283        134

Expected result: df1:

Year City     Address       Total_Area Basement_Area
1989 Ottawa   156 Test Ottawa    280         130

I am struggling to find a correct way to do this. I made a function to extract filter values from df1 but can't get it to filter and give me the address

def filter_df(df,year,city,basement=0, floor_area=0, heated_floor_area=0,mp_storeys=0):
    if basement == 0:
        df = df[df[const.BASEMENT_COLUMN_df2_NAME].isnull()]
    else:
        df = df[(df[const.BASEMENT_COLUMN_df2_NAME]- floor_area).abs()< 10]
    


    filtered_df = df[(df[const.YEAR_COLUMN_df2_NAME] == year)]
    filtered_df = df[(df[const.ADDRESS_COLUMN_df2_NAME].str.contains(city))]
    filtered_df = df[(df[const.TOTAL_AREA_COLUMN_df2_NAME] - floor_area).abs()< 10]

    return filtered_df[0]

Please advise.

CodePudding user response:

I'm sure there are more elegant solutions, but this one works.

I've added another entry to the dataframes for testing:

data1 = {'Year': [1989, 1990],
     'City': ['Ottawa', 'Chicago'],
     'Address': ['empty', 'empty'],
     'Total_Area': [280, 340],
     'Basement_Area': [130, 210]}
df1 = pd.DataFrame(data1)
data2 = {'Year': [1989, 1989, 1989, 1990, 1990],
     'Address(with city in it)': ['123 Test Ottawa', '146 Test Ottawa', '156 Test Ottawa', 
     '190 Test Chicago', '201 Test Chicago'],
     'Total_Area': [179, 276, 283, 343, 338],
     'Basement_Area': [120, 140, 134, 210, 214]}
df2 = pd.DataFrame(data2)

Add City column to df2:

df2['City'] = [entry[2] for entry in df2['Address(with city in it)'].str.split(' ')]

Iterate over the rows in df1 and add the entry from df2 with the minimum absolute difference for Total_Area and Basement_Area.

for row in df1.iterrows():
    index = row[0]
    year, city, address, total_area, basement_area = row[1]
    df_compare = df2[(df2.Year == year) & (df2.City == city)]
    df1.loc[index, 'Address'] = df2.loc[
        (abs(df_compare[['Total_Area', 'Basement_Area']] - [total_area, basement_area])
        ).sum(axis=1).idxmin()]['Address(with city in it)']

It's more pythonic to use something like .map() for the last step, but I wasn't able to get this working.

CodePudding user response:

I would use a cross join, or a join on a common feature (looks like you want the same year for example to match your closest neighbour), if you df2 is not too big (you can quickly have memory issue else).

Something like :

df = df1.reset_index().merge(
    df2,
    on="Year",
    how="inner",
    suffixes=("_df1", "_df2")
)

df["diff"] = abs(df["Basement_Area_df1"] - df["Basement_Area_df2"])   abs(df["Total_Area_df1"] - df["Total_Area_df2"])

df.sort_values(["index", "diff"]).groupby("index").head(1)

Note that I'm using the index of df1 (that I get via reset_index() to keep only one match per df1 rows (hence the use of groupby / head).

If you df2 is too large to perform this kind of join, even if you add some features in your merge, you should take a look at the library recordlinkage which may help you with this kind of "inexact" join.

  • Related