Home > front end >  Adding a dataframe column of a lookup matching multiple columns, but not a merge / join
Adding a dataframe column of a lookup matching multiple columns, but not a merge / join

Time:06-08

Simplified dfs:

df = pd.DataFrame(
    {
        "ID": [6, 2, 4],
        "ID2": [2, 3, 4],
        "foo": [2, 3, 4],
    }
)

df2 = pd.DataFrame(
    {
        "ID": [6, 2, 4],
        "ID2": [1, 3, 2],
        "foo": [2, 3, 4],
        "to_ignore": ["idk", "bar", "whatever"],
        "A": ["unwanted str", "desired str", "unwanted str"],
    }
)

We want to add a column to df which has values from df2's A column where values from columns ID and ID2 match like so:

   ID  ID2       NewCol
0   6    2          NaN
1   2    3  desired str
2   4    4          NaN

We can do this where one column's values match:

df["NewCol"] = df["ID"].map(df2.drop_duplicates("ID").set_index("ID")["A"])

How can we make this only work for entries where both ID and ID2 match? df["NewCol"] = df["ID", "ID2"].map(df2.drop_duplicates("ID").set_index(["ID", "ID2"])["A"]) obviously doesn't work.

We could concatenate the ID cols and crudely use this above approach while risking concatenated ID strs incorrectly matching:

for d in [df, df2]:
    d['lookup_combined'] = d["ID"]   d["ID2"]

df["NewCol"] = df["lookup_combined"].map(df2.drop_duplicates("lookup_combined").set_index("lookup_combined")["A"])

for d in [df, df2]:
    d.drop(columns="lookup_combined", inplace=True)

But this is clunky and error prone. Can we do better?

Note we don't want to merge / join the tables as there are more columns and other complications removed for question simplicity. We only want a new column named by a string we supply.

CodePudding user response:

You are looking for the function merge. You just need to indicate the columns on which you are merging on as follows:

df.merge(df2, how='left', on = ['ID', 'ID2'])
 
   ID  ID2            A
0   6    2          NaN
1   2    3  desired str
2   4    4          NaN

Edit:

I still do not see where merge is an issue: with new data, just do:

vars = ['ID', 'ID2']
df[vars].merge(df2[vars   ['A']], how='left', on =vars)

   ID  ID2            A
0   6    2          NaN
1   2    3  desired str
2   4    4          NaN
  • Related