Home > Blockchain >  Performing the equivalent of a vlookup within a merged df in Pandas
Performing the equivalent of a vlookup within a merged df in Pandas

Time:10-07

I had no pandas/python experience this time last week so I have had a steep learning curve in trying to transfer a complex, multi-step process that was being done in excel, into pandas. Sorry if the following is unclear.

I merged 2 dataframes. I have a column, let's call it 'new_ID', with new ID names from originaldf1, some of which say 'no match was found'. For the 'no match was found' entries I would like to get the old ID number from originaldf2, which is another column in currentdf, let's call this col 'old_ID'. So, I would like to do something like an excel vlookup where I say: "if there is 'no match was found' in col 'new_ID', give me the ID that is in col 'old_ID', in that same row". The output I would like is just a list of all the old IDs where no match was found.

I've tried a few solutions that I found on here but all just give me blank outputs. I'm assuming this is because they aren't searching each individual instance of "no match found". For example I tried:

deletes = mydf.loc[mydf['new_ID'] == "no match was found", ['old_ID']

this turns out with just the col header, then all blank.

is what i'm trying to do possible in pandas? or maybe i'm stuck in excel ways of thinking, and there is a better/different way!?...

CodePudding user response:

Working off your example im going to assume all new_ID entries are numbers only unless there is no match.

so if your dataframe looks like this (assuming this 2nd column has any values, i didnt know so i put 0's)

new_ID originaldf2
1 0
2 0
3 0
no match 4

Next we can check to see if your new_id column has an id or not by seeing if it contains a number using str.isnumeric()

has_id =df1.new_ID.str.isnumeric()
has_id>>>

0     True
1     True
2     True
3    False
4     True
Name: new_ID, dtype: bool

Then finally we'll use where() what this does it takes the first argument cond that we've passed the has_id bollean filter through and checks whether its True or False. If true, it keeps original value, if false, goes to the argument found in other which in this case we assigned to the second column of our dataframe.

df1.where(has_id, df.iloc[:,1], axis=0)>>>
  new_ID old_df_2
0   1   0
1   2   0
2   3   0
3   4   4

CodePudding user response:

Welcome to Python. What you are trying to do is a straightforward task in pandas. Each column of a pandas Dataframe is a Series object; basically a list of values. You are trying to find which row numbers (aka indeces) satisfy this criteria: new_id == "no match was found". This can be done by pulling the column out of the dataframe and applying a lambda function. I would recommend pasting this code in a new file and playing around to see how it works.

import pandas as pd

# Create test data frame
df = pd.DataFrame(columns=('new_id','old_id'))
df.loc[0] = (1, None)
df.loc[1] = ("no match", 4)
df.loc[2] = (3, None)
df.loc[3] = ("no match", 4)
print("\nHere is our test dataframe:")
print(df)

print("\nShow the values of the 'new_id' that meet our criteria:")
print(df['new_id'][lambda x: x == "no match"])

# Pull the index from these rows
indeces = df['new_id'][lambda x: x == "no match"].index.tolist()
print("\nIndeces:\n", indeces)

print("\nShow only the rows of the data frame that match 'indeces':")
print(df.loc[indeces]['old_id'])

A couple of notes about this code:

  • A lambda function takes each value of a list and does something to it. In this case we are referring to each value of 'new_id' as 'x', and then checking if x == "no match"
  • df.loc[] refers to a specific row of a data frame. df.loc[2] refers to the 3rd row (since pandas data frames are generally zero-indexed)
  • Related