Home > database >  Add column to pandas DF that does a lookup within the DF based on values of two other columns in the
Add column to pandas DF that does a lookup within the DF based on values of two other columns in the

Time:11-09

I am pulling data from a database into pandas and it comes in the following format:

OperationID Procedure Operation ParentProcedure ParentOperation
65510 Seed Setup (empty string) (empty sting)
65511 Seed Inoc Seed Setup
65512 Seed Culture Seed Culture

There are a few thousand rows and the parent of a row is not always the preceding row (but often is). Each row represents an Operation. Each row has a unique ID. The combination of Procedure and Operation in each row should also be unique (neither are unique on their own). The ParentProcedure and ParentOperation identify another row in the table which is the "parent" of the row in question (empty strings for these signify that the operation has no parent). Unfortunately the database doesn't contain the ID of the parent operation, so I want to look this up and add it as a column.

I'm trying to add a new column called ParentOperationID where each entry does a lookup on Procedure and Operation with values from ParentProcedure and ParentOperation and finds the corresponding OperationID from the first hit (should only be one hit anyway).

I can do it in an Excel Table with the following (which must be entered as an array formula i.e. Ctrl Shift Enter), but am struggling to implement something similar in pandas:

=INDEX([OperationID],MATCH([@[ParentProcedure]]&[@[ParentOperation]],[Procedure]&[Operation],0))

The result should look something like this:

OperationID Procedure Operation ParentProcedure ParentOperation ParentOperationID
65510 Seed Setup (empty string) (empty sting) NaN
65511 Seed Inoc Seed Setup 65110
65512 Seed Culture Seed Culture 65111

An operation without a parent should return NaN or None or some similar flag, rather than zero. An error should be thrown if a parent can't be found.

CodePudding user response:

You can self join on your data and select the columns you want.

df = (df.merge(df, 
         left_on=['Procedure', 'Operation'], 
         right_on=['ParentProcedure', 'ParentOperation', 
         suffixes=['', '_parent'])
    .rename(columns={'OperationID_parent': 'ParentOperationID'})
)

this will create a new column called OperationID_parent. You can then rename it to ParentOperationID.

  • Related