I am pulling data from a database into pandas and it comes in the following format:
OperationID | Procedure | Operation | ParentProcedure | ParentOperation |
---|---|---|---|---|
65510 | Seed | Setup | (empty string) | (empty sting) |
65511 | Seed | Inoc | Seed | Setup |
65512 | Seed | Culture | Seed | Culture |
There are a few thousand rows and the parent of a row is not always the preceding row (but often is). Each row represents an Operation. Each row has a unique ID. The combination of Procedure and Operation in each row should also be unique (neither are unique on their own). The ParentProcedure and ParentOperation identify another row in the table which is the "parent" of the row in question (empty strings for these signify that the operation has no parent). Unfortunately the database doesn't contain the ID of the parent operation, so I want to look this up and add it as a column.
I'm trying to add a new column called ParentOperationID where each entry does a lookup on Procedure and Operation with values from ParentProcedure and ParentOperation and finds the corresponding OperationID from the first hit (should only be one hit anyway).
I can do it in an Excel Table with the following (which must be entered as an array formula i.e. Ctrl Shift Enter), but am struggling to implement something similar in pandas:
=INDEX([OperationID],MATCH([@[ParentProcedure]]&[@[ParentOperation]],[Procedure]&[Operation],0))
The result should look something like this:
OperationID | Procedure | Operation | ParentProcedure | ParentOperation | ParentOperationID |
---|---|---|---|---|---|
65510 | Seed | Setup | (empty string) | (empty sting) | NaN |
65511 | Seed | Inoc | Seed | Setup | 65110 |
65512 | Seed | Culture | Seed | Culture | 65111 |
An operation without a parent should return NaN or None or some similar flag, rather than zero. An error should be thrown if a parent can't be found.
CodePudding user response:
You can self join on your data and select the columns you want.
df = (df.merge(df,
left_on=['Procedure', 'Operation'],
right_on=['ParentProcedure', 'ParentOperation',
suffixes=['', '_parent'])
.rename(columns={'OperationID_parent': 'ParentOperationID'})
)
this will create a new column called OperationID_parent
. You can then rename it to ParentOperationID
.