Create new pandas column based on conditions stated in a table-CodePudding

I have an excel-table called rules_table where each row represents a rule with a column representing the resulting category when that rule is true:

Legs	Eyes	Color	Description	Category
8	6	NaN	Small	Spider
4	2	Black	Friendly	Dog
2	NaN	NaN	Tall	Human

I.e. ignoring the NaN's, the table would create rules as shown in the pseudocode here:

If legs == 8 & Eyes == 6 & Description.contains("Small") then Category = "Spider"
If legs == 4 & Eyes == 2 & Color = "Black" & Description.contains("Friendly") then Category = "Dog"
If legs == 2 & Description.contains("Tall") then Category = "Human"

I also have another table called data_table with same format as the rules_table except it is missing the category column and usually does not contain NaN:

Legs	Eyes	Color	Description
8	6	Brown	The creature is a small...
13	2	Orange	This is...
4	2	Black	This friendly creature...
2	2	White	The creature here is tall...
1	11	Yellow	The creature here is...

My goal is to add the category of the rules_table to the data_table whenever the rule applies, such that executing the code:

complete_table = my_function(rules_table, data_table)

Yields the complete_table:

Legs	Eyes	Color	Description	Category
8	6	Brown	The creature is a small...	Spider
13	2	Orange	This is...	NaN
4	2	Black	This friendly creature...	Dog
2	2	White	The creature here is tall...	Human
1	11	Yellow	The creature here is...	NaN

I am currently loading both tables as pandas dataframes, but am open to all options, note that I have millions of rows so efficiency is important to consider (but not critical).

I have tried two approaches

Approach 1: I have tried to join/merge the tables and make a work-around function for executing the "description.contains" part of the rule. However, the NaN's are making it tricky for me, and I am not sure how I should work around that.

Approach 2: I have tried iterating over each row of the rules_table, and then create a list of filters and a list of desired values which I then use together with np.select. However, I cannot figure out how to programatically construct executable code, and therefore end up with strings I cannot use as intended.

Do you have a suggestion for how I may proceed here? I am getting a bit stuck

I can share code if you want, but I am getting stuck on a more fundamental level than just syntax.

CodePudding user response：

If you are familiar with SQL, this problem would have easily solved with its flexible JOIN statements. In MS SQL Server, you can solve your problem like this:

FROM      data_table   d 
LEFT JOIN rules_table  r   ON (d.Legs  = r.Legs)
                          AND (d.Eyes  = r.Eyes  OR r.Eyes  IS NULL)
                          AND (d.Color = r.Color OR r.Color IS NULL)
                          AND (CHARINDEX(r.Category, d.Category) != -1)

Unfortunately, pandas's joins (as implemented by pd.join and pd.merge) are no where as flexible. One way to overcome this is to first perform a cross join and then filter the intermediary result:

def my_function(rules_table, data_table):
    # Make a unique number for each row
    # To prevent changing the original data_table, we make a new copy
    new_data_table = data_table.assign(RowNumber=range(len(data_table)))

    # Join every row in new_data_table to every row in rules_table
    # We will filter for the matches later
    tmp = new_data_table.merge(rules_table, how='cross', suffixes=('', '_rules'))

    # Filter for the matches
    match = (
        ( tmp['Legs']  == tmp['Legs_rules'] ) &
        ((tmp['Eyes']  == tmp['Eyes_rules'] ) | tmp['Eyes_rules'].isna()) &
        ((tmp['Color'] == tmp['Color_rules']) | tmp['Color_rules'].isna()) &
        tmp.apply(lambda row: row['Description_rules'].lower() in row['Description'].lower(), axis=1)
    )

    # Perform another left join to produce final result
    result = new_data_table.merge(tmp.loc[match, ['RowNumber', 'Category']], how='left', on='RowNumber')
    return result.drop(columns='RowNumber')