Home > Blockchain >  Create new pandas column based on conditions stated in a table
Create new pandas column based on conditions stated in a table

Time:10-14

I have an excel-table called rules_table where each row represents a rule with a column representing the resulting category when that rule is true:

Legs Eyes Color Description Category
8 6 NaN Small Spider
4 2 Black Friendly Dog
2 NaN NaN Tall Human

I.e. ignoring the NaN's, the table would create rules as shown in the pseudocode here:

If legs == 8 & Eyes == 6 & Description.contains("Small") then Category = "Spider"
If legs == 4 & Eyes == 2 & Color = "Black" & Description.contains("Friendly") then Category = "Dog"
If legs == 2 & Description.contains("Tall") then Category = "Human"

I also have another table called data_table with same format as the rules_table except it is missing the category column and usually does not contain NaN:

Legs Eyes Color Description
8 6 Brown The creature is a small...
13 2 Orange This is...
4 2 Black This friendly creature...
2 2 White The creature here is tall...
1 11 Yellow The creature here is...

My goal is to add the category of the rules_table to the data_table whenever the rule applies, such that executing the code:

complete_table = my_function(rules_table, data_table)

Yields the complete_table:

Legs Eyes Color Description Category
8 6 Brown The creature is a small... Spider
13 2 Orange This is... NaN
4 2 Black This friendly creature... Dog
2 2 White The creature here is tall... Human
1 11 Yellow The creature here is... NaN

I am currently loading both tables as pandas dataframes, but am open to all options, note that I have millions of rows so efficiency is important to consider (but not critical).

I have tried two approaches

Approach 1: I have tried to join/merge the tables and make a work-around function for executing the "description.contains" part of the rule. However, the NaN's are making it tricky for me, and I am not sure how I should work around that.

Approach 2: I have tried iterating over each row of the rules_table, and then create a list of filters and a list of desired values which I then use together with np.select. However, I cannot figure out how to programatically construct executable code, and therefore end up with strings I cannot use as intended.

Do you have a suggestion for how I may proceed here? I am getting a bit stuck

I can share code if you want, but I am getting stuck on a more fundamental level than just syntax.

CodePudding user response:

If you are familiar with SQL, this problem would have easily solved with its flexible JOIN statements. In MS SQL Server, you can solve your problem like this:

FROM      data_table   d 
LEFT JOIN rules_table  r   ON (d.Legs  = r.Legs)
                          AND (d.Eyes  = r.Eyes  OR r.Eyes  IS NULL)
                          AND (d.Color = r.Color OR r.Color IS NULL)
                          AND (CHARINDEX(r.Category, d.Category) != -1)

Unfortunately, pandas's joins (as implemented by pd.join and pd.merge) are no where as flexible. One way to overcome this is to first perform a cross join and then filter the intermediary result:

def my_function(rules_table, data_table):
    # Make a unique number for each row
    # To prevent changing the original data_table, we make a new copy
    new_data_table = data_table.assign(RowNumber=range(len(data_table)))

    # Join every row in new_data_table to every row in rules_table
    # We will filter for the matches later
    tmp = new_data_table.merge(rules_table, how='cross', suffixes=('', '_rules'))

    # Filter for the matches
    match = (
        ( tmp['Legs']  == tmp['Legs_rules'] ) &
        ((tmp['Eyes']  == tmp['Eyes_rules'] ) | tmp['Eyes_rules'].isna()) &
        ((tmp['Color'] == tmp['Color_rules']) | tmp['Color_rules'].isna()) &
        tmp.apply(lambda row: row['Description_rules'].lower() in row['Description'].lower(), axis=1)
    )

    # Perform another left join to produce final result
    result = new_data_table.merge(tmp.loc[match, ['RowNumber', 'Category']], how='left', on='RowNumber')
    return result.drop(columns='RowNumber')
  • Related