I have an excel-table called rules_table where each row represents a rule with a column representing the resulting category when that rule is true:
Legs | Eyes | Color | Description | Category |
---|---|---|---|---|
8 | 6 | NaN | Small | Spider |
4 | 2 | Black | Friendly | Dog |
2 | NaN | NaN | Tall | Human |
I.e. ignoring the NaN's, the table would create rules as shown in the pseudocode here:
If legs == 8 & Eyes == 6 & Description.contains("Small") then Category = "Spider"
If legs == 4 & Eyes == 2 & Color = "Black" & Description.contains("Friendly") then Category = "Dog"
If legs == 2 & Description.contains("Tall") then Category = "Human"
I also have another table called data_table with same format as the rules_table except it is missing the category column and usually does not contain NaN:
Legs | Eyes | Color | Description |
---|---|---|---|
8 | 6 | Brown | The creature is a small... |
13 | 2 | Orange | This is... |
4 | 2 | Black | This friendly creature... |
2 | 2 | White | The creature here is tall... |
1 | 11 | Yellow | The creature here is... |
My goal is to add the category of the rules_table to the data_table whenever the rule applies, such that executing the code:
complete_table = my_function(rules_table, data_table)
Yields the complete_table:
Legs | Eyes | Color | Description | Category |
---|---|---|---|---|
8 | 6 | Brown | The creature is a small... | Spider |
13 | 2 | Orange | This is... | NaN |
4 | 2 | Black | This friendly creature... | Dog |
2 | 2 | White | The creature here is tall... | Human |
1 | 11 | Yellow | The creature here is... | NaN |
I am currently loading both tables as pandas dataframes, but am open to all options, note that I have millions of rows so efficiency is important to consider (but not critical).
I have tried two approaches
Approach 1: I have tried to join/merge the tables and make a work-around function for executing the "description.contains" part of the rule. However, the NaN's are making it tricky for me, and I am not sure how I should work around that.
Approach 2: I have tried iterating over each row of the rules_table, and then create a list of filters and a list of desired values which I then use together with np.select. However, I cannot figure out how to programatically construct executable code, and therefore end up with strings I cannot use as intended.
Do you have a suggestion for how I may proceed here? I am getting a bit stuck
I can share code if you want, but I am getting stuck on a more fundamental level than just syntax.
CodePudding user response:
If you are familiar with SQL, this problem would have easily solved with its flexible JOIN
statements. In MS SQL Server, you can solve your problem like this:
FROM data_table d
LEFT JOIN rules_table r ON (d.Legs = r.Legs)
AND (d.Eyes = r.Eyes OR r.Eyes IS NULL)
AND (d.Color = r.Color OR r.Color IS NULL)
AND (CHARINDEX(r.Category, d.Category) != -1)
Unfortunately, pandas's joins (as implemented by pd.join
and pd.merge
) are no where as flexible. One way to overcome this is to first perform a cross join and then filter the intermediary result:
def my_function(rules_table, data_table):
# Make a unique number for each row
# To prevent changing the original data_table, we make a new copy
new_data_table = data_table.assign(RowNumber=range(len(data_table)))
# Join every row in new_data_table to every row in rules_table
# We will filter for the matches later
tmp = new_data_table.merge(rules_table, how='cross', suffixes=('', '_rules'))
# Filter for the matches
match = (
( tmp['Legs'] == tmp['Legs_rules'] ) &
((tmp['Eyes'] == tmp['Eyes_rules'] ) | tmp['Eyes_rules'].isna()) &
((tmp['Color'] == tmp['Color_rules']) | tmp['Color_rules'].isna()) &
tmp.apply(lambda row: row['Description_rules'].lower() in row['Description'].lower(), axis=1)
)
# Perform another left join to produce final result
result = new_data_table.merge(tmp.loc[match, ['RowNumber', 'Category']], how='left', on='RowNumber')
return result.drop(columns='RowNumber')