I would like to know how experienced sql users would compute a lot of counts with different conditions. I have a table [population] and a table [sql_rules].
My [population] table looks something like this with ~1.7M rows and ~30 columns.
eye_color | hair_color | age | origin | income | ... |
---|---|---|---|---|---|
blue | brown | 36 | US | 40000 | ... |
green | yellow | 17 | UK | 60000 | ... |
brown | black | 42 | DE | 20000 | ... |
black | black | 28 | DK | 80000 | ... |
... | ... | ... | ... | ... | ... |
My [sql_rules] table looks something like this, with ~800 rows and at maximum 8 rules (average ~5) :
row_number | rule_1 | rule_2 | rule_3 | rule_4 | ... |
---|---|---|---|---|---|
1 | hair_color = 'brown' | age < 27 | origin IN ('US', 'UK') | income >= 40000 | ... |
2 | hair_color = 'black' | origin IN ('DK', 'FR') | age < 10 | income >= 40000 | ... |
3 | hair_color = 'yellow' | origin IN ('TH', 'PE') | age > 34 | ||
4 | hair_color = 'black' | age > 99 | origin IN ('US', 'UK') | income >= 40000 | ... |
5 | age < 27 | income >= 100000 | |||
... | ... | ... | ... | ... | ... |
What I need to do is basically to 'iterate' by row and to compute the count when applying the rules one after the other to obtain one 'count' by cell of my sql_rules table. Basically, since it is hard to explain with words, here are the count I would like to obtain :
row_number | rule_1 | rule_2 | rule_3 | rule_4 | ... |
---|---|---|---|---|---|
1 | SELECT COUNT(*) FROM population WHERE hair_color = 'brown' | SELECT COUNT(*) FROM population WHERE hair_color = 'brown' AND age < 27 | SELECT COUNT(*) FROM population WHERE hair_color = 'brown' AND age < 27 AND origin IN ('US', 'UK') | SELECT COUNT(*) FROM population WHERE hair_color = 'brown' AND age < 27 AND origin IN ('US', 'UK') AND income >= 40000 | ... |
2 | SELECT COUNT(*) FROM population WHERE hair_color = 'black' | SELECT COUNT(*) FROM population WHERE hair_color = 'black' AND origin IN ('DK', 'FR') | SELECT COUNT(*) FROM population WHERE hair_color = 'black' AND origin IN ('DK', 'FR') AND age < 10 | SELECT COUNT(*) FROM population WHERE hair_color = 'black' AND origin IN ('DK', 'FR') AND age < 10 AND income >= 40000 | ... |
3 | ... | ... | ... | ... | ... |
5 | SELECT COUNT(*) FROM population WHERE age < 27 | SELECT COUNT(*) FROM population WHERE age < 27 AND income >= 100000 | |||
... | ... | ... | ... | ... | ... |
What I did for now was to use my table [sql_rules] to create all the SQL 'queries'. Since they can often be similar until the third or forth column, I avoided recomputing the same count multiple times and took the values if a particular 'count' had already been computed.
Can someone think of a fastest way to do this ? My solution is working fine but I can't think of a faster way to do this. I'm able to work with sql, Python, R.
Edit : Ideally, what I'm looking for is some nice ideas that could potentially accelerate the process, for the sake of 'interest'. Here are examples of ideas that are not meant to be great, but to exemplify what I am looking for :
- Using the sql_rules table, let's say with (~800 rows * ~5 rules in average), 4000 rules, create the 4000 sql queries and run them all one after the other until it's done.
- Create the 4000 'queries' as above but once it is done, keep only the unique/distinct rules which would maybe reduce the count to 1000 'SELECT COUNT(*) FROM... .
- loop over each line, create a duplicate of the population table and remove from the duplicate table the rows where the 'rules' are not met. That way, each condition/query/rules of the rest of the row will be computed on a smaller and smaller table.
Edit -------------------------------------
- The table is unfortunately proprietary but I will create something equivalent using public data as soon as possible.
- The sql_rules table is ~800rows and 8 columns (rules). Most of the time, one single row does not contain 8 rules and the a single row is thus looking like that :
row_number | rule_1 | rule_2 | rule_3 | rule_4 | rule_5 | rule_6 | rule_7 | rule_8 |
---|---|---|---|---|---|---|---|---|
hair = 'blue' | eyes IN ('green', 'brown') | income BETWEEN 10000 AND 20000 | NA | NA | NA | NA | NA | NA |
with the values I need to find for this particular row being equivalent to
row_number | rule_1 | rule_2 | rule_3 | rule_4 | rule_5 | rule_6 | rule_7 | rule_8 |
---|---|---|---|---|---|---|---|---|
SELECT COUNT(*) FROM population WHERE hair = 'blue' | SELECT COUNT(*) FROM population WHERE hair = 'blue' AND eyes IN ('green', 'brown') | SELECT COUNT(*) FROM population WHERE hair = 'blue' AND eyes IN ('green', 'brown') AND income BETWEEN 10000 AND 20000 | 0 | 0 | 0 | 0 | 0 | 0 |
CodePudding user response:
You can use conditional sum , still query must be build for each set of rules, but where
part can be easily implemented.
select
sum(1) filter (where hair_color = 'brown') rule1,
sum(1) filter (where age < 27) rule2,
sum(1) filter (where origin in ('US','UK')) rule3
from population