Home > Enterprise >  Theoretical - Fastest way to compute different counts
Theoretical - Fastest way to compute different counts

Time:07-01

I would like to know how experienced sql users would compute a lot of counts with different conditions. I have a table [population] and a table [sql_rules].

My [population] table looks something like this with ~1.7M rows and ~30 columns.

eye_color hair_color age origin income ...
blue brown 36 US 40000 ...
green yellow 17 UK 60000 ...
brown black 42 DE 20000 ...
black black 28 DK 80000 ...
... ... ... ... ... ...

My [sql_rules] table looks something like this, with ~800 rows and at maximum 8 rules (average ~5) :

row_number rule_1 rule_2 rule_3 rule_4 ...
1 hair_color = 'brown' age < 27 origin IN ('US', 'UK') income >= 40000 ...
2 hair_color = 'black' origin IN ('DK', 'FR') age < 10 income >= 40000 ...
3 hair_color = 'yellow' origin IN ('TH', 'PE') age > 34
4 hair_color = 'black' age > 99 origin IN ('US', 'UK') income >= 40000 ...
5 age < 27 income >= 100000
... ... ... ... ... ...

What I need to do is basically to 'iterate' by row and to compute the count when applying the rules one after the other to obtain one 'count' by cell of my sql_rules table. Basically, since it is hard to explain with words, here are the count I would like to obtain :

row_number rule_1 rule_2 rule_3 rule_4 ...
1 SELECT COUNT(*) FROM population WHERE hair_color = 'brown' SELECT COUNT(*) FROM population WHERE hair_color = 'brown' AND age < 27 SELECT COUNT(*) FROM population WHERE hair_color = 'brown' AND age < 27 AND origin IN ('US', 'UK') SELECT COUNT(*) FROM population WHERE hair_color = 'brown' AND age < 27 AND origin IN ('US', 'UK') AND income >= 40000 ...
2 SELECT COUNT(*) FROM population WHERE hair_color = 'black' SELECT COUNT(*) FROM population WHERE hair_color = 'black' AND origin IN ('DK', 'FR') SELECT COUNT(*) FROM population WHERE hair_color = 'black' AND origin IN ('DK', 'FR') AND age < 10 SELECT COUNT(*) FROM population WHERE hair_color = 'black' AND origin IN ('DK', 'FR') AND age < 10 AND income >= 40000 ...
3 ... ... ... ... ...
5 SELECT COUNT(*) FROM population WHERE age < 27 SELECT COUNT(*) FROM population WHERE age < 27 AND income >= 100000
... ... ... ... ... ...

What I did for now was to use my table [sql_rules] to create all the SQL 'queries'. Since they can often be similar until the third or forth column, I avoided recomputing the same count multiple times and took the values if a particular 'count' had already been computed.

Can someone think of a fastest way to do this ? My solution is working fine but I can't think of a faster way to do this. I'm able to work with sql, Python, R.

Edit : Ideally, what I'm looking for is some nice ideas that could potentially accelerate the process, for the sake of 'interest'. Here are examples of ideas that are not meant to be great, but to exemplify what I am looking for :

  1. Using the sql_rules table, let's say with (~800 rows * ~5 rules in average), 4000 rules, create the 4000 sql queries and run them all one after the other until it's done.
  2. Create the 4000 'queries' as above but once it is done, keep only the unique/distinct rules which would maybe reduce the count to 1000 'SELECT COUNT(*) FROM... .
  3. loop over each line, create a duplicate of the population table and remove from the duplicate table the rows where the 'rules' are not met. That way, each condition/query/rules of the rest of the row will be computed on a smaller and smaller table.

Edit -------------------------------------

  • The table is unfortunately proprietary but I will create something equivalent using public data as soon as possible.
  • The sql_rules table is ~800rows and 8 columns (rules). Most of the time, one single row does not contain 8 rules and the a single row is thus looking like that :
row_number rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7 rule_8
hair = 'blue' eyes IN ('green', 'brown') income BETWEEN 10000 AND 20000 NA NA NA NA NA NA

with the values I need to find for this particular row being equivalent to

row_number rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7 rule_8
SELECT COUNT(*) FROM population WHERE hair = 'blue' SELECT COUNT(*) FROM population WHERE hair = 'blue' AND eyes IN ('green', 'brown') SELECT COUNT(*) FROM population WHERE hair = 'blue' AND eyes IN ('green', 'brown') AND income BETWEEN 10000 AND 20000 0 0 0 0 0 0

CodePudding user response:

You can use conditional sum , still query must be build for each set of rules, but where part can be easily implemented.

select 
  sum(1) filter (where hair_color = 'brown') rule1,
  sum(1) filter (where age < 27) rule2,
  sum(1) filter (where origin in ('US','UK')) rule3
 from population

DB Fiddle

  • Related