I am using python and sqllite3
and I was wondering if the performance of this query can be improved?
main
table with ~100,000 rows
0 1 2 3 4 Amount
0 0 9 12 6 60 40800.0
1 0 9 12 6 61 40100.0
2 0 9 12 6 65 39900.0
3 0 9 12 6 74 40300.0
4 0 9 12 7 60 40600.0
util
table ~75,000 rows
0 1 2 Amount
0 78 75 65 9900.0
1 80 75 65 9900.0
2 80 72 65 10000.0
3 78 72 65 10000.0
4 79 75 65 10000.0
The query currently gets the Cartesian product of the two tables where the sum of the amount is between 49,700 and 50,000 and gets the first 200,000 matches if my understanding is correct.
con = sqlite3.connect(':memory:')
df.to_sql(name='main', con=con)
df1.to_sql(name='util', con=con)
query = '''
SELECT *
FROM main AS m
INNER JOIN
util AS u
ON
50000 >= m.Amount u.Amount
AND
49700 <= m.Amount u.Amount
LIMIT
200000;
'''
final_df = pd.read_sql_query(query, con)
CodePudding user response:
Since you're not matching on a column value, but on the expression m.Amount u.Amount
, it has to be computed for every possible combination of rows between the two tables (100k * 75k = 7500mil or 7.5 billion combos). What you've effectively got is a CROSS JOIN since you're not matching on any column between the two tables.
1. You can make sure the expression is evaluated only once rather than for each part of the AND clause 50000 >= m.Amount u.Amount
& 49700 <= m.Amount u.Amount
by using the BETWEEN
operator. I would just the standard 'from table1, table2' with WHERE
for clarity:
SELECT * FROM main AS m
INNER JOIN
util AS u
ON
m.Amount u.Amount BETWEEN 49700 AND 50000
;
2. You'll have to use other methods to reduce the number of rows that are checked. For example, when Amount
for either tables is more than 50,000 it can't be a match, so it gets exclude earlier in the verification and saves time by not computing m.Amount u.Amount
even once:
SELECT * FROM main AS m, util AS u
WHERE
m.Amount <= 50000
AND
u.Amount <= 50000
AND
m.Amount u.Amount BETWEEN 49700 AND 50000
;
If the amounts cannot be 0, then change the <= 50000
to < 50000
.
3. You can do other things like find the minimum Amount in each table and then make sure that the other table's Amount is less than 50000 - that first min amt
.
4. Using the "sum of 2 numbers" problem, you can do a one-time calculation of a minimum match Amt and max match Amt (add two new columns) for one of the tables and then use the BETWEEN check using the Amt from the other table. It still needs to do a cross join but the cpu-time to evaluate each match is reduced.
ALTER TABLE main ADD COLUMN min_match INT default 0;
ALTER TABLE main ADD COLUMN max_match INT default 0;
UPDATE main SET min_match = 49700 - Amount,
max_match = 50000 - Amount;
SELECT * FROM main AS m, util AS u
WHERE
u.Amount BETWEEN m.min_match AND m.max_match
;