How to write a subquery to optimise performance?-CodePudding

I have the following query that shows total sales for the selected dimensions. Table a does not contain product_name, this is why I've joined data with table b on product_id.

However, table b is too big, and I'd like to optimize it to scan fewer data.

SELECT a.date,
         a.hour,
         a.category_id,
         a.product_id,
         b.product_name,  
         sum(a.sales) AS sales
FROM  a
LEFT JOIN b
    ON a.product_id = b.product_id
WHERE date(a.date) >= date('2021-01-01') 
      AND date(B.date) = date('2021-01-01')
GROUP BY  1, 2, 3, 4, 5

What would be your suggestions here?

CodePudding user response：

There are two ways to decrease the amount of data Athena needs to scan for a given query:

Make sure the table is partitioned, and make sure the query makes use of the partitioning.
Store the data as Parquet or ORC.

These two can be used separately or in combination. Best results are achieved with the combination, but sometimes that's not convenient or possible.

Your question doesn't say if the tables are partitioned, but from the query it looks to me like they are not – unless date is a partition key.

date would be an excellent partition key, and if it is, your query is already pretty good. AND date(B.date) = date('2021-01-01') will limit the scan of the table b to a single partition. However, if date is not a partition key then what will happen is that Athena will have to scan the whole table to find rows that match the criteria.

This is where a file format like Parquet and ORC can help; these store the data for each column separately, and also store metadata like the min and max values for each column. If the files for the b table were sorted by date, or at least created over time in such a way that they were mostly sorted by date, Athena would be able to look at the metadata and skip files that can't contain the sought date because it's outside of the range given by the min/max values for that file. Athena would also only have to read the parts of the files for the b table that contained the date column, because that is the only one used in the query.

If you amend your question with a little more information about the table schemas and how the data is stored I can answer in more detail how to optimise. With the available information I can only give general guidance as above.

CodePudding user response：

Make sure b table has indexes on date and product_id, as Stu's comment suggests
Run an Explain Plan (from console) on your SQL to see whether optimizer filters b before joining to a. If it already does so, you're done - step 3 won't help
Replace your From a Left Join b with From a Left Join (Select product_id, product_name from b where date(date) = date('2021-01-01')) b