RedShift: need helps for optimizations subquery WHERE IN (SELECT *)-CodePudding

I have next query to RedShift:

SELECT contributor_user_id,
            device_id_source,
            device_os,
            device_model,
            device_design,
            device_serial,
            device_carrier,
            device_os_version,
            device_manufacturer,
            device_current_app_build,
            device_current_app_version
    FROM all_values
    WHERE all_values.device_id_source :: VARCHAR NOT IN (SELECT device_id_source FROM table WHERE device_id_source IS NOT NULL)
            AND all_values.device_os :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_os IS NOT NULL)
            AND all_values.device_model :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_model IS NOT NULL)
            AND all_values.device_design :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_design IS NOT NULL)
            AND all_values.device_serial :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_serial IS NOT NULL)
            AND all_values.device_carrier :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_carrier IS NOT NULL)
            AND all_values.device_os_version :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_os_version IS NOT NULL)
            AND all_values.device_manufacturer :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_manufacturer IS NOT NULL)
            AND all_values.device_current_app_build :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_current_app_build IS NOT NULL)
            AND all_values.device_current_app_version :: VARCHAR NOT IN (SELECT device_os FROM table WHERE device_current_app_version IS NOT NULL)
  )

As I know, WHERE IN (SELECT) works slowly than "JOIN" and there are many identical requests in subquery and I think that it's not good. But I'm newbie in SQL and I don't know how I can rewrite the code above with JOIN. Could you help me with knowledge?

Thnx!

CodePudding user response：

At the least, make the subqueries select distinct device_os.

CodePudding user response：

The "WHERE NOT IN (SELECT ..." can be very expensive as the list can be very long and take a lot of comparisons to determine if the value is not in the list. A somewhat less expensive way to do this is with "WHERE NOT EXISTS (SELECT ..." which is more of a JOIN structure internally but still may not be fast enough for your case.

Note these are just guesses based on your SQL and past experience. Given how simple the rest of the query looks it is a good bet. You may still want to look at the EXPLAIN plan for the query and see where the cost is increasing the most.

The best answer is to rethink this query and remove the negative logic. If I'm reading this right you want to find all the rows in contributor_user_id where the corresponding column value in "table" for ANY of the listed columns are NULL. To do this you are performing a subtraction algorithm using "WHERE NOT IN". I don't know your data model so I'm not sure if this logic is not correct.

The difficulty here is that I don't know your data and data-model. The query will flag any row that any column being NULL in "table" but only if there are no repeats of device_os in "table". For example one row in "table" with NULL for device_model but is not NULL for device_design in another row and has the same device_os value will not be flagged. It all depends on what the legal patterns are in your data. Are multiple rows with the same device_os legal in your data?

A better way is to make this into an additive algorithm which may greatly reduce the work needed to get the desired answer. Not understanding the data and the desired logic it is impossible for me to propose a solution. Example data and expected results would help in making a different solution proposal.