I have a query that I run hourly, and I process the dataset it returns. While processing this dataset I need to ignore some IDs; I currently do this with NOT IN in the query, and the number of IDs to ignore is around 50.
My question: I write the processed data to a text file in a certain pattern, so for better performance, should I apply this ignore operation directly in the query, or inside the foreach loop that writes the file?
The query returns around 5,000-7,000 rows from a table of about 10 million records, and I need to ignore around 50 IDs from the result set.
Let's say:
$blacklist_arr = array(1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100); // around 50 elements in the array
What I use now:
...QUERY...
resultSet.ID NOT IN (\'' . implode( "', '" , $blacklist_arr ) . '\')
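For reference, a self-contained sketch of how that clause gets built from the array (assuming the IDs are plain integers, so casting with intval() instead of quoting them also works):

$blacklist_arr = array(1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100); // ~50 IDs in practice

// Build the NOT IN (...) fragment once, then splice it into the query string.
// array_map('intval', ...) keeps the values numeric, so quoting is unnecessary.
$notIn = 'resultSet.ID NOT IN (' . implode(', ', array_map('intval', $blacklist_arr)) . ')';

$sql = '...QUERY... ' . $notIn . ' ...';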
What I'm planning to use:
foreach ($final_dataset as $final_data) {
    // ...
    if (!in_array($final_data, $blacklist_arr)) {
        // write to file
        // ...
    }
}
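A slightly fuller sketch of that plan (assuming each row exposes its ID; $final_data['ID'] below is only a placeholder for the real accessor): flipping the blacklist once turns each check into a hash lookup instead of a linear in_array() scan.

$blacklist_lookup = array_flip($blacklist_arr); // ID => position, built once before the loop

foreach ($final_dataset as $final_data) {
    if (!isset($blacklist_lookup[$final_data['ID']])) { // hypothetical ID accessor
        // write the row to the file
    }
}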
Edit: the query structure is below:
SELECT *
FROM
(
    (
        SELECT DISTINCT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM `a`
        INNER JOIN ( SELECT MAX( b.col4 ) AS X, b.col2 FROM `a` AS `b` GROUP BY b.col2 ORDER BY NULL ) sub ON ( sub.X = a.col4 )
        WHERE
            ( a.someColumn > NOW( ) - INTERVAL 2 HOUR )
            AND ( a.col3 < DATE_HERE )
    ) UNION
    (
        SELECT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM `a`
        WHERE
            ( a.someColumn >= DATE_SUB( NOW( ), INTERVAL 3 MONTH ) AND a.col4 IS NULL )
            AND ( a.col3 < DATE_HERE )
    )
) AS resultSet
WHERE
    resultSet.col1 NOT IN ( 1,10,20,30,40,50,60,70,80,90,100 )
ORDER BY
    resultSet.col3 ASC,
    resultSet.col2 ASC,
    resultSet.col4 ASC,
    resultSet.col1 DESC
CodePudding user response:
If your blacklist (t.col_black_elem) is obtained from another query, you could try using a LEFT JOIN and check for non-matching values:
SELECT a.col1,..., a.coln
FROM table1 a
LEFT JOIN (
    SELECT col_black_elem FROM tablex
) t ON t.col_black_elem = a.colx
WHERE t.col_black_elem IS NULL
and for your code:
SELECT *
FROM
(
    (
        SELECT DISTINCT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM `a`
        INNER JOIN ( SELECT MAX( b.col4 ) AS X, b.col2 FROM `a` AS `b` GROUP BY b.col2 ORDER BY NULL ) sub ON ( sub.X = a.col4 )
        WHERE
            ( a.someColumn > NOW( ) - INTERVAL 2 HOUR )
            AND ( a.col3 < DATE_HERE )
    ) UNION
    (
        SELECT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM `a`
        WHERE
            ( a.someColumn >= DATE_SUB( NOW( ), INTERVAL 3 MONTH ) AND a.col4 IS NULL )
            AND ( a.col3 < DATE_HERE )
    )
) AS resultSet
LEFT JOIN (
    SELECT col_black_elem FROM tablex
) t ON t.col_black_elem = resultSet.col1
WHERE t.col_black_elem IS NULL
ORDER BY
    resultSet.col3 ASC,
    resultSet.col2 ASC,
    resultSet.col4 ASC,
    resultSet.col1 DESC
Otherwise, if your t.col_black_elem is not obtained from another query, you could populate a temporary table, or build the derived table dynamically from several SELECT ... UNION statements, as sketched below.
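A rough sketch of both variants, reusing the placeholder names from above (table1, colx, col_black_elem); the temp table name blacklist_tmp and its INT column are assumptions:

-- Variant 1: populate a temporary table once, then LEFT JOIN against it
CREATE TEMPORARY TABLE blacklist_tmp ( col_black_elem INT PRIMARY KEY );
INSERT INTO blacklist_tmp ( col_black_elem )
VALUES (1), (10), (20), (30), (40), (50);        -- ...the ~50 blacklisted IDs

-- Variant 2: build the blacklist inline from several SELECT ... UNION
SELECT a.col1,..., a.coln
FROM table1 a
LEFT JOIN (
    SELECT 1 AS col_black_elem
    UNION SELECT 10
    UNION SELECT 20
    -- ...one SELECT per blacklisted ID
) t ON t.col_black_elem = a.colx
WHERE t.col_black_elem IS NULL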
CodePudding user response:
From a performance point of view, I recommend:
- Remove the DISTINCT in the 1st subquery. One sort is better than two.
- Filter your rows in the subqueries, not in the combined rowset; this decreases the number of rows the UNION has to sort.
SELECT *
FROM
(
    (
        SELECT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM `a`
        INNER JOIN ( SELECT MAX( b.col4 ) AS X, b.col2 FROM `a` AS `b` GROUP BY b.col2 ORDER BY NULL ) sub ON ( sub.X = a.col4 )
        WHERE
            ( a.someColumn > NOW( ) - INTERVAL 2 HOUR )
            AND ( a.col3 < DATE_HERE )
            AND a.col1 NOT IN ( 1,10,20,30,40,50,60,70,80,90,100 )
    ) UNION
    (
        SELECT a.col1, a.col2, a.col3, a.col4,..., a.coln
        FROM `a`
        WHERE
            ( a.someColumn >= DATE_SUB( NOW( ), INTERVAL 3 MONTH ) AND a.col4 IS NULL )
            AND ( a.col3 < DATE_HERE )
            AND a.col1 NOT IN ( 1,10,20,30,40,50,60,70,80,90,100 )
    )
) AS resultSet
ORDER BY
    resultSet.col3 ASC,
    resultSet.col2 ASC,
    resultSet.col4 ASC,
    resultSet.col1 DESC
CodePudding user response:
A variety of points:
- I have a "Rule of Thumb": "If a possible optimization is estimated to improve things by less than 10%, move on. That is, don't spend extra effort on it; instead, look for something better to work on." According to your numbers, the optimization decreases the result set by only about 1%.
- There is a standard programming rule: "KISS". Which is simpler to code -- the NOT IN or the PHP filtering? A variant: "Which approach is fewer keystrokes?" That comes from "A programmer's time is much more valuable than computer time."
- Moving the NOT IN into each subquery may speed it up slightly, because it would (slightly) decrease the intermediate tables involved in the query. (However, this fails the 10% and KISS rules.) On the other hand, it could eliminate the outermost SELECT. Note that this works: (SELECT ...) UNION (SELECT ...) ORDER BY ...
- Potential bug: the innermost SELECT may be picking a date & time from one of the excluded col1's.
- UNION defaults to UNION DISTINCT, which is slower than UNION ALL. Consider this as a bigger optimization.
- ON ( sub.X = a.col4 ) probably needs to mention col2 as well.
- Is DATE_HERE somehow related to NOW()? Perhaps you need TIMESTAMP instead of DATETIME, or vice versa?
- I suspect that the DISTINCT is not needed; in any case, it is redundant with the UNION.
- Consider whether the "blacklist" should be a table rather than a config file. As a table, a NOT EXISTS(..) or LEFT JOIN .. IS NULL check would need to be added to the query. This would be slower than what you have now, but might be "cleaner".
- WHERE 1=1 is an artifact of lazy programming; it is not an optimization, and the Optimizer will simply toss it.
- Often, better indexes provide the most improvement. Maybe the following would help (see the ALTER TABLE sketch after this list). Note: separate, single-column indexes are not as good. Also, when adding INDEX(a,b), drop INDEX(a).

a (as b):  INDEX(col2, col4)              -- this order
a:         INDEX(col4, col3, someColumn)  -- col4 first
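A hedged sketch of how those composite indexes could be added; the index names below, and the assumption that the table is literally named `a`, are placeholders to adapt to the real schema:

ALTER TABLE `a`
    ADD INDEX idx_col2_col4 (col2, col4),                      -- for the GROUP BY col2 / MAX(col4) derived table
    ADD INDEX idx_col4_col3_somecol (col4, col3, someColumn);  -- col4 first, for the join and the WHERE filters

-- If single-column indexes on col2 or col4 already exist, drop them once the
-- composite indexes are in place, e.g.: ALTER TABLE `a` DROP INDEX some_old_index;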