Home > Enterprise >  How to rewrite UNION with LEFT JOIN more efficiently
How to rewrite UNION with LEFT JOIN more efficiently

Time:06-29

I have two tables...one that registers users and one that checks in users. A user will always have a single entry in the register table but a user may have 0 or multiple entries in the checkin table. For a raffle selector, I wrote a query that is picking 1 entry from the register table and then 1 entry from the checkin table - each sub query picks a random entry so long as that userID does not exist in a 3rd table that stores the raffle winners. After the two entries are returned than it randomly selects one of the two returned entries as the winnner.

However, I believe there should be a more efficient way of writing this so its ONLY picking an entry once....not picking two entries and then picking one of the two.

It took me quite a while to figure out how to correctly write the below query as I am not proficient in mysql at all. The query works and seems to work efficiently, but I believe there should be a better way of writing it that also consolidates the amount of query code.

Hoping someone here can help or advise.

Table note: clubusers/clubHistory have multiple overlapping columns but the tables are not the same:

register = clubUsers 
checkins = clubHistory
winners = clubRaffleWinners

SELECT * FROM (  
  (SELECT ch.user_ID,ch.clID FROM clubHistory AS ch 
    LEFT OUTER JOIN clubRaffleWinners AS cr1 ON 
      ch.user_ID=cr1.user_ID  
      AND cr1.cID=1157 
      AND cr1.rafID=18  
      AND cr1.crID=1001   
      AND cr1.ceID=1167  
      AND cr1.chDate1='2022-06-04' 
    WHERE  
      ch.cID=1157  
      AND ch.crID=1001   
      AND ch.ceID=1167 
      AND ch.chDate='2022-06-04' 
      AND cr1.user_ID IS NULL 
    GROUP BY ch.user_ID ORDER BY RAND() LIMIT 1 
  )
UNION
  (SELECT cu.user_ID,cu.clID FROM clubUsers AS cu 
    LEFT OUTER JOIN clubRaffleWinners AS cr2 ON 
      cu.user_ID=cr2.user_ID  
      AND cr2.cID=1157 
      AND cr2.rafID=18  
      AND cr2.crID=1001   
      AND cr2.ceID=1167  
      AND cr2.chDate1='2022-06-04' 
    WHERE  
      cu.cID=1157  
      AND cu.crID=1001   
      AND cu.ceID=1167 
      AND cu.calDate<='2022-06-04' 
      AND cr2.user_ID IS NULL 
    GROUP BY cu.user_ID ORDER BY RAND() LIMIT 1   
  )
) AS foo order by RAND() LIMIT 1 ;

UPDATE:

As @JettoMartinez points out below, my current query could in fact randomly return the same user from each table so the final returned entry would just be the same user. I didn't realize this in my struggles just to get the above query to work. Thus my original OP asking for a more optimized query simply selecting a single random entry from both tables (where that user is not already in the winners table) is applicable for yet another reason.

CodePudding user response:

There are two ways I can think of (Do note that since I don't fully understand the tables, I'm not using all the conditions you used in your JOIN statements, meaning it might need more work):

Using a exclusive subquery:

SELECT 
    cu.user_ID,
    cu.clID,
    ch.cID
FROM 
    clubUsers cu
    LEFT JOIN clubHistory ch ON ch.user_ID = cu.user_ID
WHERE user_ID NOT IN (
    SELECT 
        user_ID
    FROM
        clubRaffleWinners
    WHERE
        -- other conditions 
    )
ORDER BY RAND() LIMIT 1;

Using a LEFT "OUTER" JOIN, as you asked for:

SELECT 
    cu.user_ID,
    cu.clID,
    ch.cID -- Or any relevant field from clubHistory, really
FROM 
    clubUsers cu
    LEFT JOIN clubHistory ch ON ch.user_ID = cu.user_ID
    LEFT JOIN clubRaffleWinners cr ON cr.user_ID = cu.user_ID 
        AND ... -- other conditions to ensure uniqueness
        AND ... -- that could also be in the WHERE part
WHERE 
    cr.user_ID IS NULL -- this will filter out the INNER part of the JOIN
ORDER BY RAND() LIMIT 1;

I don't have a dataset to properly test this queries, so please take them as a concept. I also didn't queried in clubHistory since I honestly don't see the point of doing so. Interpolating clubRaggleWinners to clubUsers seems enough for me.

EDIT

Since the user_ID in clubHistory is relevant to the raffle, I added a LEFT JOIN to it and added a field from said table in the SELECT statement, so that the user_id repeats once per entry in clubHistory plus the row of clubUsers, meaning that every user has 1 number of entries / number of users number of entries - number of winners chances to win.

This logic can be applied to the first query with a subquery too, and if the added field needs to be out, the query could be wrapped in a CTE or a subquery.

CodePudding user response:

From what you are describing, and I want to make sure I understand.

Every registered person is qualified 1 entry.

However, each time they have checked in, they get 1 entry for each time they checked in. So, for someone registered and has NEVER checked-in, they get 1 entry. But if someone registered, and checked in 3 times, they would get a total of JUST the 3 times they checked in, vs 4 just for being registered.

Regardless of who is POSSIBLE, you want to EXCLUDE all people who have already been a winner in the raffle.

You SHOULD be able to get results from this below. Since the columns appear to be the same filtering on the cID, crID, ceID and Date, I have the primary FROM based on the registered clubUsers.

From that, a left-join to the clubHistory will either allow that person's ID to be returned once if only registered, OR multiple times based on the times checked in such as the example.

From the given user, I am also directly left-joining to the raffle winning history on the same criteria. If its the same criteria to the club history join, and the same criteria to the raffle (with exception of rafID = 18), appearing to indicate a specific raffle being drawn for, If the person is found, or not, the final WHERE accounts to exclude if its the single entry, or multiple entries via the IS NULL test.

The query will return all entries single or multiple, that have not already won in the order by RAND() qualifier, and apply a single LIMIT 1 to get the final winner. I dont know why you needed what appeared to be the clubhouse ID when you only really care about WHO won, without any regard to being a clubhouse history entry or not.

SELECT 
        cu.user_ID
    FROM 
        clubUsers AS cu
            LEFT JOIN clubHistory ch 
                on cu.user_ID = ch.user_ID  
                AND cu.cID = ch.cID
                AND cu.crID = ch.crID
                AND cu.ceID = ch.ceID
                AND ch.chDate = '2022-06-04' 
            LEFT JOIN clubRaffleWinners AS crw 
                ON cu.user_ID = crw.user_ID  
                AND cu.cID = crw.cID
                AND cu.crID = crw.crID
                AND cu.ceID = crw.ceID
                AND crw.chDate1 = '2022-06-04'
                AND crw.rafID = 18  
    WHERE  
            cu.cID = 1157  
        AND cu.crID = 1001   
        AND cu.ceID = 1167 
        AND cu.calDate <= '2022-06-04' 
        AND crw.user_id IS NULL
    order by 
        RAND() 
    LIMIT 1

For performance purposes, I would ensure the following indexes

table               index 
clubUsers           ( cid, crID, ceID, calDate, user_id )
clubHistory         ( user_id, cID, crID, ceID, chDate )
clubRaffleWinners   ( user_id, cID, crID, ceID, chDate1, rafID )
  • Related