Use DISTINCT within a CASE statement

Time:05-19

I have a query that uses multiple left joins, and I am trying to get a SUM of values from one of the joined columns.

SELECT
SUM( case when session.usersessionrun =1 then 1 else 0 end) new_unique_session_user_count
FROM session
LEFT JOIN appuser ON appuser.appid = '6279df3bd2d3352aed591583'
            AND appuser.userid = session.userid
LEFT JOIN userdevice ON userdevice.appid = '6279df3bd2d3352aed591583'
            AND userdevice.userid = appuser.userid
WHERE session.appid = '6279df3bd2d3352aed591583'
 AND (session.uploadedon BETWEEN '2022-04-18 08:31:26' AND '2022-05-18 08:31:26')

But this obviously counts session.usersessionrun = 1 more than once per user, since each session row is repeated in the joined result set.

The logic here is to mark a user as new if usersessionrun for that record is 1.

I grouped by userid and usersessionrun, and the result shows that the records are repeated.

userid  usersessionrun  count
628212  1               2
627a01  1               4

So what I was trying to do was something like

SUM(CASE distinct(session.userid) AND WHEN session.usersessionrun = 1 THEN 1 ELSE 0 END) new_unique_session_user_count

i.e. session.usersessionrun = 1 should be counted only once for each unique user.

CodePudding user response:

As you have discovered, JOIN operations can generate combinatorial explosions of data.
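To make the effect concrete, here is a minimal sketch using two hypothetical miniature tables (the column types and the deviceid column are assumptions, not your real schema). One session row that matches two device rows comes out of the join twice, so the SUM counts the same first session twice, much like the count of 2 for user 628212 in the question.

```sql
-- Hypothetical miniature tables for illustration only, not the real schema.
CREATE TABLE session    (userid VARCHAR(24), usersessionrun INT);
CREATE TABLE userdevice (userid VARCHAR(24), deviceid VARCHAR(24));

-- One first session for the user, but two devices.
INSERT INTO session    VALUES ('628212', 1);
INSERT INTO userdevice VALUES ('628212', 'phone');
INSERT INTO userdevice VALUES ('628212', 'tablet');

-- The single session row is paired with each of the two device rows,
-- so the join emits two rows and the SUM returns 2 instead of 1.
SELECT SUM(CASE WHEN session.usersessionrun = 1 THEN 1 ELSE 0 END) AS inflated_count
  FROM session
  LEFT JOIN userdevice ON userdevice.userid = session.userid;
```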

You need a subquery to count your sessions by userid. Then you can treat the subquery as a virtual table and JOIN it to the other tables to get the information you need in your result set.

The subquery (nothing in my answer is debugged):

            SELECT COUNT(*) new_unique_session_user_count,
                   session.userid
              FROM session
             WHERE session.appid = '6279df3bd2d3352aed591583'
               AND session.uploadedon BETWEEN '2022-04-18 08:31:26'
                                          AND '2022-05-18 08:31:26'
               AND session.usersessionrun = 1
             GROUP BY session.userid

This subquery summarizes your session table and has one row per userid. The trick to avoiding JOIN-created combinatorial explosions is using subqueries that generate results with only one row per data item mentioned in a JOIN's ON-clause.

Then you join it with the other tables like this:

SELECT summary.new_unique_session_user_count
  FROM (
            SELECT COUNT(*) new_unique_session_user_count,
                   session.userid
              FROM session
             WHERE session.appid = '6279df3bd2d3352aed591583'
               AND session.uploadedon BETWEEN '2022-04-18 08:31:26'
                                          AND '2022-05-18 08:31:26'
               AND session.usersessionrun = 1
             GROUP BY session.userid
       ) summary
  JOIN appuser ON appuser.appid = '6279df3bd2d3352aed591583'
            AND appuser.userid = summary.userid
  JOIN userdevice ON userdevice.appid = '6279df3bd2d3352aed591583'
            AND userdevice.userid = appuser.userid

There may be better ways to structure this query, but it's hard to guess at them without more information about your table definitions and business rules.
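One possible alternative, closer to the DISTINCT-inside-CASE idea in the question, is to put the CASE inside COUNT(DISTINCT ...) instead of SUM. This is only a sketch and, like the rest of this answer, it has not been run against your tables. It relies on two standard behaviours: a CASE with no ELSE branch yields NULL, and COUNT(DISTINCT ...) ignores NULLs. It assumes a single overall number of new users is what you want.

```sql
-- Sketch only: not run against the real schema.
-- Each userid with usersessionrun = 1 in the window is counted once,
-- no matter how many joined rows repeat that session.
SELECT COUNT(DISTINCT CASE WHEN session.usersessionrun = 1
                           THEN session.userid
                      END) AS new_unique_session_user_count
  FROM session
  LEFT JOIN appuser ON appuser.appid = '6279df3bd2d3352aed591583'
            AND appuser.userid = session.userid
  LEFT JOIN userdevice ON userdevice.appid = '6279df3bd2d3352aed591583'
            AND userdevice.userid = appuser.userid
 WHERE session.appid = '6279df3bd2d3352aed591583'
   AND session.uploadedon BETWEEN '2022-04-18 08:31:26'
                              AND '2022-05-18 08:31:26';
```

Note that a user with several usersessionrun = 1 rows in the window is still counted once. If the joined tables contribute nothing else to the result, the two LEFT JOINs can probably be dropped altogether, which removes the duplication at its source.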
