Home > database >  SQL aggregate functions on clickstream data
SQL aggregate functions on clickstream data

Time:12-05

I have a set of clickstream data (ie user id, start and end date/time, active use duration, total duration for each session).

I need to retrieve unique daily data for a given month, sorted by user_id, date, then webOrMobile.

What I have so far, using January 2019 as an example:

SELECT DISTINCT id, CONVERT(DATE, START_DTTM) AS date, WEB_VS_MOBILE_DISPLAY AS webOrMobile, ACTIVE_DURATION_SECONDS 
FROM ClickStream
WHERE START_DTTM LIKE '2019-01-%'
ORDER BY id, START_DTTM, WEB_VS_MOBILE_DISPLAY

which results in:

    id          date        webOrMobile  ACTIVE_DURATION_SECONDS
    
    Z000013917  2019-01-07  Web        245
    Z000013917  2019-01-10  Web        128
    Z000013917  2019-01-29  Web        261
    Z000013917  2019-01-31  Web        1246
    Z0000140154 2019-01-18  Mobile     212
    Z0000140154 2019-01-22  Mobile     18
    Z0000140154 2019-01-23  Mobile     201
    Z0000158605 2019-01-11  Web        67
    Z0000158605 2019-01-11  Web        587
    Z0000158605 2019-01-11  Web        681
    Z0000158605 2019-01-11  Web        1123
    Z0000158605 2019-01-13  Web        2896
    Z0000158605 2019-01-14  Web        20

My issue is the rows with date 2019-01-11, which are identical aside from the duration. How can I combine these rows into one row that has the sum of the individual durations? So something like:

     id          date        webOrMobile  ACTIVE_DURATION_SECONDS
        
        Z000013917  2019-01-07  Web        245
        Z000013917  2019-01-10  Web        128
        Z000013917  2019-01-29  Web        261
        Z000013917  2019-01-31  Web        1246
        Z0000140154 2019-01-18  Mobile     212
        Z0000140154 2019-01-22  Mobile     18
        Z0000140154 2019-01-23  Mobile     201
        Z0000158605 2019-01-11  Web        2458
        Z0000158605 2019-01-13  Web        2896
        Z0000158605 2019-01-14  Web        20

where every row now has a unique id/date/webOrMobile.

Edit: I also tried

SELECT pat_did, CONVERT(DATE, START_DTTM), WEB_VS_MOBILE_DISPLAY, SUM(MYC_ACTIVE_DURATION_SECONDS) FROM ClickStream
WHERE START_DTTM LIKE '2019-01-%'
GROUP BY pat_did, START_DTTM, WEB_VS_MOBILE_DISPLAY

which results in:

    id          date        webOrMobile  ACTIVE_DURATION_SECONDS

    Z000013917  2019-01-07  Web          4900
    Z000013917  2019-01-10  Web          1536
    Z000013917  2019-01-29  Web          3915
    Z000013917  2019-01-31  Web          11214
    Z0000140154 2019-01-18  Mobile       3392
    Z0000140154 2019-01-22  Mobile       54
    Z0000140154 2019-01-23  Mobile       1407
    Z0000158605 2019-01-11  Web          24706
    Z0000158605 2019-01-11  Web          5283
    Z0000158605 2019-01-11  Web          12258
    Z0000158605 2019-01-11  Web          469
    Z0000158605 2019-01-13  Web          112944
    Z0000158605 2019-01-14  Web          5746

The rows did not combine but the duration values are now too high, for example 112944 which equals 31 hours which is impossible.

CodePudding user response:

Any column not appearing in your GROUP BY must only appear in an aggregate function in the SELECT. And you don't use both GROUP BY and DISTINCT. In fact you rarely if ever should use DISTINCT.

So something like

SELECT id, 
       CONVERT(DATE, START_DTTM) AS date, 
       WEB_VS_MOBILE_DISPLAY AS webOrMobile, 
       SUM(ACTIVE_DURATION_SECONDS) ACTIVE_DURATION_SECONDS  
FROM ClickStream
WHERE START_DTTM LIKE '2019-01-%'
GROUP BY ID, START_DTTM, WEB_VS_MOBILE_DISPLAY 
ORDER BY id, START_DTTM, WEB_VS_MOBILE_DISPLAY

CodePudding user response:

I get the impression that you're having issues with duplicate rows and date conversions:

with data as (
    SELECT DISTINCT
        ID, START_DTTM, cast(START_DTTM as date) as "date",
        WEB_VS_MOBILE_DISPLAY as webOrMobile,
        ACTIVE_DURATION_SECONDS as duration
   FROM ClickStream
   WHERE START_DTTM LIKE '2019-01-%'
)
select ID as id, "date", webOrMobile, sum(duration) as duration
from data
group by id, "date", webOrMobile
order by id, "date", webOrMobile;

If there were a need to do it without CTE's/derived tables etc. then you could use a bit of a hack along these lines:

select ID, cast(START_DTTM as date), WEB_VS_MOBILE_DISPLAY as webOrMobile,
    sum(distinct
        datediff(second, cast(START_DTTM as date), START_DTTM) *
          cast(100000000 as bigint)   ACTIVE_DURATION_SECONDS
    ) % 100000000 as duration
from ClickStream
group by ID, cast(START_DTTM as date), WEB_VS_MOBILE_DISPLAY;

You essentially have to pack the date info as a numeric value (that will sum) into high order digits with the duration so that each combo can remain unique, then unpacking by stripping off the lower order digits afterward.

https://dbfiddle.uk/?rdbms=sqlserver_2014&fiddle=23b55e85d42c78e16bab406135fa3291

  • Related