The problem is, I need to calculate average number of pages/hits after reaching the pax page (including pax hit).
The database is:
CREATE TABLE search (
SESSION_ID INTEGER,
HIT_NUMBER INTEGER,
PAGE VARCHAR(24),
MEDIUM_T VARCHAR(24)
);
INSERT INTO search
(SESSION_ID, HIT_NUMBER, PAGE, MEDIUM_T)
VALUES
('123', '1', 'home', 'direct'),
('123', '2', 'flights_home', 'direct'),
('123', '3', 'results', 'direct'),
('456', '1', 'pax', 'metasearch'),
('789', '1', 'home', 'partners'),
('789', '2', 'flights_home', 'partners'),
('789', '3', 'results', 'partners'),
('789', '4', 'home', 'partners'),
('146', '1', 'results', 'SEM'),
('146', '2', 'pax', 'SEM'),
('146', '3', 'payment', 'SEM'),
('146', '4', 'confirmation', 'SEM');
And my approach is:
SELECT s1.SESSION_ID, COUNT(*) as sCOUNT
FROM search s1
WHERE PAGE = 'pax'
GROUP BY s1.SESSION_ID
UNION ALL
SELECT 'Total AVG', AVG(a.sCOUNT) FROM
(SELECT COUNT(*) as sCOUNT
FROM search s2
GROUP BY s2.SESSION_ID) a
Obviously the 3r line is wrong, my code misses the part in which after 'pax' is shown starts counting and I don't have any clue for that.
Thank you in advanced :)
CodePudding user response:
My approach uses WITH CTE (common-table-expression) to pre-declare what the underlying query basis is, then querying and averaging from that.
First one premise that was not explicitly covered in your sample data. What happens IF a user bounces back and forth between multiple pages and hits the PAX page more than once. You now have multiple pax page hits. I would assume you want the FIRST instance to such pax page and that is inclusive of all page hits. This solution should help account for it.
Lets look at the inner-most from clause with final alias "pxHits".
I am grouping by session ID and grabbing the FIRST INSTANCE of a pax page hit (or null if no such pax page encountered), but ALSO grabbing the HIGHEST hit number per session. The HAVING clause will make sure that it only returns those sessions that HAD a PAX page are returned leaving all other sessions excluded from the results.
This would result with two entries passed up to the outer select which includes the 1 lastHitNumber - firstPaxHit
calculation. The reason for the 1 is because you at least HIT the page once. But, in the scenario of your session 456 where the first and last hit WERE the first page, you need that since the lastHitNumber - firstPaxHit would net zero. This would be true if a person had 25 page hits and got to the pax page on page 26. Your result would still be 1 via 1 26 - 26 = 1 total page including the pax page, not the 25 prior to.
Your other qualifying session would be 146. The first pax hit was 2 but they proceeded to a highest page hit of 4. so 1 4 - 2 = 3 total pages.
So now on to the final. Since you can see the HOW things are prepared, we can now get the averages. You can't mix/auto convert different data types (session_id vs the fixed message of your 'Total Avg'. They must be the same type. So my query is converting the session_id to character to match. I happen to be getting the AVERAGE query first as a simple select from the WITH CTE alias, and THEN getting the actual session_id and counts.
with PaxSummary as
(
select
pxHits.*,
1 lastHitNumber - firstPaxHit HitsIncludingPax
from
( select
session_id,
min( case when page = 'pax'
then hit_number
else null end ) firstPaxHit,
max( hit_number ) lastHitNumber
from
search
group by
session_id
having
min( case when page = 'pax'
then hit_number
else null end ) > 0 ) pxHits
)
select
'Avg Pax Pages' FinalMsg,
avg( ps2.HitsIncludingPax ) HitsIncludingPax
from
PaxSummary ps2
union all
select
cast( ps1.session_id as varchar) FinalMsg,
ps1.HitsIncludingPax
from
PaxSummary ps1
CodePudding user response:
Finding all pax
pages and the ones after it could be done with exists
. Rest is straight forward:
SELECT AVG(hits)
FROM (
SELECT session_id, COUNT(*) AS hits
FROM search AS s1
WHERE page = 'pax' OR EXISTS (
SELECT *
FROM search AS s2
WHERE s2.session_id = s1.session_id
AND s2.hit_number < s1.hit_number
AND s2.page = 'pax'
)
GROUP BY session_id
) AS x
If using MySQL 8 then window functions provide a simpler solution:
WITH cte1 AS (
SELECT session_id, MAX(CASE WHEN page = 'pax' THEN 1 END) OVER (
PARTITION BY session_id
ORDER BY hit_number
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS countme
FROM search
), cte2 as (
SELECT COUNT(*) AS hits
FROM cte1
WHERE countme IS NOT NULL
GROUP BY session_id
)
SELECT AVG(hits)
FROM cte2
CodePudding user response:
As an alternative to the EXISTS (correlated subquery)
pattern, we can write a query that gets us the hit_number of the first 'pax' hit for each session_id, and use that as an inline view.
Something along these lines:
-- count hits on or after the first 'pax' of each session_id that has a 'pax' hit
SELECT s.session_id
, COUNT(*) AS cnt_hits_after_pax
FROM ( -- get the first 'pax' hit for each session_id
-- exclude session_id that do not have a 'pax' hit
SELECT px.session_id AS pax_session_id
, MIN(px.hit_number) AS pax_hit_number
FROM search px
WHERE px.page = 'pax'
) p
-- all the hits for session_id on or after the first 'pax' hit
JOIN search s
ON s.session_id = p.session_id
AND s.hit_number >= p.hit_number
GROUP BY s.session_id
to get an average from that query, we can wrap it parens and turn it into an inline view
SELECT AVG(c.cnt_hits_after_pax) AS avg_cnt_hits_after_pax
FROM (
-- query above goes here
) c