I have to tables in Redshift, articles and clicks:
articles
| articleID | authorID |
| 100 | 2 |
| 101 | 2 |
| 102 | 6 |
| 103 | 7 |
| 104 | 2 |
clicks
|articleID | category |
|100 | "mail" |
|100 | "mail" |
|100 | "rss" |
|101 | "rss" |
|101 | "mail" |
|101 | "app" |
|101 | "app" |
Now I want to aggregate for each articleID the number of clicks per category as well as the total amount of clicks with this query
SELECT clicks.articleID,
ANY_VALUE(article.authorID) AS authorID,
COUNT(CASE WHEN clicks.category ='rss' THEN 1 END) as rss,
COUNT(CASE WHEN clicks.category ='mail' THEN 1 END) as mail,
COUNT(CASE WHEN clicks.category ='app' THEN 1 END) as app,
COUNT(clicks.articleID) AS total
FROM clicks
INNER JOIN articles
ON clicks.articleID = articles.articleID
GROUP BY clicks.articleID
It results in
| articleID | authorID | rss | mail | app | total |
| 100 | 2 | 1 | 2 | 0 | 3 |
| 101 | 2 | 1 | 1 | 2 | 4 |
which is fine. Now I want to set the category colums dynamically. So when an additional category in "clicks" appears, e.g. "facebook", I do not have to alter the query for getting
| articleID | authorID | rss | mail | app | facebook | total |
| 100 | 2 | 1 | 2 | 0 | 1 | 4 |
| 101 | 2 | 1 | 1 | 2 | 0 | 4 |
I found this one
WITH categories AS (SELECT category from clicks GROUP BY category)
as a prefix to the query But how do I access categories in the main query, something like
WITH categories AS (SELECT category from clicks GROUP BY category)
SELECT clicks.articleID,
ANY_VALUE(article.authorID) AS authorID,
--pseudocode
FOREACH(categories AS category)
COUNT(CASE WHEN clicks.category =$category THEN 1 END) as $category
ENDFOREACH
--//pseudocode
COUNT(clicks.articleID) AS total
FROM clicks
INNER JOIN articles
ON clicks.articleID = articles.articleID
GROUP BY clicks.articleID
?
Of course I could first query all categories and then run a query for each category, but I hope that there is a more elegant way
CodePudding user response:
Two things - first there is now PIVOT functionality in Redshift so you can don't need all those case statements. So it would look like:
select *
from (
select category, c.articleid, authorid
from clicks c
join articles a on c.articleid = a.articleid)
PIVOT ( count(category) for category in ('rss', 'mail', 'app') );
The second thing is that you want dynamic SQL. You see the list of categories is "baked" into the SQL either as the PIVOT list or the GROUP BY list or ... this isn't something that can be made dynamic inside the database. So you will need to have something gather the list of categories and then issue the SQL for the final query. Lambda can be this something but so can many other tools. Like so:
select '\'' || listagg(distinct category, '\',\'') || '\'' from clicks;
then something takes this list and creates the query from a template such as:
select *
from (
select category, c.articleid, authorid
from clicks c
join articles a on c.articleid = a.articleid)
PIVOT ( count(category) for category in (<INSERT RESULT FROM ABOVE HERE>) );
Sorry this is the best you can do. Queries are compiled BEFORE any data is seen.