I'm using Spark SQL on Databricks but this may apply to other SQL dialects as well.
I wonder why PIVOT only works when I select all columns
SELECT *
FROM events
PIVOT (
COUNT(*)
FOR event_name IN ('cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty')
)
BUT stops working when I specify certain columns
SELECT user_id, event_name
FROM events
PIVOT (
COUNT(*)
FOR event_name IN ('cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty')
)
The error message is quite verbose, here is the first line saying 'event_name' doesn't exist.
Error in SQL statement: AnalysisException: Column 'event_name' does not exist. Did you mean one of the following? [finalize, foam, guest, original, premium ...
CodePudding user response:
The problem that your query doesn't have event_name
column in its output.
The Pivot
clause will create a separate column for each value of event_name
that you filtered using this expression:
FOR event_name IN ('cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty')
If you select one (or more) of those columns, it should work.
For example, try:
SELECT user_id, cart, pillows
FROM events
PIVOT (
COUNT(*)
FOR event_name IN ('cart', 'pillows', 'login', 'main', 'careers', 'guest', 'faq', 'down', 'warranty')
)
You can check here the official docs with a few examples: https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-pivot.html#examples