Given the following table:
CREATE TABLE main
(
`job_id` UUID,
`request_time` DateTime,
`host_id` UInt8,
`status_code` LowCardinality(String),
)
ENGINE = MergeTree
ORDER BY request_time
SETTINGS index_granularity = 8192
I'm trying to fetch all statuses by host, with their relative percentage. In order to do that, I need to count results grouped by host and status, and divide each counted field by its total host fields count.
For example, this query would have worked in MySQL:
SELECT
main.host_id,
main.status_code,
COUNT() AS status_count,
COUNT() / sub.host_total * 100 AS percent
FROM
main
INNER JOIN (
SELECT host_id, COUNT() AS host_total
FROM main
GROUP BY host_id
) AS sub ON (sub.host_id = main.host_id)
GROUP BY
main.host_id,
main.status_code
But ClickHouse throws:
DB::Exception: Unknown identifier: host_total; there are columns: host_id, status_code, count(): While processing host_id, status_code, count() AS status_count, (count() / host_total) * 100 AS percent. (UNKNOWN_IDENTIFIER)
Probably because correlated (dependent) subqueries are not supported.
I was suggested to use CTE, so I tried this:
WITH sub AS (
SELECT host_id, COUNT() AS host_total
FROM main
GROUP BY host_id
)
SELECT
main.host_id,
main.status_code,
COUNT() AS status_count,
COUNT() / (SELECT host_total FROM sub WHERE sub.host_id = main.host_id) * 100 AS percent
FROM
main
GROUP BY
main.host_id,
main.status_code
But still no luck:
DB::Exception: Missing columns: 'main.host_id' while processing query: 'SELECT host_total FROM sub WHERE host_id = main.host_id', required columns: 'host_total' 'host_id' 'main.host_id' 'host_total' 'host_id' 'main.host_id': While processing (SELECT host_total FROM sub WHERE sub.host_id = main.host_id) AS _subquery20: While processing count() / ((SELECT host_total FROM sub WHERE sub.host_id = main.host_id) AS _subquery20): While processing (count() / ((SELECT host_total FROM sub WHERE sub.host_id = main.host_id) AS _subquery20)) * 100 AS percent. (UNKNOWN_IDENTIFIER)
CodePudding user response:
CH throws an incorrect error. https://github.com/ClickHouse/ClickHouse/issues/4567
host_total should be in groupby section or under aggregate function
insert into main(request_time, host_id,status_code) values ( now(), 1, 200);
insert into main(request_time, host_id,status_code) values ( now(), 1, 500);
SELECT
main.host_id,
main.status_code,
COUNT() AS status_count,
COUNT() / any(sub.host_total) * 100 AS percent
FROM
main
INNER JOIN (
SELECT host_id, COUNT() AS host_total
FROM main
GROUP BY host_id
) AS sub ON (sub.host_id = main.host_id)
GROUP BY
main.host_id,
main.status_code
┌─host_id─┬─status_code─┬─status_count─┬─percent─┐
│ 1 │ 200 │ 1 │ 50 │
│ 1 │ 500 │ 1 │ 50 │
└─────────┴─────────────┴──────────────┴─────────┘
But there is better ways to solve it:
window functions
select host_id, status_code, status_count, status_count / host_total * 100 AS percent
from (
SELECT
host_id,
status_code,
COUNT() AS status_count,
COUNT() over (partition by host_id) host_total
from main
GROUP BY
host_id,
status_code
);
┌─host_id─┬─status_code─┬─status_count─┬─percent─┐
│ 1 │ 200 │ 1 │ 50 │
│ 1 │ 500 │ 1 │ 50 │
└─────────┴─────────────┴──────────────┴─────────┘
arrays
select host_id, status_code, status_count, status_count / host_total * 100 AS percent
from (
SELECT
host_id,
sumMap([status_code], [1]) ga,
count() host_total
from main
GROUP BY host_id
) array join ga.1 as status_code, ga.2 as status_count
┌─host_id─┬─status_code─┬─status_count─┬─percent─┐
│ 1 │ 200 │ 1 │ 50 │
│ 1 │ 500 │ 1 │ 50 │
└─────────┴─────────────┴──────────────┴─────────┘