Home > front end >  Querying for type breakdown based on COUNT results
Querying for type breakdown based on COUNT results

Time:10-02

I have the following table <state_table> that tracks entries per minute of an id and its state:

    minute id type
    ------ -- ----
    1      A  solid
    1      A  solid
    1      A  solid
    1      A  liquid
    1      B  solid
    1      B  solid
    1      B  liquid
    .... 1000  rows ...
    1      ZZX  liquid
    1      ZZZ  liquid
    2      A  solid
    2      A  solid
    2      A  liquid

With the following query, I can get the top 1000 pairs based on occurrence:

With TempIds AS (
   SELECT
      state_table.minute as minute,
      state_table.id as id,
      COUNT(*)
   FROM
      state_table
GROUP BY 1,2
) SELECT
   TempId.minute,
   TempId.id,
   TempId.count
FROM
   TempIds
ORDER BY 3 DESC
LIMIT 1000
;

e.g.

    minute id count
    ------ -- ----
    2      B  1002
    3      A  990
    1      C  800
    3      B  798

How can I modify my query to get the type of the id? For example, there are 1002 <minute=2,id=B> rows. Is there a way to get there are 402 solids and 600 liquids?

    minute id count type
    ------ -- ---- -----
    2      B  402  solid
    2      B  600  liquid
    3      A  330  solid
    3      A  660  liquid

The only way I can think of is a fairly complex nested query:

With TempTop AS (
    With TempIds AS (
       SELECT
          state_table.minute as minute,
          state_table.id as id,
          COUNT(*)
       FROM
          state_table
    GROUP BY 1,2
    ) SELECT
       TempId.minute as minute,
       TempId.id as id,
       TempId.count
    FROM
       TempIds
    ORDER BY 3 DESC
    LIMIT 1000
    )
) SELECT
    state_table.minute,
    state_table.id,
    state_table.type,
    COUNT(*)
FROM
   state_table, TempTop
WHERE
    state_table.minute = TempTop.minute
    AND state_table.id = TempTop.id
;

Is there a simpler way to make this query? Goal:

  • For the Top 1000 most frequent pairs, get the breakdown of the type.

CodePudding user response:

Your query is

SELECT minute, id, COUNT(*)
FROM state_table
GROUP BY minute, id
ORDER BY COUNT(*) DESC
LIMIT 1000;

and in this process you lose the types, because you want the 1000 top minute/id pairs, so you cannot simply group by minute, id, type instead.

If this is only about the types 'solid' and 'liquid', you can apply conditional aggregation to get the separate counts along:

SELECT 
  minute, id, 
  COUNT(*) AS total,
  SUM(type = 'solid') AS solid,
  SUM(type = 'liquid') AS liquid
FROM state_table
GROUP BY minute, id
ORDER BY COUNT(*) DESC
LIMIT 1000;

Summing up the boolean expressions works in MySQL, because true equals 1 and false equals 0 there.

The problem with the above queries are ties, by the way. If there are two pairs with the same 1000th count, you pick one arbitrarily instead of showing only 999 or 1001 pairs then in order to treat both tying pairs the same. So I'd probably re-write the queries using DENSE_RANK in order to properly handle ties.

For the more complicated case where the types are unknown at the point of writing the query, you need rows instead of columns, just as already shown in your request. In that case you really need to group by minute, id, type first. The easiest way to get the total counts is with SUM OVER then. Then rank the pairs as mentioned with DENSE_RANK and keep the top 1000.

SELECT minute, id, type, cnt
FROM
(
  SELECT
    minute, id, type, cnt,
    DENSE_RANK() OVER (ORDER BY total DESC) AS rnk
  FROM
  (
    SELECT
      minute, id, type,
      COUNT(*) AS cnt,
      SUM(COUNT(*)) OVER(PARTITION BY minute, id) AS total
    FROM state_table
    GROUP BY minute, id, type
  ) counted
) ranked
WHERE rnk <= 1000
ORDER BY rnk, minute, id, type;

This can get you more than 1000 minute/id pairs in case of ties. You can reduce this with RANK instead of DENSE_RANK. If these two approaches still don't get the number you want, you may have to count minute/id pairs separately in a subquery instead.

CodePudding user response:

Best way I can think of is, you still have to use 2 CTEs but not nested ones.

You have 3 tasks to achieve,

  1. CTE will count the records.

  2. second one will use row_number() window function to assign row numbers based on decreasing order or count.

  3. last select will do a join between state table and second cte with a where clause specifying row_nunber <= 1000 and grouping based on ID , State and Minute.

    With TempIdsCount AS (
    SELECT
    state_table.minute as minute,
    state_table.id as id,
    COUNT() rw_cnt
    FROM
    state_table GROUP BY 1,2 ) TempIdsRowNum AS (
    SELECT
    minute,
    id,
    Row_number() over ( order by rw_cnt desc ) rw_num
    FROM
    TempIdsCount )
    SELECT
    st.minute,
    st.id, st.type,
    Count(
    ) cnt FROM
    state_table st Join TempIdsRowNum trn on st.minute = trn.minute and st.id = trn.id Where rw_num <= 1000 Group by st.minute, st.id, st.type ;

  • Related