Home > front end >  mysql group by with having removes results from sum over
mysql group by with having removes results from sum over

Time:10-31

I have a query that is grouping items, selecting the number of items in a group, and also what percentage that group is of all of the items that were not filtered out in the where clause

I would like to remove any groups that have two or fewer items in them in order to have fewer results that are a small fraction of the entire dataset.

here is my (simplified) query:

select count(i.itemId) as itemCount,
       concat(format(100 * count(i.itemId) / sum(count(i.itemId)) over (), 2), '%') as totalPercentage
from thing t
join item i on t.thingId = i.thingId
where t.createdDate > startdate and t.createdDate < enddate
group by t.thingId
order by count(i.itemId) desc, t.thingId desc;

I thought that I should just add a having clause like this:

...
group by t.thingId
having count(i.itemId) > 2
order by count(i.itemId) desc, t.thingId desc;

would solve the problem, but when I do that the results show a total percentage that is now inaccurate, because the sum(count(i.itemId)) over () is now ignoring the groups that have 2 or fewer items in them.

I know that it is possible to do this by making this query an inner query and then filtering in a select that is outside this query, but I would prefer not to do that, given that on my team we try to avoid inner queries unless they are necessary.

I also know that it is possible to make a temp table, and filter this by selecting from that temp table, but I don't even want to go into that, because it seems ugly.

TLDR: Is it possible to filter out groups from the result of a select, but have them still included in a sum() over () clause without any shenanigans?

CodePudding user response:

Calculate first the percentage and use that as subquery and then filter later, as any filtering will remove rows that you need

you ned up with

SELECT
    itemCount,totalPercentage
FROM
    (select 
        t.thingId,
        count(i.itemId) as itemCount,
        concat(format(100 * count(i.itemId) / sum(count(i.itemId)) over (), 2), '%') as totalPercentage
    from thing t
    join item i on t.thingId = i.thingId
    where t.createdDate > startdate and t.createdDate < enddate
    group by t.thingId) t1
WHERE intemCount = 2
order by itemCount desc, thingId desc;

CodePudding user response:

Generally speaking it's not possible to compute percentages in SQL without the help of a subquery. To generate percentages you need a total to divide by and that takes a separate aggregating query.

I hope your team's policy about subqueries actually says "avoid dependent a/k/a correlated subqueries whenever possible." Those are the ones that can hammer performance. Avoiding all subqueries as a matter of policy? That's sort of like avoiding subroutines in programming -- it prohibits the use of a major language feature.

CodePudding user response:

you can try case statements so that it gives null percentage then later filter out null values in next stage be it sql or some front end code.

...
case when count(i.itemId) > 2 
then concat(format(100 * count(i.itemId) / sum(count(i.itemId)) over (), 2), '%') end as 
totalPercentage
...

this might just increase computation time, or just an outer query to filter it out can be used as mentioned in other answers.

  • Related