Home > Back-end >  SQL spread out one group proportionally to other groups
SQL spread out one group proportionally to other groups

Time:10-15

So, for example, we have a number of users with different group id. Some of them don't have group:

userID groupID
-------------
user1 group1
user2 group1
user3 group2
user4 group1
user5 NULL
user6 NULL
user7 NULL
user8 NULL

We need to group users by their groupID. And we want users without group (groupID equals NULL) to be assigned to one of existing groups(group1 or group2 in this example). But we want to distribute them proportionally to amount of users already assigned to those groups. In our example group1 has 3 users and group2 has only 1 user. So 75% (3/4) of new users should be counted as members of group 1 and other 25% (1/4) should be "added" to group2. The end result should look like this:

groupID numOfUsers
-------------
group1 6
group2 2

This is a simplified example. Basically we just can't figure out how users without a group can be divided between groups in a certain proportion, not just evenly distributed between them.

We can have any number of groups and users, so we can't just hardcode percentages. Any help is appreciated.

Edit: I tried to use NTILE(), but it gives even distribution, not proportional to amount of users in groups

SELECT userID , 
        NTILE(2) OVER( ) gr
from(
select DISTINCT userID
from test_task
WHERE groupID IS NULL ) AS abc

CodePudding user response:

here is one way:

select 
  groupid
  , count(*) 
    round(count(*) / sum(count(*)) over(),0) * (select count(*) from table where groupid ='no_group')
from table
where groupid <> 'no_group'
group by groupid

CodePudding user response:

We can use an updatable CTE to do this

  • First, we take all existing data, group it up by groupID, and calculate a running sum of the number of rows, as well as the total rows over the whole set
  • We take the rows we want to update and add a row-number (subtract 1 so the calculations work)
  • Join the two based on that row-number modulo the total existing rows should be between the previous running sum and the current running sum
  • Note that this only works well when there are a divisible number of rows eg. 4 or 8, by 4 existing rows
WITH Groups AS (
    SELECT
      groupID,
      perGroup = COUNT(*),
      total = SUM(COUNT(*)) OVER (),
      runningSum = SUM(COUNT(*)) OVER (ORDER BY groupID ROWS UNBOUNDED PRECEDING)
    FROM test_task
    WHERE groupID IS NOT NULL
    GROUP BY groupID
),
ToUpdate AS (
    SELECT
      groupID,
      userID,
      rn = ROW_NUMBER() OVER (ORDER BY userID) - 1
    FROM test_task tt
    WHERE groupID IS NULL
)
UPDATE u
SET groupID = g.groupID
FROM ToUpdate u
JOIN Groups g
  ON u.rn % (g.total) >= g.runningSum - g.perGroup
 AND u.rn % (g.total) < g.runningSum;

db<>fiddle

  • Related