I have the following data that has certain rows which are implicitly duplicated, but don't appear as such. I want to identify them based on certain conditions, compare them and then remove them.
| ID | COUNTRY | NAME | UNITS |
|------|---------|-------|-------|
| 1000 | Canada | Andy | 1 |
| 1000 | Canada | Andy | 2 |
| 1000 | USA | Andy | 3 |
| 1000 | China | Bob | 3 |
| 2000 | Canada | Chase | 2 |
| 2000 | Canada | David | 3 |
| 2000 | USA | David | 4 |
| 3000 | Canada | John | 1 |
| 3000 | USA | John | 1 |
| 3000 | China | Dave | 3 |
Condition 1: For ID = 1000, Andy is shown to be part of Canada and USA. In this case, I want to compare the sum of units for the two countries and keep the first country (alphabetically) if the sum of the Units is same. If the sum of Units for USA is greater, then keep the second Country (e.g., USA)
Output:
| ID | COUNTRY | NAME | UNITS |
|------|---------|-------|-------|
| 1000 | Canada | Andy | 1 |
| 1000 | Canada | Andy | 2 |
| 1000 | China | Bob | 3 |
Condition 2: For ID = 2000 and name David, the Country with the most Units is USA. So, I want to keep the second row in that case.
Output:
| ID | COUNTRY | NAME | UNITS |
|------|---------|-------|-------|
| 2000 | Canada | Chase | 2 |
| 2000 | USA | David | 4 |
Condition 3: For ID = 3000, the units in both case is the same, so keep the first country alphabetically.
Output:
| ID | COUNTRY | NAME | UNITS |
|------|---------|-------|-------|
| 3000 | Canada | John | 1 |
| 3000 | China | Dave | 3 |
Final Output
| ID | COUNTRY | NAME | UNITS |
|------|---------|-------|-------|
| 1000 | Canada | Andy | 1 |
| 1000 | Canada | Andy | 2 |
| 1000 | China | Bob | 3 |
| 2000 | Canada | Chase | 2 |
| 2000 | USA | David | 4 |
| 3000 | Canada | John | 1 |
| 3000 | China | Dave | 3 |
DDL Statement
CREATE TABLE #tmptbl (
id INT,
type VARCHAR(20),
name VARCHAR(20),
qty INT
)
INSERT INTO #tmptbl VALUES
(1000, 'Canada', 'Andy', 1),
(1000, 'Canada', 'Andy', 2),
(1000, 'USA', 'Andy', 3),
(1000, 'China', 'Bob', 3),
(2000, 'Canada', 'Chase', 2),
(2000, 'Canada', 'David', 3),
(2000, 'USA', 'David', 4),
(3000, 'Canada', 'John', 1),
(3000, 'USA', 'John', 1),
(3000, 'China', 'Dave', 3)
CodePudding user response:
Seems like this is a spin on the classic Get top 1 row of each group, but where you need to first get the windowed SUM
and instead use DENSE_RANK
(or RANK
):
WITH Totals AS
(SELECT tt.ID,
tt.Country,
tt.Name,
tt.Units,
SUM(tt.Units) OVER (PARTITION BY tt.ID, tt.Name, tt.Country) AS TotalUnits
FROM #tmptbl tt),
Ranks AS
(SELECT T.ID,
T.Country,
T.Name,
T.Units,
T.TotalUnits,
DENSE_RANK() OVER (PARTITION BY ID, Name ORDER BY T.TotalUnits DESC, T.Country ASC) AS DenseRank
FROM Totals T)
SELECT R.ID,
R.Country,
R.Name,
R.Units
FROM Ranks R
WHERE R.DenseRank = 1
ORDER BY ID,
Name,
Units;