I need help with SQL Server 2014 (v12.0) filtering out duplicates and this might be too big of an ask, but any help would be appreciated. After importing a table of data, we have some duplicates that we would like to manage.
Example of the data is here:
eCRFNo | NotesURL | Assignee |
---|---|---|
101 | aaa.com | Person 1 |
101 | aaa.com | Person 2 |
101 | aaa.com | Person 3 |
101 | bbb.com | Person 4 |
The goal is to only have 1 unique Control Number (eCRFNo
) in the table. If there are duplicates of eCRFNo
and NotesURL
, then the first one (ordered by eCRFNo, NotesURL, Assignee
) should be picked as the final record and the Assignee's of the other matches be placed in a new column called Watcher (concatenated).
If there is a duplicate eCRFNo
with a NotesURL
that isn't a duplicate, then only the first (ordered by eCRFNo, NotesURL, Assignee
) should be kept and "Replication" be added to a new column called Label.
The results of the example should look like the table below:
eCRFNo | NotesURL | Assignee | Watcher | Label |
---|---|---|---|---|
101 | aaa.com | Person 1 | Person 2, Person 3 | Replication |
It is OK that we lost bbb.com in the data and that we lost Person 4.
I will post the tables if this will post at all. I am having trouble posting with tables and code. I need help with my help.
Here is the section of code that is close to what we want:
WITH cte AS
(
SELECT
Matched.*,
ROW_NUMBER() OVER (PARTITION BY NotesURL ORDER BY NotesURL) routing_dup,
ROW_NUMBER() OVER (PARTITION BY eCRFNo ORDER BY eCRFNo) crf_dup
FROM
Matched
)
UPDATE m
SET m.Watcher = a.Watcher,
m.Label = a.Label,
m.RouteDup = a.routing_dup,
m.CRFDup = a.crf_dup
FROM
(SELECT
cte1.UserID, cte1.NotesURL, cte1.eCRFNo, cte1.UID,
cte1.routing_dup, cte1.crf_dup,
STUFF((SELECT ';' CAST(cte2.UserID AS varchar(max))
FROM cte AS cte2
WHERE cte1.NotesURL = cte2.NotesURL
AND routing_dup > 1
FOR XML PATH ('')), 1, 1, '') AS Watcher,
STUFF((SELECT 'Replication'
FROM cte AS cte2
WHERE cte1.eCRFNo = cte2.eCRFNo
AND cte1.NotesURL <> cte2.NotesURL
AND cte2.CRFDup = 2
AND cte1.CRFDup = 1
FOR XML PATH ('')), 1, 0, '') AS Label
FROM
cte AS cte1) AS a
JOIN
matched m ON m.UID = a.UID
CodePudding user response:
Since you're using MS SQL v12.0 (SQL Server 2014), you'll need to use SQL Server's XML functionality (vs. using STRING_AGG
in SQL Server >= 2017).
Use a CTE
to get the DENSE_RANK
and ROW_NUMBER
, then convert your rows to an XML
string then use SQL Server's XML functionality.
WITH CTE AS (SELECT *,
DENSE_RANK() OVER (PARTITION BY ecRFNo ORDER BY ecRFNo, NotesURL ASC) as DR,
ROW_NUMBER() OVER (PARTITION BY ecRFNo ORDER BY ecRFNo, NotesURL, Assignee ASC) as RN
FROM your_table)
SELECT *,
CASE WHEN a.Watcher IS NOT NULL THEN 'Replication' ELSE 'Unique' END AS Label
FROM (SELECT ecRFNo, NotesURL, Assignee,
stuff( (select DISTINCT ', ' cast(t.Assignee as varchar(max))
from CTE t
WHERE RN > 1 AND t.ecRFNo = a.ecRFNo AND t.NotesURL = a.NotesURL
for xml path ('')
), 1, 2, '') AS Watcher
FROM CTE a WHERE DR = 1 AND RN = 1
GROUP BY ecRFNo, NotesURL, Assignee) a
Note: If duplicates are not found, the column Watcher
will display a NULL
value and the column Label
will display the value Unique
.
If you want to UPDATE
your table and DELETE
the rows that are no longer needed, add your new columns then wrap the query above in another CTE
with an UDPATE
statement:
ALTER TABLE your_table
ADD
Watcher VARCHAR(250),
Label VARCHAR(25);
WITH CTE AS (SELECT *,
DENSE_RANK() OVER (PARTITION BY ecRFNo ORDER BY ecRFNo, NotesURL ASC) as DR,
ROW_NUMBER() OVER (PARTITION BY ecRFNo ORDER BY ecRFNo, NotesURL, Assignee ASC) as RN
FROM your_table),
CTE2 AS
(SELECT *,
CASE WHEN a.Watcher IS NOT NULL THEN 'Replication' ELSE 'Unique' END AS Label
FROM (SELECT ecRFNo, NotesURL, Assignee,
stuff( (select DISTINCT ', ' cast(t.Assignee as varchar(max))
from CTE t
WHERE RN > 1 AND t.ecRFNo = a.ecRFNo AND t.NotesURL = a.NotesURL
for xml path ('')
), 1, 2, '') AS Watcher
FROM CTE a WHERE DR = 1 AND RN = 1
GROUP BY ecRFNo, NotesURL, Assignee) a)
UPDATE your_table
SET your_table.Watcher = b.Watcher, your_table.Label = b.Label
FROM CTE2 b
WHERE your_table.ecRFNo = b.ecRFNo
AND your_table.NotesURL = b.NotesURL
AND your_table.Assignee = b.Assignee
Finally, DELETE
the rows you no longer require by using the Label
column; anything not touched by the UPDATE
above will remain NULL
.
DELETE FROM your_table WHERE Label IS NULL
Result:
| ecRFNO | NotesURL | Assignee | Watcher | Label |
|--------|----------|----------|--------------------|-------------|
| 101 | aaa.com | Person 1 | Person 2, Person 3 | Replication |
Fiddle here.