Home > Software engineering >  Find rows with duplicate values in two columns where at least one value in one column is a specific
Find rows with duplicate values in two columns where at least one value in one column is a specific

Time:04-29

So, I'm not sure how it works and I haven't found a sufficient answer by googleing (probably not using the right buzz words). So here it comes: Let's say I have a table like this, let's call it persons

ID Name First Name Country
1 Doe John USA
2 Doe John UK
3 Doe John Brazil
4 Meyer Julia Germany
5 Meyer Julia Austria
6 Picard Jean-Luc France
7 Picard Jean-Luc UK
8 Nakamura Hikaro Japan

Ok, so now I want to select all the rows that have the same name and first name and where at least one country is the UK. So my result set should look like this.

ID Name First_Name Country
1 Doe John USA
2 Doe John UK
3 Doe John Brazil
6 Picard Jean-Luc France
7 Picard Jean-Luc UK

I mean, I know how to find doubles in general like this

SELECT *
FROM persons  p1
JOIN (SELECT NAME, FIRST_NAME, count(*) FROM PERSONS  
GROUP BY FIRST_NAME, NAME having count(*) >1) p2
ON p1.NAME = p2.NAME 
AND p1.FIRST_NAME = p2.FIRST_NAME; 

But this also results in having Julia Meyer in there and I don't want her in there.

Any suggestions?

CodePudding user response:

Count country of interest conditionally

SELECT *
FROM persons  p1
JOIN (
   SELECT NAME, FIRST_NAME
   FROM PERSONS  
   GROUP BY FIRST_NAME, NAME 
   having count(*) > 1 and count(case country = 'UK' then 1 end) >= 1
) p2 ON p1.NAME = p2.NAME 
   AND p1.FIRST_NAME = p2.FIRST_NAME; 

CodePudding user response:

Use EXISTS:

SELECT p1.*
FROM persons p1
WHERE EXISTS (
  SELECT *
  FROM persons p2
  WHERE p2.ID <> p1.ID
    AND p2.Name = p1.Name AND p2.FirstName = p1.FirstName 
    AND 'UK' IN (p1.Country, p2.Country)
);

See the demo.

CodePudding user response:

I want to select all the rows that have the same name and first name and where at least one country is the UK.

You can use the COUNT analytic function with conditional aggregation (and solve the problem in a single table scan and without any self-joins):

SELECT id, name, first_name, country
FROM   (
  SELECT t.*,
         COUNT(CASE country WHEN 'UK' THEN 1 END)
           OVER (PARTITION BY name, first_name) AS cnt
  FROM  table_name t
)
WHERE  cnt > 0;

Which, for the sample data:

CREATE TABLE table_name (ID, Name, First_Name, Country) AS
SELECT 1, 'Doe',    'John',     'USA' FROM DUAL UNION ALL
SELECT 2, 'Doe',    'John',     'UK' FROM DUAL UNION ALL
SELECT 3, 'Doe',    'John',     'Brazil' FROM DUAL UNION ALL
SELECT 4, 'Meyer',  'Julia',    'Germany' FROM DUAL UNION ALL
SELECT 5, 'Meyer',  'Julia',    'Austria' FROM DUAL UNION ALL
SELECT 6, 'Picard',     'Jean-Luc',     'France' FROM DUAL UNION ALL
SELECT 7, 'Picard',     'Jean-Luc',     'UK' FROM DUAL UNION ALL
SELECT 8, 'Nakamura',   'Hikaro',   'Japan' FROM DUAL;

Outputs:

ID NAME FIRST_NAME COUNTRY
1 Doe John USA
2 Doe John UK
3 Doe John Brazil
6 Picard Jean-Luc France
7 Picard Jean-Luc UK

If you want to find duplicate rows where at least one is UK then also count all the rows in the partition:

SELECT id, name, first_name, country
FROM   (
  SELECT t.*,
         COUNT(CASE country WHEN 'UK' THEN 1 END)
           OVER (PARTITION BY name, first_name) AS cnt_uk,
         COUNT(*)
           OVER (PARTITION BY name, first_name) AS cnt_all
  FROM  table_name t
)
WHERE  cnt_uk > 0
AND    cnt_all >= 2;

Which gives the same output for the sample data.

db<>fiddle here

CodePudding user response:

There are 2 conditions, 1) containing 'UK' AND 2) having count(1) > 1. So, this query below would work.

SELECT p1.*
FROM persons p1
WHERE (p1.NAME, p1.FIRST_NAME) IN (
    SELECT p2.NAME, p2.FIRST_NAME
    FROM persons p2
    WHERE p2.Country = 'UK') AND
  AND (p1.NAME, p1.FIRST_NAME) IN (
    SELECT p3.NAME, p3.FIRST_NAME
    FROM persons p3
    GROUP BY p3.NAME, p3.FIRST_NAME
    HAVING COUNT(1) > 1)
  • Related