Home > OS >  How to self join only a subset of rows in PostgreSQL?
How to self join only a subset of rows in PostgreSQL?

Time:06-30

Given the following table:

CREATE TABLE people (
    name TEXT PRIMARY KEY,
    age  INT  NOT NULL
);

INSERT INTO people VALUES
    ('Lisa',     30),
    ('Marta',    27),
    ('John',     32),
    ('Sam',      41),
    ('Alex',     12),
    ('Aristides',43),
    ('Cindi',     1)
    ;

I am using a self join to compare each value of a specific column with all the other values of the same column. My query looks something like this:

SELECT DISTINCT A.name as child
FROM people A, people B
WHERE A.age   16 < B.age;

This query aims to spot potential sons/daughters based on age difference. More specifically, my goal is to identify the set of people that may have stayed in the same house as one of their parents (ordered by name), assuming that there must be an age difference of at least 16 years between a child and their parents.

Now I would like to combine this kind of logic with the information that is in another table.

The other table looks something like that:

CREATE TABLE houses (
    house_name TEXT NOT NULL,
    house_member TEXT NOT NULL REFERENCES people(name)
);    

INSERT INTO houses VALUES
        ('house Smith', 'Lisa'),
        ('house Smith', 'Marta'),
        ('house Smith', 'John'),
        ('house Doe', 'Lisa'),
        ('house Doe', 'Marta'),
        ('house Doe', 'Alex'),
        ('house Doe', 'Sam'),
        ('house McKenny', 'Aristides'),
        ('house McKenny', 'John'),
        ('house McKenny', 'Cindi')
        ;

The two tables can be joined ON houses.house_member = people.name.

More specifically I would like to spot the children only within the same house. It does not make sense to compare the age of each person with the age of all the others, but instead it would be more efficient to compare the age of each person with all the other people in the same house.

My idea is to perform the self join from above but only within a PARTITION BY household_name. However, I don't think this is a good idea since I do not have an aggregate function. Same applies for GROUP BY statements as well. What could I do here?

The expected output should be the following, ordered by house_member:

house_member
Alex
Cindi

For simplicity I have created a fiddle.

CodePudding user response:

At first join two tables to build one table that has all three bits of info: house_name, house_member, age.

And then join it with itself just as you did originally and add one extra filter to look only at the same households.

WITH
CTE_All
AS
(
    SELECT
        houses.house_name
        ,houses.house_member
        ,people.age
    FROM
        houses
        INNER JOIN people ON people.name = houses.house_member
)
SELECT DISTINCT
    Children.house_name
    ,Children.house_member AS child_name
FROM
    CTE_All AS Children
    INNER JOIN CTE_All AS Parents 
        ON Children.age   16 < Parents.age
        -- this is our age difference
        AND Children.house_name = Parents.house_name
        -- within the same house
;

All this is one single query. You don't have to use CTE, you can inline it as a subquery, but it is more readable with CTE.

Result

house_name    | child_name
:------------ | :---------
house Doe     | Alex      
house McKenny | Cindi     
  • Related