There are two ways to do the same thing.
Example: get the distinct names from a persons table.
The first way is:
SELECT name
FROM person
GROUP BY name
This has the same result as:
SELECT DISTINCT name
FROM person
What I am curious about: is there anything different in the way the PostgreSQL engine processes the two commands, and is one of them faster, or do they do exactly the same thing?
CodePudding user response:
Whether to use DISTINCT or GROUP BY depends on whether you care about performance. GROUP BY can be slower when the planner adds a sorting step, performed at the end of execution, that DISTINCT does not need. Do check that GROUP BY provides the results you're looking for, though: in some cases the results of the query differ between DISTINCT and GROUP BY.
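As a sketch of such a difference: DISTINCT de-duplicates entire result rows, while GROUP BY forms groups that can feed aggregate functions. The `city` column below is a hypothetical addition to the question's `person` table, used only for illustration.

```sql
-- Hypothetical "city" column (an assumption, not part of the question's schema).

-- DISTINCT de-duplicates whole rows: one row per distinct (name, city) pair.
SELECT DISTINCT name, city FROM person;

-- GROUP BY collapses rows into groups, so aggregates become possible:
-- one row per name, with the number of distinct cities for that name.
SELECT name, COUNT(DISTINCT city) FROM person GROUP BY name;
```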
CodePudding user response:
For only a few records (e.g. 100,000) it doesn't really matter; both will use the same HashAggregate method. A code golfer would then prefer DISTINCT because its syntax is a bit shorter. GROUP BY is really intended to be used with aggregate functions anyway, like MAX, SUM, COUNT, AVG, ...
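To illustrate that point, a minimal sketch against the Persons demo table created below (same table and column names as in the demo):

```sql
-- One row per distinct name, plus how often it occurs.
-- DISTINCT alone cannot express this; it only de-duplicates the select list.
SELECT Name, COUNT(*) AS occurrences
FROM Persons
GROUP BY Name
ORDER BY occurrences DESC;
```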
But for a larger recordset there's a difference.
For example, in this demo:
create table Persons ( Name varchar(30) );

INSERT INTO Persons (Name)
SELECT arrays.firstnames[s.a % ARRAY_LENGTH(arrays.firstnames,1) + 1]
       || arrays.lastnames[s.a % ARRAY_LENGTH(arrays.lastnames,1) + 1] AS name
FROM generate_series(1,600000) AS s(a) -- number of names to generate
CROSS JOIN (
  SELECT ARRAY[
    'Adam','Bill','Bob','Calvin','Donald','Dwight','Frank','Fred','George','Howard',
    'James','John','Jacob','Jack','Martin','Matthew','Max','Michael','Lukas',
    'Paul','Peter','Phil','Roland','Ronald','Samuel','Steve','Theo','Warren','William',
    'Abigail','Alice','Allison','Amanda','Anne','Barbara','Betty','Carol','Cleo','Donna',
    'Jane','Jennifer','Julie','Martha','Mary','Melissa','Patty','Sarah','Simone','Susan'
  ] AS firstnames,
  ARRAY[
    'Matthews','Smith','Jones','Davis','Jacobson','Williams','Donaldson','Maxwell','Peterson','Storms','Stevens',
    'Franklin','Washington','Jefferson','Adams','Jackson','Johnson','Lincoln','Grant','Fillmore','Harding','Taft',
    'Truman','Nixon','Ford','Carter','Reagan','Bush','Clinton','Hancock'
  ] AS lastnames
) AS arrays;
select count(*) from Persons;

 count
--------
 600000
explain analyse select distinct Name from Persons;

QUERY PLAN
HashAggregate  (cost=6393.82..6395.82 rows=200 width=78) (actual time=194.609..194.757 rows=1470 loops=1)
  Group Key: name
  ->  Seq Scan on persons  (cost=0.00..5766.66 rows=250866 width=78) (actual time=0.030..61.243 rows=600000 loops=1)
Planning time: 0.259 ms
Execution time: 194.898 ms
explain analyse select Name from Persons group by Name;

QUERY PLAN
Group  (cost=5623.88..5625.88 rows=200 width=78) (actual time=226.358..227.145 rows=1470 loops=1)
  Group Key: name
  ->  Sort  (cost=5623.88..5624.88 rows=400 width=78) (actual time=226.356..226.596 rows=4410 loops=1)
        Sort Key: name
        Sort Method: quicksort  Memory: 403kB
        ->  Gather  (cost=5564.59..5606.59 rows=400 width=78) (actual time=206.700..219.546 rows=4410 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Partial HashAggregate  (cost=4564.59..4566.59 rows=200 width=78) (actual time=196.862..197.072 rows=1470 loops=3)
                    Group Key: name
                    ->  Parallel Seq Scan on persons  (cost=0.00..4303.27 rows=104528 width=78) (actual time=0.039..66.876 rows=200000 loops=3)
Planning time: 0.069 ms
Execution time: 227.301 ms
db<>fiddle here
So in this example the DISTINCT was still faster.
But since the GROUP BY plan started to use parallel workers, the outcome may also depend on the server hosting the PostgreSQL instance.
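To see how much of the gap comes from parallelism, you can disable parallel plans for the session and compare the plans again. A sketch, assuming the demo table above (`max_parallel_workers_per_gather` is a standard PostgreSQL setting):

```sql
-- Disable parallel plans for this session only.
SET max_parallel_workers_per_gather = 0;

-- Re-run the comparison; both queries now get serial plans.
EXPLAIN ANALYZE SELECT DISTINCT Name FROM Persons;
EXPLAIN ANALYZE SELECT Name FROM Persons GROUP BY Name;

-- Restore the default.
RESET max_parallel_workers_per_gather;
```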