There are two ways to do the same thing.
Example: get the distinct names from a persons table.
The first way is:
SELECT name
FROM person
GROUP BY name
This has the same result as:
SELECT DISTINCT name
FROM person
What I am curious about: is there anything different in the way the PostgreSQL engine processes the two commands, and is one of them faster, or do they do exactly the same thing?
CodePudding user response:
Whether to use DISTINCT or GROUP BY depends on whether you care about performance. GROUP BY can be slower when the planner adds a sorting step, performed at the end of execution, that DISTINCT does not need. Do check that GROUP BY provides the results you're looking for, though: in some cases the results of the query differ between DISTINCT and GROUP BY.
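As a sketch of such a difference: DISTINCT de-duplicates entire result rows, while GROUP BY forms groups that can feed aggregate functions. The `city` column below is a hypothetical addition to the question's `person` table, used only for illustration.

```sql
-- Hypothetical "city" column (an assumption, not part of the question's schema).

-- DISTINCT de-duplicates whole rows: one row per distinct (name, city) pair.
SELECT DISTINCT name, city FROM person;

-- GROUP BY collapses rows into groups, so aggregates become possible:
-- one row per name, with the number of distinct cities for that name.
SELECT name, COUNT(DISTINCT city) FROM person GROUP BY name;
```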
CodePudding user response:
For only a few records (e.g. 100,000) it doesn't really matter; both will use the same HashAggregate method. A code golfer would then prefer DISTINCT because its syntax is a bit shorter. GROUP BY is really intended to be used with aggregate functions anyway, like MAX, SUM, COUNT, AVG, ...
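To illustrate that point, a minimal sketch against the Persons demo table created below (same table and column names as in the demo):

```sql
-- One row per distinct name, plus how often it occurs.
-- DISTINCT alone cannot express this; it only de-duplicates the select list.
SELECT Name, COUNT(*) AS occurrences
FROM Persons
GROUP BY Name
ORDER BY occurrences DESC;
```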
But for a larger recordset there's a difference.
For example, in this demo:
create table Persons ( Name varchar(30) );

INSERT INTO Persons (Name)
SELECT arrays.firstnames[s.a % ARRAY_LENGTH(arrays.firstnames,1) + 1]
       || arrays.lastnames[s.a % ARRAY_LENGTH(arrays.lastnames,1) + 1] AS name
FROM generate_series(1,600000) AS s(a) -- number of names to generate
CROSS JOIN (
  SELECT ARRAY[
    'Adam','Bill','Bob','Calvin','Donald','Dwight','Frank','Fred','George','Howard',
    'James','John','Jacob','Jack','Martin','Matthew','Max','Michael','Lukas',
    'Paul','Peter','Phil','Roland','Ronald','Samuel','Steve','Theo','Warren','William',
    'Abigail','Alice','Allison','Amanda','Anne','Barbara','Betty','Carol','Cleo','Donna',
    'Jane','Jennifer','Julie','Martha','Mary','Melissa','Patty','Sarah','Simone','Susan'
  ] AS firstnames,
  ARRAY[
    'Matthews','Smith','Jones','Davis','Jacobson','Williams','Donaldson','Maxwell','Peterson','Storms','Stevens',
    'Franklin','Washington','Jefferson','Adams','Jackson','Johnson','Lincoln','Grant','Fillmore','Harding','Taft',
    'Truman','Nixon','Ford','Carter','Reagan','Bush','Clinton','Hancock'
  ] AS lastnames
) AS arrays;
select count(*) from Persons;

 count
--------
 600000
explain analyse select distinct Name from Persons;

QUERY PLAN
HashAggregate  (cost=6393.82..6395.82 rows=200 width=78) (actual time=194.609..194.757 rows=1470 loops=1)
  Group Key: name
  ->  Seq Scan on persons  (cost=0.00..5766.66 rows=250866 width=78) (actual time=0.030..61.243 rows=600000 loops=1)
Planning time: 0.259 ms
Execution time: 194.898 ms
explain analyse select Name from Persons group by Name;

QUERY PLAN
Group  (cost=5623.88..5625.88 rows=200 width=78) (actual time=226.358..227.145 rows=1470 loops=1)
  Group Key: name
  ->  Sort  (cost=5623.88..5624.88 rows=400 width=78) (actual time=226.356..226.596 rows=4410 loops=1)
        Sort Key: name
        Sort Method: quicksort  Memory: 403kB
        ->  Gather  (cost=5564.59..5606.59 rows=400 width=78) (actual time=206.700..219.546 rows=4410 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Partial HashAggregate  (cost=4564.59..4566.59 rows=200 width=78) (actual time=196.862..197.072 rows=1470 loops=3)
                    Group Key: name
                    ->  Parallel Seq Scan on persons  (cost=0.00..4303.27 rows=104528 width=78) (actual time=0.039..66.876 rows=200000 loops=3)
Planning time: 0.069 ms
Execution time: 227.301 ms
db<>fiddle here
So in this example the DISTINCT was still faster.
But since the GROUP BY plan started to use parallel workers, the outcome may also depend on the server hosting the PostgreSQL instance.
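To see how much of the gap comes from parallelism, you can disable parallel plans for the session and compare the plans again. A sketch, assuming the demo table above (`max_parallel_workers_per_gather` is a standard PostgreSQL setting):

```sql
-- Disable parallel plans for this session only.
SET max_parallel_workers_per_gather = 0;

-- Re-run the comparison; both queries now get serial plans.
EXPLAIN ANALYZE SELECT DISTINCT Name FROM Persons;
EXPLAIN ANALYZE SELECT Name FROM Persons GROUP BY Name;

-- Restore the default.
RESET max_parallel_workers_per_gather;
```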