what is the fastest way to find the same geopoint in a table?-CodePudding

i have a table with 80 000 000 rows which are geopoints from 20 000 different gid.

my table has this form :

gid	idx_pt	id_unique	x	y	geom
1	1	na	65250.78	700000.52	01010000206A.....
1	2	na	65250.78	700000.52	01010000206A.....
1	3	na	65250.78	700000.52	01010000206A.....
1	4	na	65999.00	700555.00	01010000455A.....
1	5	na	65999.00	700555.00	01010000455A.....
1	6	na	65999.00	700555.00	01010000455A.....

for each gid i have like thousands of gepoints (ex gid 1 ==> 5097 rows).

Some of the geopoints are the same (like the three first points in my example above) and i want for each idx_pt to find the first geopoint of the gid When identified i need to update the id_unique columns with the idx_pt of the first point found.

what i am looking for :

gid	idx_pt	id_unique	x	y	geom
1	1	1	65250.78	700000.52	01010000206A.....
1	2	1	65250.78	700000.52	01010000206A.....
1	3	1	65250.78	700000.52	01010000206A.....
1	4	4	65999.00	700555.00	01010000455A.....
1	5	4	65999.00	700555.00	01010000455A.....
1	6	4	65999.00	700555.00	01010000455A.....

I succeed to perform this update but i have a problem of performance. I run my postgresql server on a small server (nas synology) and the query is running for more than 24 hours now and just 50% of the query is done ... to be sure that even if the server closed prematurally i keep the update rows i do a for loop update with a commit instruction for each gid.

my query for gid = 1 :

SELECT a.idx_pt as idx_pt, c.idx_pt as id_unique
FROM mytable a
CROSS JOIN LATERAL
(
SELECT  idx_pt as idx_pt
FROM mytable 
WHERE
gid = a.gid
AND
ST_Equals(a.geom , geom)
ORDER BY idx_pt
LIMIT 1
) c 
WHERE a.gid = 1

The logic is : for each idx_pt i do a lateral join based on the same gid, same geopoints ordered by the idx_pt column and keeping only the first rows.

this query take 12 seconds to perform on my little server (5097 rows X 5097 rows)

My problem is than i have 20 000 gid for a total of 80 000 000 rows so it is very very slow.

My Question : how could i improve the performance for this task ?

Thanks in advance

CodePudding user response：

The window function FIRST_VALUE() with a PARTITION by gid, geom is what you're looking for:

SELECT 
  gid,idx_pt,
  FIRST_VALUE(idx_pt) OVER w AS id_unique,
  x,y,geom
FROM mytable
WHERE gid = 1
WINDOW w AS (PARTITION BY gid,geom ORDER BY idx_pt
             RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY gid, idx_pt;

A CROSS JOIN means that for each record you would fire a full scan in your 80m records table, which is highly inefficient.

The frame RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is optional in this case, but it is considered to be a good practice to make things explicit. It basically tells the function to perform its operations in the given partition considering all records from it, as UNBOUNDED PRECEDING means every record before the current row and UNBOUNDED FOLLOWING means every record after the current row. Window functions are not very palatable at first, but once you get the gist of it they become your best friends.

Regarding speed: make sure the column gid is properly indexed.

Demo: db<>fiddle

Side note: to store the longitude and latitude values alongside a geometry column containing the same coordinates is not really necessary. You can always get the x and y values from the geometries using ST_X and ST_Y, respectively. In a 80m rows table this would make some difference in disk usage ;)