Why is PostGIS slower than MySQL at searching spatial data?


I heard that PostGIS is optimized for searching geographic data such as points and polygons, so I tested it to figure out how much faster it is than MySQL.

First, this is the query currently used in MySQL for searching nearby places around a given latitude and longitude:

select * from place where mbrcontains(ST_LINESTRINGFROMTEXT('LINESTRING(127.0214 37.4777, 126.9444 37.4166)'), coord)

It takes 30 milliseconds on average.
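
For reference, a hypothetical table definition that would support this query (only the table and column names are taken from the post; the rest is assumed). MBRContains can only use an index when coord has a SPATIAL index:

create table place (
  id int auto_increment primary key,
  coord geometry not null,  -- spatial indexes in MySQL 5.7 require NOT NULL columns
  spatial index coord_index (coord)
) engine = InnoDB;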

These queries were tested for the same purpose in PostgreSQL with the PostGIS extension. The coord column is of type geography, the coord2 column is of type geometry, and both columns have GiST indexes.
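
A sketch of that schema (the column types are as described and the index names appear in the query plans below; everything else is assumed):

create table place (
  id serial primary key,
  coord geography(Point, 4326),
  coord2 geometry(Point, 4326)
);
create index coord_index on place using gist (coord);
create index coord2_index on place using gist (coord2);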

1. select * from place
where coord2 && st_setsrid(st_makebox2d(ST_point(127.0214, 37.4777), ST_Point(126.9444, 37.4166)), 4326);

2. select * from place
where st_dwithin('SRID=4326;POINT(126.9829371 37.4472168)'::geography, coord, 1000);

3. select * from place
where 'LINESTRING(127.0214 37.4777, 126.9444 37.4166)'::geometry ~ coord2

These take between about 150 and 250 milliseconds on average.

So, contrary to my expectation, MySQL is about five times faster than PostGIS, if not more.

The specs of the two database servers are the same, and the test table has about 40,000 rows.
MySQL is version 5.7.33 and PostgreSQL is 14.5.

Is this a normal result?

Query plans for each query:

1. "Bitmap Heap Scan on place  (cost=10.32..622.30 rows=264 width=176) (actual time=0.083..0.117 rows=84 loops=1)"
"  Recheck Cond: (coord2 && '0103000020E61000000100000005000000EA95B20C71BC5F40BEC1172653B54240EA95B20C71BC5F404CA60A4625BD42409A081B9E5EC15F404CA60A4625BD42409A081B9E5EC15F40BEC1172653B54240EA95B20C71BC5F40BEC1172653B54240'::geometry)"
"  Heap Blocks: exact=12"
"  Buffers: shared hit=20"
"  ->  Bitmap Index Scan on coord2_index  (cost=0.00..10.26 rows=264 width=0) (actual time=0.076..0.077 rows=84 loops=1)"
"        Index Cond: (coord2 && '0103000020E61000000100000005000000EA95B20C71BC5F40BEC1172653B54240EA95B20C71BC5F404CA60A4625BD42409A081B9E5EC15F404CA60A4625BD42409A081B9E5EC15F40BEC1172653B54240EA95B20C71BC5F40BEC1172653B54240'::geometry)"
"        Buffers: shared hit=8"
"Planning Time: 0.134 ms"
"Execution Time: 0.147 ms"

2. "Bitmap Heap Scan on place  (cost=4.53..491.25 rows=4 width=176) (actual time=0.067..0.130 rows=11 loops=1)"
"  Filter: st_dwithin('0101000020E61000009BA10271E8BE5F40631C6D663EB94240'::geography, coord, '2000'::double precision, true)"
"  Rows Removed by Filter: 16"
"  Heap Blocks: exact=5"
"  Buffers: shared hit=12"
"  ->  Bitmap Index Scan on coord_index  (cost=0.00..4.53 rows=17 width=0) (actual time=0.055..0.056 rows=27 loops=1)"
"        Index Cond: (coord && _st_expand('0101000020E61000009BA10271E8BE5F40631C6D663EB94240'::geography, '2000'::double precision))"
"        Buffers: shared hit=7"
"Planning Time: 0.116 ms"
"Execution Time: 0.150 ms"

3. "Bitmap Heap Scan on place  (cost=4.59..141.82 rows=40 width=176) (actual time=0.133..0.167 rows=84 loops=1)"
"  Recheck Cond: ('0102000000020000009A081B9E5EC15F404CA60A4625BD4240EA95B20C71BC5F40BEC1172653B54240'::geometry ~ coord2)"
"  Heap Blocks: exact=12"
"  Buffers: shared hit=20"
"  ->  Bitmap Index Scan on coord2_index  (cost=0.00..4.58 rows=40 width=0) (actual time=0.126..0.126 rows=84 loops=1)"
"        Index Cond: (coord2 @ '0102000000020000009A081B9E5EC15F404CA60A4625BD4240EA95B20C71BC5F40BEC1172653B54240'::geometry)"
"        Buffers: shared hit=8"
"Planning Time: 0.060 ms"
"Execution Time: 0.199 ms"
  • PostgreSQL 14.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit
  • POSTGIS="3.1.7 aafe1ff" [EXTENSION] PGSQL="140" GEOS="3.9.1-CAPI-1.14.2" PROJ="8.0.1" LIBXML="2.9.1" LIBJSON="0.15" LIBPROTOBUF="1.3.2" WAGYU="0.5.0 (Internal)"

CodePudding user response:

Summing up the comments, to close the thread with an answer:

  1. "Execution Time: 0.147 ms" reads zero point one four seven of a millisecond, which is 147 microseconds.

  2. Reproducibility/consistency: if you run the test once and one query does better, it can still do worse the next time you run it. Use pgbench to run your tests many times over, so that the noise averages out (see the pgbench sketch after this list).

  3. Sample size: the queries above touch fewer than a hundred rows each, so the differences in execution times are close to noise. You can spawn more samples by duplicating and shifting your data around. Demo:

    -- seed table with a single sample point (run this in a scratch database)
    create table place (id serial primary key, coord2 geometry(Point,4326));
    insert into place (coord2)
    values ('SRID=4326;POINT(126.9829371 37.4472168)'::geometry);

    -- 1000000 test records in random places within 30 degrees of the
    -- sample point; a_heavy_column simulates a realistic row width
    create table tests as
    select tests.test_case_id,
           repeat(gen_random_uuid()::text, 30) as a_heavy_column,
           ST_Translate(coord2, random()*60 - 30, random()*60 - 30) as test_geom
    from place p
    cross join generate_series(1, 1e6, 1) tests(test_case_id);

    -- the distance is in degrees here, because test_geom is a geometry
    explain analyze
    select *
    from tests
    where st_dwithin('SRID=4326;POINT(126.9829371 37.4472168)'::geometry,
                     test_geom,
                     5);
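
    -- not in the original demo: to mirror the question's setup, add a GiST
    -- index (the name is illustrative) and re-run the explain analyze above
    -- to compare the un-indexed and indexed plans on the bigger sample
    create index tests_geom_idx on tests using gist (test_geom);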
    

    This also lets you see how things scale. A query running in 1 second is "twice as fast" as one running in 2 seconds, but that may no longer hold when you run against a bigger sample: the difference can become negligible or even go the other way, if the second query scales better.

  4. Controlled environment (know what you're measuring): even if the platform is the same, it is a challenge to set up equivalent configurations for two different RDBMSs and be sure they get the same level of access to the platform's resources. Restrict other activity on the host and other traffic on the databases, and run the tests in-db or at least locally to eliminate network latency.
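
A minimal pgbench sketch for point 2 (the script and database names are assumptions, not from the post): put the query under test in a file and let pgbench repeat it. The reported "latency average", unlike explain analyze's "Execution Time", includes client and protocol overhead.

    -- bench_postgis.sql (file name assumed): the query under test
    select * from place
    where coord2 && st_setsrid(st_makebox2d(ST_Point(127.0214, 37.4777),
                                            ST_Point(126.9444, 37.4166)), 4326);

    -- then, from a shell on the database host:
    --   pgbench -n -f bench_postgis.sql -T 30 -c 1 your_database
    -- -n skips pgbench's own vacuum step, -f points at the script,
    -- -T 30 repeats the query for 30 seconds, -c 1 uses a single client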
