I'm working on PostgreSQL in dbeaver. The database has a column addr:country
and a column addr:city
. The data has around 500 million rows so I have to do a ramdom sampling for testing. I intended to randomly select 1% of the data. However, the data itself could be highly biased (as there are big countries, and small countries, thus there are more rows for big countries, and less for smaller countries), and I'm thinking of a way to sample fairly. So I want to randomly select one or two rows from each city in each country.
The script I'm using is modified from someone else's query, and my script is:
SELECT osm_id, way, tags, way_centroid, way_area, calc_way_area, area_diff, area_prct_diff, calc_perimeter, calc_count_vertices, building, "building:part", "type", amenity, landuse, tourism, office, leisure, man_made, "addr:flat", "addr:housename", "addr:housenumber", "addr:interpolation", "addr:street", "addr:city", "addr:postcode", "addr:country", length, width, height, osm_uid, osm_user, osm_version
ROW_NUMBER() OVER ( PARTITION BY "addr:country", "addr:city" ) AS "cell_rn",
COUNT(*)
OVER ( PARTITION BY "addr:country", "addr:city") AS "cell_cnt"
FROM osm_qa.buildings
WHERE "addr:city" IS NOT NULL
AND "addr:country" IS NOT NULL
It returns error message: SQL Error [42601]: ERROR: syntax error at or near "(" Position: 1683
.
I am very new to SQL so probably there are many mistakes in the script. Is there any way to enforce random selection of one/two rows from each addr:city
in each addr:country
?
CodePudding user response:
you can use the window function dense_rank() to randomly number records in a partition:
with base_data as
(
SELECT osm_id, way, tags, way_centroid, way_area, calc_way_area, area_diff, area_prct_diff, calc_perimeter, calc_count_vertices, building, "building:part", "type", amenity, landuse, tourism, office, leisure, man_made, "addr:flat", "addr:housename", "addr:housenumber", "addr:interpolation", "addr:street", "addr:city", "addr:postcode", "addr:country", length, width, height, osm_uid, osm_user, osm_version,
ROW_NUMBER() OVER ( PARTITION BY "addr:country", "addr:city" ) AS "cell_rn",
COUNT(*) OVER ( PARTITION BY "addr:country", "addr:city") AS "cell_cnt",
dense_rank() over (partition by "addr:country", "addr:city" order by random()) as ranking,
FROM osm_qa.buildings
WHERE "addr:city" IS NOT NULL
AND "addr:country" IS NOT null
)
select
*
from base_data
where ranking between 1 and 2