I have a table (>500GB) from which I need to select 5000 random rows where table.condition = True and 5000 random rows where table.condition = False. My attempts so far used TABLESAMPLE, but unfortunately any WHERE clause is applied only after the sample has been generated. So the only way I see this working is by doing the following:
Generate 2 empty temporary_tables -- temporary_table_true and temporary_table_false -- with the structure of the main table, so I can add rows iteratively.
create temp temporary_table_true as select table.condition, table.b, table.c, ... table.z from table LIMIT 0;
create temp temporary_table_false as select table.condition, table.b, table.c, ... table.z from table LIMIT 0;
Create a loop that only stops when both of my temporary tables have 5000 rows.
Inside that loop, each iteration samples a batch of 100 random rows from table. From that batch I insert the rows with table.condition = True into temporary_table_true and the ones with table.condition = False into temporary_table_false.
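Roughly, the loop I have in mind looks like this (sketched in Python with an in-memory SQLite table standing in for the real one; the table and column names are made up, and ORDER BY RANDOM() fakes the batch sampling, since SQLite has no TABLESAMPLE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Small stand-in for the real >500GB table.
cur.execute("CREATE TABLE main_table (id INTEGER PRIMARY KEY, cond INTEGER, payload TEXT)")
cur.executemany("INSERT INTO main_table (cond, payload) VALUES (?, ?)",
                [(i % 2, "row%d" % i) for i in range(2000)])

# Step 2: empty temp tables with the structure of the main table.
cur.execute("CREATE TEMP TABLE temporary_table_true  AS SELECT * FROM main_table LIMIT 0")
cur.execute("CREATE TEMP TABLE temporary_table_false AS SELECT * FROM main_table LIMIT 0")

TARGET, BATCH = 50, 100   # 5000 and batches of 100 in the real thing
seen = set()              # avoid inserting the same row twice across batches

n_true = n_false = 0
while n_true < TARGET or n_false < TARGET:
    # Step 3: one batch of random rows (ORDER BY RANDOM() fakes TABLESAMPLE).
    batch = cur.execute("SELECT * FROM main_table ORDER BY RANDOM() LIMIT ?",
                        (BATCH,)).fetchall()
    for row in batch:
        if row[0] in seen:
            continue
        if row[1] == 1 and n_true < TARGET:
            cur.execute("INSERT INTO temporary_table_true VALUES (?, ?, ?)", row)
            seen.add(row[0]); n_true += 1
        elif row[1] == 0 and n_false < TARGET:
            cur.execute("INSERT INTO temporary_table_false VALUES (?, ?, ?)", row)
            seen.add(row[0]); n_false += 1
```

Note that without the seen set the same row could be sampled in two different batches and end up duplicated, which is part of what worries me about this approach.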
Could you guys give me some help here?
- Are there any better approaches?
- If not, any ideas on how I could code steps 2 and 3?
Thanks a lot!
CodePudding user response:
Add a column to your table and populate it with random numbers.
ALTER TABLE `table` ADD COLUMN rando FLOAT DEFAULT NULL;
UPDATE `table` SET rando = RAND() WHERE rando IS NULL;
Then do
SELECT *
FROM `table`
WHERE rando > RAND() * 0.9
AND condition = 0
ORDER BY rando
LIMIT 5000
Do it again for condition = 1
and Bob's your uncle. It will pull rows in random order starting from a random row.
A couple of notes:
- 0.9 is there to improve the chances you'll actually get 5000 rows and not some lesser number.
- You may have to add
LIMIT 1000
to the UPDATE statement and run it a whole bunch of times until the entire rando column is populated: trying to update all the rows in a big table in one statement can generate a huge transaction and swamp your server for a long time.
- If you need to generate another random sample, re-run the UPDATE (after setting rando back to NULL, or without the WHERE rando IS NULL filter) so the whole column is re-randomized.
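Scaled down, the whole approach looks like this against an in-memory SQLite stand-in (SQLite has no RAND(), so ABS(RANDOM())/9223372036854775807.0 derives a float in [0, 1), and the RAND() * 0.9 threshold is drawn client-side in Python; everything else mirrors the statements above):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, cond INTEGER)")
cur.executemany("INSERT INTO t (cond) VALUES (?)", [(i % 2,) for i in range(10_000)])

# ALTER TABLE ... ADD COLUMN rando, then populate it with random floats.
cur.execute("ALTER TABLE t ADD COLUMN rando FLOAT DEFAULT NULL")
cur.execute("UPDATE t SET rando = ABS(RANDOM()) / 9223372036854775807.0 "
            "WHERE rando IS NULL")

# SELECT ... WHERE rando > RAND() * 0.9 -- threshold drawn once, client-side here.
threshold = random.random() * 0.9
rows = cur.execute(
    "SELECT * FROM t WHERE rando > ? AND cond = 0 ORDER BY rando LIMIT 100",
    (threshold,)).fetchall()
```

With 5000 matching rows and the 0.9 safety factor, at least ~500 rows are expected above the threshold, so the LIMIT is comfortably satisfied.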
CodePudding user response:
The textbook solution would be to run two queries, one for rows with `true`
and one for rows with `false`:
SELECT * FROM mytable WHERE `condition`=true ORDER BY RAND() LIMIT 5000;
SELECT * FROM mytable WHERE `condition`=false ORDER BY RAND() LIMIT 5000;
The WHERE clause applies first, to reduce the matching rows, then it sorts the subset of rows randomly and picks up to 5000 of them. The result is a random subset.
This solution has the advantage that it returns a fairly evenly distributed set of random rows, automatically handles an unknown proportion of true to false values in the table, and even works when one of the condition values matches fewer than 5000 rows.
The disadvantage is that it's incredibly expensive to sort such a large set of rows, and an index does not help you sort by a nondeterministic expression like RAND().
You could do this with window functions if you need it to be a single SQL query, but it would still be very expensive.
SELECT t.*
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY `condition` ORDER BY RAND()) AS rownum
FROM mytable
) AS t
WHERE t.rownum <= 5000;
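The window-function query behaves the same way on a small SQLite stand-in (requires SQLite 3.25+ for window functions; RANDOM() replaces MySQL's RAND(), and the targets are scaled down to 50 per group):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, cond INTEGER)")
cur.executemany("INSERT INTO mytable (cond) VALUES (?)", [(i % 2,) for i in range(1_000)])

# Number rows randomly within each condition group, keep the first 50 of each.
rows = cur.execute("""
    SELECT t.*
    FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY cond ORDER BY RANDOM()) AS rownum
        FROM mytable
    ) AS t
    WHERE t.rownum <= 50
""").fetchall()
```

One query returns exactly 50 rows of each condition value, but the subquery still has to randomly number every row in the table, which is why this stays expensive at 500GB scale.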
Another alternative that does not use a random sort operation would be to do a table-scan, and pick a random subset of rows. But you need to know roughly how many rows match each condition value, so that you can estimate the fraction of these that would make ~5000 rows. Say for example there are 1 million rows with the true value and 500k rows with the false value:
SELECT * FROM mytable WHERE `condition`=true AND RAND()*1000000 < 5000;
SELECT * FROM mytable WHERE `condition`=false AND RAND()*500000 < 5000;
This is not guaranteed to return exactly 5000 rows, because of the randomness. But probably pretty close. And a table-scan is still quite expensive.
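The variance is small, though; a quick simulation of the RAND()*1000000 < 5000 filter (each row kept independently with probability 0.005) shows how close to 5000 the count tends to land:

```python
import random

random.seed(1)  # reproducible run
N, WANT = 1_000_000, 5000

# Emulate: WHERE RAND() * N < WANT, i.e. keep each row with probability WANT/N.
kept = sum(1 for _ in range(N) if random.random() * N < WANT)

# kept follows Binomial(N, WANT/N): mean 5000, standard deviation ~70,
# so the count almost always lands within a few hundred of the target.
print(kept)
```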
The answer from O.Jones gives me another idea. If you can add a column, then you can add an index on that column.
ALTER TABLE `table`
ADD COLUMN rando FLOAT DEFAULT NULL,
ADD INDEX (`condition`, rando);
UPDATE `table` SET rando = RAND() WHERE rando IS NULL;
Then you can use indexed searches. Again, you need to know how many rows match each value to do this.
SELECT * FROM mytable
WHERE `condition`=true AND rando < 5000/1000000
ORDER BY `condition`, rando
LIMIT 5000;
SELECT * FROM mytable
WHERE `condition`=false AND rando < 5000/500000
ORDER BY `condition`, rando
LIMIT 5000;
The ORDER BY in this case should be a no-op if the index I added is used. The rows will be read in index order anyway, and MySQL's optimizer will not do any work to sort them.
This solution will be much faster, because it doesn't have to sort anything, and doesn't have to do a table-scan. MySQL has an optimization to bail out of a query once the LIMIT has been satisfied.
But the disadvantage is that it doesn't return a different random result when you run the SELECT again, or if different clients run the query. You would have to use UPDATE to re-randomize the whole table to get a different result. This might not be suitable depending on your needs.
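End to end on a SQLite stand-in, it looks like this (ABS(RANDOM())/9223372036854775807.0 substitutes for RAND(); the fraction is doubled as a safety margin, in the same spirit as the 0.9 factor in the other answer, because rando < k/n only matches ~k rows on average and can fall short of the LIMIT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, cond INTEGER)")
# 7500 true rows, 2500 false rows -- the counts must be known beforehand.
cur.executemany("INSERT INTO t (cond) VALUES (?)", [(1,)] * 7500 + [(0,)] * 2500)

cur.execute("ALTER TABLE t ADD COLUMN rando FLOAT DEFAULT NULL")
cur.execute("CREATE INDEX idx_cond_rando ON t (cond, rando)")
cur.execute("UPDATE t SET rando = ABS(RANDOM()) / 9223372036854775807.0 "
            "WHERE rando IS NULL")

WANT, N_TRUE = 100, 7500
# Indexed range scan on (cond, rando): no sort, no full table-scan.
rows = cur.execute(
    "SELECT * FROM t WHERE cond = 1 AND rando < ? ORDER BY cond, rando LIMIT ?",
    (2 * WANT / N_TRUE, WANT)).fetchall()
```

With the doubled fraction the expected number of candidates is ~200 (standard deviation ~14), so getting the full 100 rows back is all but certain.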