I need to process thousands of files stored in an S3 bucket. My idea was to spin up a bunch of EC2 VMS (96 cores each) and have them contact a MySQL database via python to get 96 S3 paths to the files it needs to process. I also want to log if it is in progress or has been processed with an IN_PROGRESS AND PROCESSED column set to 0 or 1. I have the current statement, the problem I am running into is that it updates the IN_PROGRESS value, but I am not sure how to select/return the 96 filenames needed to process. I cannot run SELECT * FROM temp, after running the update it returns an error.
UPDATE dev_gba.`dev-l1c` as dev
SET dev.IN_PROGRESS = 1
WHERE dev.`index` in (SELECT * FROM (SELECT `index`
FROM dev_gba.`dev-l1c`
WHERE IN_PROGRESS =0 AND PROCESSED =0
ORDER BY RAND()
LIMIT 96) temp );
My other idea was to get the EC2 instance id using python, update an INSTANCE_ID column and then use another statement to get the rows with that specific instance id. The other challenge I can't seem to figure out is preventing other instances from selecting the same rows. If I have 10 instances querying the database at the same time for 96 random rows I am concerned there might be some overlap. My thought to solve this were to add some random wait time in my python script.
CodePudding user response:
Your question isn't super clear about where you get these instances, so I'm guessing a bit.
Try some SQL like this to avoid race conditions. Be sure to use MySQL's InnoDB storage engine; MyISAM and AriaDB don't support transactions.
BEGIN TRANSACTION;
/* get the indexes to process */
CREATE TEMPORARY TABLE to_process
SELECT index FROM dev_gba.`dev-l1c`
WHERE IN_PROGRESS = 0
AND PROCESSED =0
ORDER BY RAND()
LIMIT 96;
/* do your update */
UPDATE dev_gba.`dev-l1c` SET IN_PROGRESS = 1
WHERE index IN (SELECT index FROM to_process);
/* retrieve the items to process */
SELECT *
FROM dev_gba.`dev-l1c`
WHERE index IN (SELECT index FROM to_process);
/* get rid of the temp table */
DROP TABLE to_process;
/* end the transaction */
COMMIT;
Something to consider: AWS virtual machines are often I/O and network bound. You may get better throughput by using smaller, and dramatically cheaper, instances, and using less concurrency per instance. Plus, there's plenty of OS work to do when you're doing a lot of concurrent file or network operations. Leave at least a core or two for that work.