I have a requirement to run a cron job weekly on only a single instance of a Java Spring service that runs on multiple AWS instances in a cluster. I have the freedom to spawn an instance of the same service in a different cluster to handle the special load of the cron job. I am thinking of implementing an SQS listener module in the service that gets triggered once a specially configured Lambda publishes an SQS message for the new cluster. Since this Lambda will only publish to the queue of the new instance, it should be possible to ensure that at most one cron job is running at a time (so data is processed once), with dedicated cluster resources. If we want to keep the application-level code and config the same across all instances, is this approach a good way to achieve the cron job schedule as specified?
CodePudding user response:
I’m a little bit “rusty” with AWS (I last worked with it a couple of years ago), but in general what you describe seems to be a good solution, with the following caveats:
You should make sure that Amazon SQS provides “exactly once” semantics, because otherwise you might end up receiving the message twice in some cases. Standard queues only guarantee at-least-once delivery; FIFO queues support exactly-once processing, but you have to opt in to them and the pricing differs slightly, so you should check.
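One way to reinforce deduplication on a FIFO queue is to have the Lambda set a `MessageDeduplicationId` derived from the scheduled week, so that a retried publish within SQS's 5-minute deduplication window collapses into one message. The helper below is a hypothetical sketch of deriving such an ID; the name and format are illustrative, not part of any AWS API:

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;

// Hypothetical helper: derives a deterministic deduplication ID for the
// weekly run, e.g. "weekly-job-2024-W07". On a FIFO queue, SQS drops any
// message that repeats a MessageDeduplicationId within the 5-minute
// deduplication window, so a retried Lambda publish in that window would
// not trigger the job twice.
class DedupId {
    static String forWeek(LocalDate date) {
        int week = date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        int year = date.get(IsoFields.WEEK_BASED_YEAR);
        return String.format("weekly-job-%d-W%02d", year, week);
    }
}
```

Any date inside the same ISO week produces the same ID, which is what makes the deduplication deterministic across Lambda retries.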
Make sure you handle any exceptions that arise during job execution properly, so that the job won’t be re-executed on another instance: if an exception causes the SQS client to return the message to the queue (for example, by not deleting it before the visibility timeout expires), another consumer will pick it up.
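A skeleton of that decision might look like the following. The names are illustrative (this is not a real spring-cloud-aws API); the point is to decide explicitly whether a failed run should let the message return to the queue or be acknowledged anyway:

```java
// Hypothetical listener skeleton: a caught exception means the message is
// acknowledged (deleted) and the run is skipped until the next schedule;
// returning RETRY instead would let the message reappear after the
// visibility timeout, so another instance may re-run the job.
class WeeklyJobListener {
    enum Outcome { ACKNOWLEDGE, RETRY }

    Outcome onMessage(Runnable job) {
        try {
            job.run();
            return Outcome.ACKNOWLEDGE;  // job succeeded: delete the message
        } catch (RuntimeException e) {
            // A business-logic failure that a re-run won't fix: acknowledge
            // anyway so the message doesn't bounce between instances.
            // A transient failure could return RETRY here instead.
            return Outcome.ACKNOWLEDGE;
        }
    }
}
```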
What happens if the instance stops during job execution? That can of course happen, so decide on the desired behavior: re-run the job on another instance, or let it go and rely on the fact that the next time the job runs it will “cover up” for the previous period as well. This depends on your actual application logic.
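The “cover up” option can be sketched like this, under the assumption that the job persists the timestamp of its last successful run (in a DB row, an S3 object, or similar). Each run then processes everything since that timestamp instead of a fixed one-week window, so a missed week is picked up automatically:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Illustrative sketch of the "cover up" strategy: each run processes the
// interval [lastSuccess, now). If an instance dies mid-job and a run is
// lost, the next weekly run's window simply grows to include the missed
// period. How lastSuccess is persisted is up to the application.
class ProcessingWindow {
    final Instant from;
    final Instant to;

    ProcessingWindow(Instant from, Instant to) {
        this.from = from;
        this.to = to;
    }

    static ProcessingWindow since(Instant lastSuccess, Instant now) {
        return new ProcessingWindow(lastSuccess, now);
    }

    long daysCovered() {
        return ChronoUnit.DAYS.between(from, to);
    }
}
```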
Your application will depend on an “external scheduler” (implemented via Lambda), so you won’t have any cron-triggering logic in the application itself. This is just something to be aware of, not something you can avoid with this design. It might be a good thing or not, depending on your environment: for example, if you want to test the job scheduling in CI, you need the Lambda deployed and able to actually send the message that triggers the job execution, and you also need an SQS queue available.
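One way to soften that testability concern is to hide the trigger behind a small interface: production wires it to the SQS listener, while CI invokes the job directly without Lambda or SQS. This is a design sketch with illustrative names, not a prescribed Spring pattern:

```java
// Production code implements this via the SQS listener; tests can fire
// it directly, so CI doesn't need Lambda or SQS deployed just to verify
// the job logic.
interface JobTrigger {
    void onTrigger();
}

class DirectTrigger {
    static void fire(JobTrigger trigger) {
        trigger.onTrigger();  // synchronous, in-process trigger for tests
    }
}
```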
So again, I believe you can make this work in general. Depending on your application architecture, other solutions might also apply: Kubernetes Jobs, Redis, or any form of distributed lock/cache to coordinate which node actually runs the job, among other things.
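The coordination alternative boils down to a lock that all instances race for before running the job: only the winner runs it. The sketch below illustrates the idea with a single in-process `AtomicBoolean`; in a real cluster the lock would live in shared state such as Redis (`SET NX`), a database row, or a library like ShedLock:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// In-memory stand-in for a distributed lock: tryAcquire succeeds for
// exactly one caller until release() is called. A production version
// would also need a lease/expiry so a crashed holder doesn't block the
// job forever.
class JobLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    boolean tryAcquire() {
        return held.compareAndSet(false, true);
    }

    void release() {
        held.set(false);
    }
}
```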