How to perform scheduling using Python?


I am trying to schedule a few jobs inside my Python application. Supposedly, the text from the logging should appear every 1 minute and every 5 minutes from the jobs.py file inside my Docker container. However, the text is appearing every 2 minutes inside the Docker container. Is there a clash between the Python schedule and the cron jobs?

Current Output inside the docker container

13:05:00 [I] werkzeug 172.20.0.2 - - [08/May/2022 13:05:00] "GET /reminder/send_reminders HTTP/1.1" 200 -

13:06:00 [I] werkzeug 172.20.0.2 - - [08/May/2022 13:06:00] "GET /feeds/update_feeds HTTP/1.1" 200 -

13:07:00 [D] schedule Running job Job(interval=1, unit=minutes, do=job_feeds_update, args=(), kwargs={})

13:07:00 [I] jobs job_feeds_update

13:07:00 [I] werkzeug 172.20.0.2 - - [08/May/2022 13:07:00] "GET /feeds/update_feeds HTTP/1.1" 200 -

13:08:00 [I] werkzeug 172.20.0.2 - - [08/May/2022 13:08:00] "GET /feeds/update_feeds HTTP/1.1" 200 -

13:09:00 [D] schedule Running job Job(interval=1, unit=minutes, do=job_feeds_update, args=(), kwargs={})

13:09:00 [I] jobs job_feeds_update

13:09:00 [I] werkzeug 172.20.0.2 - - [08/May/2022 13:09:00] "GET /feeds/update_feeds HTTP/1.1" 200 -

13:10:00 [I] werkzeug 172.20.0.2 - - [08/May/2022 13:10:00] "GET /feeds/update_feeds HTTP/1.1" 200 -

13:10:00 [I] werkzeug 172.20.0.2 - - [08/May/2022 13:10:00] "GET /reminder/send_reminders HTTP/1.1" 200 -

13:11:00 [D] schedule Running job Job(interval=1, unit=minutes, do=job_feeds_update, args=(), kwargs={})

13:11:00 [I] jobs job_feeds_update

13:11:00 [D] schedule Running job Job(interval=5, unit=minutes, do=job_send_reminders, args=(), kwargs={})

13:11:00 [I] jobs job_send_reminders

server.py

#Cron Job
@app.route('/feeds/update_feeds')
def update_feeds():
    schedule.run_pending()
    return 'OK UPDATED FEED!'
   
@app.route('/reminder/send_reminders')
def send_reminders():
    schedule.run_pending()
    return 'OK UPDATED STATUS!'
    

jobs.py

def job_feeds_update():
    update_feed()
    update_feed_eng()
    logger.info("job_feeds_update")
        
schedule.every(1).minutes.do(job_feeds_update)

# send email reminders
def job_send_reminders():
    send_reminders()
    logger.info("job_send_reminders")
schedule.every(5).minutes.do(job_send_reminders) 

Docker File

FROM alpine:latest

# Install curl
RUN apk add --no-cache curl

# Copy Scripts to Docker Image
COPY reminders.sh /usr/local/bin/reminders.sh
COPY feeds.sh /usr/local/bin/feeds.sh

RUN echo ' */5  *  *  *  * /usr/local/bin/reminders.sh' >> /etc/crontabs/root
RUN echo ' *  *  *  *  * /usr/local/bin/feeds.sh' >> /etc/crontabs/root

# Run crond  -f for Foreground 
CMD ["/usr/sbin/crond", "-f"]

CodePudding user response:

I think you're running into a couple of issues:

  1. As you suspected, your schedule is on a different schedule/interval than your cron job. They're out of sync (and you can't ever expect them to be in sync for the next reason). From the moment your jobs.py script was executed, that's the starting point from which the schedule counts the intervals.

i.e. if you're running something every minute but the jobs.py script starts 30 seconds into the current minute (e.g. at 01:00:30, i.e. 30 seconds past 1:00 am), then the scheduler will run the job at 01:01:30, then 01:02:30, then 01:03:30 and so on.

  2. Schedule doesn't guarantee you precise frequency execution. When the scheduler runs a job, the job execution time is not taken into account. So if you schedule something like your feeds/reminders jobs, it could take a little bit to process. Once it's finished running, the scheduler decides that the next job will only run 1 minute after the end of the previous job. This means your execution time can throw off the schedule.

Try running this example in a Python script to see what I'm talking about:

# Schedule library imported
import schedule
import time
from datetime import datetime

def geeks():
    now = datetime.now()  # timestamp taken when the job starts
    date_time = now.strftime("%m/%d/%Y, %H:%M:%S")
    time.sleep(5)  # pretend there's a blocking API call that takes 5 seconds
    print(date_time + " - Look at the timestamp")

# Task scheduling:
# geeks() is called every second.
schedule.every(1).seconds.do(geeks)

# Loop so that the scheduling task
# keeps on running all the time.
while True:

    # Checks whether a scheduled task
    # is pending to run or not
    schedule.run_pending()
    time.sleep(0.1)

We've scheduled the geeks function to run every second. But if you look at the geeks function, I've added a time.sleep(5) to pretend that there may be some blocking API call here that can take 5 seconds. Then observe the timestamps logged - you'll notice they're not always consistent with the schedule we originally wanted!

Now onto how your cron job and scheduler are out of sync

Look at the following logs:

13:07:00 [D] schedule Running job Job(interval=1, unit=minutes, do=job_feeds_update, args=(), kwargs={})
13:07:00 [I] jobs job_feeds_update
13:07:00 [I] werkzeug 172.20.0.2 - - [08/May/2022 13:07:00] "GET /feeds/update_feeds HTTP/1.1" 200 -

# minute 8 doesn't trigger the schedule for feeds

13:09:00 [D] schedule Running job Job(interval=1, unit=minutes, do=job_feeds_update, args=(), kwargs={})
13:09:00 [I] jobs job_feeds_update
13:09:00 [I] werkzeug 172.20.0.2 - - [08/May/2022 13:09:00] "GET /feeds/update_feeds HTTP/1.1" 200 -

What's likely happening here is as follows:

  • at 13:07:00, your cron sends the request to feed items

  • at 13:07:00, the job schedule has a pending job for feed items

  • at 13:07:00, the job finishes and schedule decides the next job can only run 1 minute from now, which is roughly ~13:08:01 (note the 01 - this accounts for the milliseconds/timing of the job execution; let's assume it took 1 second to run the feed items update)

  • at 13:08:00, your cron job triggers the request asking schedule run_pending jobs.

  • at 13:08:00 however, there are no pending jobs to run because the next time feed items can run is 13:08:01 which is not right now.

  • at 13:09:00, your crontab triggers the request again

  • at 13:09:00, there is a pending job available that should've run at 13:08:01 so that gets executed now.

I hope this illustrates the issue you're running into: cron and schedule are out of sync. This issue will become worse in a production environment. You can read more about Parallel execution for schedule as a means to keep things off the main thread, but that will only go so far. Let's talk about...

Possible Solutions

  1. Use run_all from schedule instead of run_pending to force jobs to trigger, regardless of when they're actually scheduled to run.

But if you think about it, this is no different than simply calling job_feeds_update straight from your API route itself. This isn't a bad idea by itself but it's still not super clean as it will block the main thread of your API server until the job_feeds_update is complete, which might not be ideal if you have other routes that users need.

You could combine this with the next suggestion:

  2. Use a job queue and threads. Check out the second example on the Parallel Execution page of schedule's docs. It shows you how to use a job queue and threads to offload jobs.

Because you run schedule.run_pending(), the main thread in your server is blocked until the jobs run. By using threads (and the job queue), you can keep placing jobs on the queue and avoid blocking the main server thread with your jobs. This should optimize things a little bit further for you by letting jobs continue to be scheduled.

  3. Use ischedule instead, as it takes into account the job execution time and provides precise schedules: https://pypi.org/project/ischedule/. This might be the simplest solution for you in case 1 and 2 end up being a headache!

  4. Don't use schedule and simply have your cron jobs hit a route that just runs the actual function (so basically counter to the advice of using 1 and 2 above). The problem with this is that if your functions take longer than a minute to run for feed updates, you may have multiple overlapping cron jobs running at the same time doing feed updates. So I'd recommend not doing this and relying on a mechanism to queue/schedule your requests with threads and jobs. Only mentioning this as a potential scenario of what else you could do.
