SQLAlchemy queries running twice, only on a separate thread with long execution time-CodePudding

My application creates a Flask app as well as a background process that does work with my MySQL database (through SQLAlchemy) every so often:

from task_manager import TaskManager
# Session is a sessionmaker created earlier
task_manager = TaskManager(timedelta(seconds = 1), timedelta(seconds = 1), Session)
threading.Thread(target=task_manager.scheduler_loop).start()

app.run(debug=True, host='0.0.0.0', port=5000)

Whenever this process finds an available task (this is in the scheduler_loop that's running in the separate thread), it does some work:

with db_session(self.Session) as session:
    task = session.query(Task).filter(or_(Task.date == None, Task.date <= datetime.now())).order_by(Task.priority).first()

    if task is not None:
        if task.type == "create_paper_task":
            self.create_paper(session, task.paper_title)
        elif task.type == "update_citations_task":
            self.update_citations(session, task.paper)

        session.delete(task)

...

def create_paper(self, session, paper_title):

    ...
    
    # For the purposes of testing, I replaced a long API call with this sleep.
    time.sleep(3)

    paper = Paper(paper_title, year)
    paper.citations.append(Citation(citations, datetime.now()))
    session.add(paper)

If I try to use this code, the SQLAlchemy queries are run twice. Two Paper objects are created, and I get this error (presumably the Task being deleted twice):

/app/task_manager.py:17: SAWarning: DELETE statement on table 'create_paper_task' expected to delete 1 row(s); 0 were matched.  Please set confirm_deleted_rows=False within the mapper configuration to prevent this warning.

The actual code itself isn't running twice, and there definitely aren't multiple scheduler threads running: I've tested this using print statements.

Now, the weirdest part about this is that the issue ONLY occurs when

There's a long wait during the execution. If the time.sleep is removed, there's no problem, and
The Flask app is running and the scheduler loop is running in a separate thread. If the Flask app isn't running, or the scheduler_loop is running in the main thread (so obviously the Flask app isn't running), then there's no problem.

Also, the Flask app isn't being used at all while I'm testing this, so that's not the issue.

CodePudding user response：

The app.run function of Flask will run your initialization code twice when you set debug=True. This is part of the way Flask can detect code changes and dynamically restart as needed. The downside is that this is causing your thread to run twice which in turn creates a race condition on reading and executing your tasks, which indeed would only show up when the task takes long enough for the second thread to start working.

See this question/answer for more details about what is happening: Why does running the Flask dev server run itself twice?

To avoid this you could add code to avoid the second execution, but that has the limitation that the auto-reloading feature for modified code will no longer work. In general, it would probably be better to use something like Celery to handle task execution instead of building your own solution. However, as mentioned in the linked answer, you could use something like

from werkzeug.serving import is_running_from_reloader
if is_running_from_reloader():
    from task_manager import TaskManager
    task_manager = TaskManager(timedelta(seconds = 1), timedelta(seconds = 1), Session)
    threading.Thread(target=task_manager.scheduler_loop).start()

which would keep your thread from being created unless you are in the second (reloaded) process. Note this would prevent your thread from executing at all if you remove debug=True.