We recently moved our database to a centralised server for our whole team to use the same data source instead of using dumps from the database and spinning it up in a docker container.
Recently, I noticed that when executing the same SQL-query multiple times on the server returns different results whereas when using the local backup in docker returns the same results when executed multiple times.
The length of the results is the same, but we require a deterministic result of the query which is not given with the server.
This is the code that I used in a jupyter notebook to compare those two queries:
sql_query = """
SELECT * FROM reports.consumption
WHERE name LIKE '%%electricity%%'
"""
server_df = pd.read_sql(sql_query,
con="<my_credentials_to_server>")
display(server_df.tail(10))
docker_df = pd.read_sql(sql_query,
con="<my_credentials_to_local_docker>")
display(docker_df.tail(10))
When executing this code multiple times I receive the same tail in the docker version of the database, but not on the server. Why is that the case? Is there some randomness involved?
CodePudding user response:
Your SQL query does not enforce any order. This means that the DB engine is free to choose any. If the same result is returned with a different order from the two deployments, your tail
will likely be different.
I can imagine a number of reasons why two deployments of the same database may return the same result with a different order, including:
- they may run different versions, which make different algorithmic choices
- the choice of algorithm or parameters may depend on the specific hardware / software / load at hand
- the execution plan may choose randomly between two algorithms that have been estimated to be equally expensive.
I recommend to add an ORDER BY
clause to enforce a specific order on all deployments, or to sort your dataframe in Python.