Why does dask read_sql_table sometimes slow down?


When querying data through FastAPI, the query sometimes takes more than twice as long.

Here is part of the DataFrame query code.

import dask.dataframe as dd
from sqlalchemy.engine import URL

connection_url = URL.create(
    "mssql pyodbc",
    username="aabcc",
    password="12345",
    host="127.0.0.1",
    port=2712,
    database="test",
    query={
        "driver": "ODBC Driver 17 for SQL Server",
        "Trusted_Connection": "yes",
    },
)

def get_data():
    df = dd.read_sql_table(table='troya',
                           uri=connection_url, index_col='no')
    df = df.compute()
    
    return df

Here is part of the FastAPI route code.

@bp.get("/test/{row}")
def test_get(request: Request, row):
    df = get_data()
    ...

I would appreciate it if you could tell me why this problem occurs.

CodePudding user response:

One of the core advantages of dask is its ability to distribute and coordinate a workload across multiple workers. That advantage disappears when dask is used to load data and immediately compute it, so the following two lines in the snippet above are a bit of an antipattern:

    df = dd.read_sql_table(table='troya',
                           uri=connection_url, index_col='no')
    df = df.compute()

What happens is that the work is distributed across the workers, but the resulting partitions then all have to be transferred back to a single client process. The right solution depends on your use case: if it's possible to continue the work in a parallel/distributed fashion, then dask might still be handy; but if the workflow has to be sequential, then pandas/SQLAlchemy might be a more appropriate choice.
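For example, if the work can stay inside the graph, push the filtering or aggregation through before calling .compute(), so that only the reduced result travels back to the client. A minimal sketch of that pattern; the column name amount is hypothetical:

    ddf = dd.read_sql_table(table='troya',
                            uri=connection_url, index_col='no')
    # the filter and the mean run on the workers; only a scalar crosses the wire
    result = ddf[ddf['amount'] > 0]['amount'].mean().compute()

If, on the other hand, the endpoint genuinely needs the whole table in memory, a plain pandas/SQLAlchemy read avoids the distribute-then-collect round trip entirely. A sketch reusing the connection_url from the question (the SELECT statement is illustrative):

    import pandas as pd
    from sqlalchemy import create_engine

    # create_engine accepts the URL object built in the question
    engine = create_engine(connection_url)

    def get_data():
        # one sequential read: no workers, no worker-to-client transfer
        return pd.read_sql("SELECT * FROM troya", engine, index_col="no")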
