I am trying to add a very basic SQL query in my flask API project. I am using SQLAlchemy as the database manipulation tool.
The query I want to run is the following:
SELECT * from trip_metadata where trip_id in ('trip_id_1', 'trip_id_2', ..., 'trip_id_n')
So, in my code, I wrote:
trips_ids = ['trip_id_1', 'trip_id_2', ..., 'trip_id_n']
result = session.query(dal.trip_table).filter(dal.trip_table.columns.trip_id.in_(trips_ids)).all()
When n is low, let'say n=10, it works very well. I get the expected result. However, when n is high, let's say n > 1000, it crashes. I am very surprised as it seems usual to put many values in the filter.
from sqlalchemy import text
result = session.execute(text(f"SELECT * FROM trip_metadata where trip_id in {trip_ids_tuple}"))
The error log is:
sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('07002', '[07002] [Microsoft][ODBC Driver 17 for SQL Server]COUNT field incorrect or syntax error (0) (SQLExecDirectW)')')
[SQL: SELECT * FROM trip_metadata
WHERE trip_metadata.trip_id IN (?, ?, ..., ?)]
[parameters: ('ABC12345-XXXX-XXXX-XXXX-000000000000', 'DEF12345-XXXX-XXXX-XXXX-000000000000', ..., 'GHI12345-XXXX-XXXX-XXXX-000000000000')]
(Background on this error at: https://sqlalche.me/e/14/dbapi)
127.0.0.1 - - [05/Jan/2023 10:35:48] "POST /api/v1/tripsAggregates HTTP/1.1" 500 -
However when I write the raw request, it works well, even when n is very high:
from sqlalchemy import text
trip_ids_tuple = ('trip_id_1', 'trip_id_2', ..., 'trip_id_n')
result = session.execute(text(f"SELECT * FROM trip_metadata where trip_id in {trip_ids_tuple}"))
But I don't think this is a good way of doing because I have much more complex requests to write and using sqlalchemy filters is more adapted.
Do you have any idea to fix my issue keeping using sqlalchemy library ? Thank you very much
CodePudding user response:
From the error, it could indicate a formatting issue (escape string characters properly?). This can happen when N is small, the data that breaks the formatting has low chance of occurring. When N gets large, there's more likelihood there is "bad data" that sqlalchemy tries to put into the query. Can't say for certain here, it might be a memory or operating system issue too.
First thing to ask here, do you need to have tuples provided externally into the query? Is there a way to query for the the
trip_ids
via a join? Usually it's best to push operations to the SQL engine, but this isn't always possible if you're getting the tuples/lists of id's elsewhere.Rule out if there's a data issue that errors out during the
execute()
. Look into escaping the string values. You can try chunking the list into smaller bits to narrow down the potentially problematic values (see appendix below)
Try a different way to string format the query.
sql = f"SELECT * FROM trip_metadata where trip_id in ({','.join(trip_ids_tuple)})"
sql
output str value:
'SELECT * FROM trip_metadata where trip_id in (trip_id_1,trip_id_2,trip_id_n)'
Appendix:
You can potentially build in a chunker mechanism to break. The crashing might be a operating system or memory issue. For Example, you can use list slicing:
def chunker(seq, size):
return (seq[pos:pos size] for pos in range(0, len(seq), size))
# Example usage
my_list = [1, 2, 3, 4, 5, 6, 7, 8]
chunk_size = 3
for chunk in chunker(my_list, chunk_size):
print(chunk)
output
[1, 2, 3]
[4, 5, 6]
[7, 8]
Use a chunk size where N is manageable. This will also help narrow down potentially problematic str values that errors out.
CodePudding user response:
Microsoft's ODBC drivers for SQL Server execute statements using a system stored procedure on the server (sp_prepexec
or sp_prepare
). Stored procedures on SQL Server are limited to 2100 parameter values, so with a model like
class Trip(Base):
__tablename__ = "trip"
id = Column(String(32), primary_key=True)
this code will work
with Session(engine) as session:
trips_ids = ["trip_id_1", "trip_id_2"]
q = session.query(Trip).where(Trip.id.in_(trips_ids))
results = q.all()
"""SQL emitted:
SELECT trip.id AS trip_id
FROM trip
WHERE trip.id IN (?, ?)
[generated in 0.00092s] ('trip_id_1', 'trip_id_2')
"""
because it only has two parameter values. If the length of the trips_ids
list is increased to thousands of values the code will eventually fail.
One way to avoid the issue is to have SQLAlchemy construct an IN clause with literal values instead of parameter placeholders:
q = session.query(Trip).where(
Trip.id.in_(bindparam("p1", expanding=True, literal_execute=True))
)
results = q.params(p1=trips_ids).all()
"""SQL emitted:
SELECT trip.id AS trip_id
FROM trip
WHERE trip.id IN ('trip_id_1', 'trip_id_2')
[generated in 0.00135s] ()
"""