Home > database >  SQL/SQLAlchemy filter on previously limited results
SQL/SQLAlchemy filter on previously limited results

Time:09-10

I have to filter my products by the status name. I have a subquery to get the last status for each product (the last status is the only one relevant for my usecase).

I want to get Product whose last status matches a list of status names (statuses in my example).

Models:

class Product(BaseModel):
    __tablename__ = "products"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    statuses = relationship("Status", back_populates="product")

class Status(BaseModel):
    __tablename__ = "status"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    name = Column(String(255), nullable=False)
    created_at = Column(DateTime, server_default=functions.now(), index=True)

Query:

subquery = (
    select(Status.id)
    .filter(
        Status.product_id == Product.id,
    )
    .order_by(Status.created_at.desc())
    .limit(1)
    .correlate(Status)
    .subquery()
)

I want to filter on status names with the following code:

statement = (
    select(Product.id)
    .outerjoin(Status)
    .filter(
        Status.id.in_(subquery),
        Status.name.in_(statuses),
    )
)

But unfortunately, I got products with the last status not matching my statuses.

I think this is because I wrote my SQLAlchemy request like

Give me products with status matching the statuses

and it looks to the last status matching the statuses, instead of

Give me products with THE LAST status matching the statuses

for it to look to the last status AND check if this status match the statuses.

Why this result?

CodePudding user response:

What you need in your subquery is a GROUP BY clause, so that the subquery contains the latest status for every product, rather than only having the latest status overall (which is what was happening in your subquery when you used LIMIT 1). Then you do your join and filter for only the status values that you want.

import sqlalchemy as sa

subquery = (
    select(Status)
    .order_by(Status.created_at.desc())
    .group_by(Status.product_id)
    .subquery()
)

active_products_query = (
    select(Product.id)
    .join(subquery, Product.id == subquery.c.product_id)
    .filter(subquery.c.name.in_(["active"]))
)

print(session.execute(active_products_query).scalars().all())

CodePudding user response:

Option-1: One way would be to add a JOIN with the filter which will select only the latest Status:

query = (
    select(Product)

    .innerjoin(
        Status,
        Status.id == (
            select(Status.id)
            .filter(Status.product_id == Product.id)
            .order_by(Status.created_at.desc())
            .limit(1)
            .scalar_subquery()
            .correlate(Product)
        )
    )

    .filter(Status.name.in_(statuses))
)

Option-2: Add computed_column for last_status_name

First, add this property to the Product:

Product.last_status_name = column_property(
    select(Status.name.label("last_status_name"))
    .filter(Status.product_id == Product.id)
    .order_by(Status.created_at.desc())
    .limit(1)
    .scalar_subquery(),
    deferred=True  # prevent from always loading this from the database
)

Then you can use it whenever you need to simply by adding a filter to a query:

query = (
    select(Product)
    .filter(Product.last_status_name.in_(statuses))
)

And if you remote the deferred=True above, you will always get it loaded from the database with each query.

  • Related