Django PostgreSQL best way to improve performance of slow summary aggregation?-CodePudding

Context

I have a Django REST API using PostgreSQL database with millions of Items. These Items are processed by several systems and the processing details are sent back and stored in a Records table. The simplified models are:

class Item(models.Model):
    details = models.JSONField()


class Record(models.Model):
    items = models.ManyToManyField(Item)
    created = models.DateTimeField(auto_created=True)
    system = models.CharField(max_length=100)
    status = models.CharField(max_length=100)
    details = models.JSONField()

Goal

I would like to do arbitrary filters on the Items table and get a summary of various processing systems. This summary obtains the latest status for each selected Item for each system, and displays a count of each status. For example if I filter for 1055 items an example return is:

{
System_1: [running: 5, completed: 1000, error: 50],
System_2: [halted: 55, completed: 1000],
System_3: [submitted: 1055]
}

I currently have this working doing queries like below, which returns the count of processing statuses for System_1 and repeat for the other systems and package into a JSON return.

Item.objects.filter(....).annotate(
    system_1_status=Subquery(
        Record.objects.filter(
            system='System_1',
            items__id=OuterRef('pk')
        ).order_by('-created').values('status')[:1]
    )
).values('system_1_status').annotate(count=Count('system_1_status'))

This converts to sql query:

SELECT 
    "api_item"."id", 
    "api_item"."details", 
    (
        SELECT 
            U0."status" 
        FROM 
            "api_record" U0 
        INNER JOIN 
            "api_record_items" U1 
        ON 
            (U0."id" = U1."record_id") 
        WHERE 
            (U1."item_id" = ("api_item"."id") AND U0."system" = system_1) 
        ORDER BY 
            U0."created" DESC LIMIT 1
    ) AS "system_1_status" 
FROM "api_item"

We have millions of Items and Records and this works reasonably well if we select less than a thousand Items. Above this it takes minutes. Trying to do it for hundreds of thousands of items is catastrophic.

Questions

Can I improve the performance of this query? I can't see how, other than playing with indexes?

Alternatively would it be a bad idea to add a JSONField to the Item models which stores a cache of the latest statuses for each system for that Item? Whilst I don't like the idea of duplicating the data, doing an aggregation over a field already on the Item model should be very quick at querytime. I have DjangoQ which I can use scheduled functions to keep these fields up to date.

CodePudding user response：

You are doing one subquery per item, try doing something like below and use aggregations.

from django.db.models import Case, When, Sum

items = Item.objects.filter(# your condition)
results = Record.objects.values("system").annotate(
    running=Sum(Case(When(status="running", then=1), default=0, 
        output_field=IntegerField())),
    completed=Sum(Case(When(status="completed", then=1), default=0, 
        output_field=IntegerField())),
    # add more status annotations
).order_by("system").filter(items=items)

CodePudding user response：

I think a simple group by might be useful to get the status;

from django.db.models import Count
Item.objects.filter(record__system='System_1').values('record__status').annotate(c=Count('record__status')).values('record__status', 'c')