Home > OS >  Django PostgreSQL best way to improve performance of slow summary aggregation?
Django PostgreSQL best way to improve performance of slow summary aggregation?

Time:06-15

Context

I have a Django REST API using PostgreSQL database with millions of Items. These Items are processed by several systems and the processing details are sent back and stored in a Records table. The simplified models are:

class Item(models.Model):
    details = models.JSONField()


class Record(models.Model):
    items = models.ManyToManyField(Item)
    created = models.DateTimeField(auto_created=True)
    system = models.CharField(max_length=100)
    status = models.CharField(max_length=100)
    details = models.JSONField()

Goal

I would like to do arbitrary filters on the Items table and get a summary of various processing systems. This summary obtains the latest status for each selected Item for each system, and displays a count of each status. For example if I filter for 1055 items an example return is:

{
System_1: [running: 5, completed: 1000, error: 50],
System_2: [halted: 55, completed: 1000],
System_3: [submitted: 1055]
}

I currently have this working doing queries like below, which returns the count of processing statuses for System_1 and repeat for the other systems and package into a JSON return.

Item.objects.filter(....).annotate(
    system_1_status=Subquery(
        Record.objects.filter(
            system='System_1',
            items__id=OuterRef('pk')
        ).order_by('-created').values('status')[:1]
    )
).values('system_1_status').annotate(count=Count('system_1_status'))

This converts to sql query:

SELECT 
    "api_item"."id", 
    "api_item"."details", 
    (
        SELECT 
            U0."status" 
        FROM 
            "api_record" U0 
        INNER JOIN 
            "api_record_items" U1 
        ON 
            (U0."id" = U1."record_id") 
        WHERE 
            (U1."item_id" = ("api_item"."id") AND U0."system" = system_1) 
        ORDER BY 
            U0."created" DESC LIMIT 1
    ) AS "system_1_status" 
FROM "api_item"

We have millions of Items and Records and this works reasonably well if we select less than a thousand Items. Above this it takes minutes. Trying to do it for hundreds of thousands of items is catastrophic.

Questions

Can I improve the performance of this query? I can't see how, other than playing with indexes?

Alternatively would it be a bad idea to add a JSONField to the Item models which stores a cache of the latest statuses for each system for that Item? Whilst I don't like the idea of duplicating the data, doing an aggregation over a field already on the Item model should be very quick at querytime. I have DjangoQ which I can use scheduled functions to keep these fields up to date.

CodePudding user response:

You are doing one subquery per item, try doing something like below and use aggregations.

from django.db.models import Case, When, Sum

items = Item.objects.filter(# your condition)
results = Record.objects.values("system").annotate(
    running=Sum(Case(When(status="running", then=1), default=0, 
        output_field=IntegerField())),
    completed=Sum(Case(When(status="completed", then=1), default=0, 
        output_field=IntegerField())),
    # add more status annotations
).order_by("system").filter(items=items)

CodePudding user response:

I think a simple group by might be useful to get the status;

from django.db.models import Count
Item.objects.filter(record__system='System_1').values('record__status').annotate(c=Count('record__status')).values('record__status', 'c')
  • Related