Context
I have a Django REST API using PostgreSQL database with millions of Items. These Items are processed by several systems and the processing details are sent back and stored in a Records table. The simplified models are:
class Item(models.Model):
details = models.JSONField()
class Record(models.Model):
items = models.ManyToManyField(Item)
created = models.DateTimeField(auto_created=True)
system = models.CharField(max_length=100)
status = models.CharField(max_length=100)
details = models.JSONField()
Goal
I would like to do arbitrary filters on the Items table and get a summary of various processing systems. This summary obtains the latest status for each selected Item for each system, and displays a count of each status. For example if I filter for 1055 items an example return is:
{
System_1: [running: 5, completed: 1000, error: 50],
System_2: [halted: 55, completed: 1000],
System_3: [submitted: 1055]
}
I currently have this working doing queries like below, which returns the count of processing statuses for System_1 and repeat for the other systems and package into a JSON return.
Item.objects.filter(....).annotate(
system_1_status=Subquery(
Record.objects.filter(
system='System_1',
items__id=OuterRef('pk')
).order_by('-created').values('status')[:1]
)
).values('system_1_status').annotate(count=Count('system_1_status'))
This converts to sql query:
SELECT
"api_item"."id",
"api_item"."details",
(
SELECT
U0."status"
FROM
"api_record" U0
INNER JOIN
"api_record_items" U1
ON
(U0."id" = U1."record_id")
WHERE
(U1."item_id" = ("api_item"."id") AND U0."system" = system_1)
ORDER BY
U0."created" DESC LIMIT 1
) AS "system_1_status"
FROM "api_item"
We have millions of Items and Records and this works reasonably well if we select less than a thousand Items. Above this it takes minutes. Trying to do it for hundreds of thousands of items is catastrophic.
Questions
Can I improve the performance of this query? I can't see how, other than playing with indexes?
Alternatively would it be a bad idea to add a JSONField to the Item models which stores a cache of the latest statuses for each system for that Item? Whilst I don't like the idea of duplicating the data, doing an aggregation over a field already on the Item model should be very quick at querytime. I have DjangoQ which I can use scheduled functions to keep these fields up to date.
CodePudding user response:
You are doing one subquery per item, try doing something like below and use aggregations.
from django.db.models import Case, When, Sum
items = Item.objects.filter(# your condition)
results = Record.objects.values("system").annotate(
running=Sum(Case(When(status="running", then=1), default=0,
output_field=IntegerField())),
completed=Sum(Case(When(status="completed", then=1), default=0,
output_field=IntegerField())),
# add more status annotations
).order_by("system").filter(items=items)
CodePudding user response:
I think a simple group by might be useful to get the status;
from django.db.models import Count
Item.objects.filter(record__system='System_1').values('record__status').annotate(c=Count('record__status')).values('record__status', 'c')