Home > Blockchain >  How do I get the count of a result set efficiently?
How do I get the count of a result set efficiently?

Time:01-11

I've got a function which interacts with a postgres DB.

The function takes a parameter called pagination_data_required (boolean).
If pagination_required is set to true, the function executes a query as well as a query.count() which according to the documentation docs.peewee link here, puts the query in a wrapped count() function.

def list_records(pagination_data_required):
       query = table1.select(table1.columns...).join(table2....).distinct() ## returns nearly 500k rows

 if (filter_request_body.pagination_data_required):
       total_count = query.count()

My problem arises when .count() is called. Without a .count() my api returns results within a second whereas with .count(), the response time skyrockets to ~18 seconds.
I need this total count due to a requirement from the frontend team.

The query is returning roughly 500k records (which is needed, plus there's a .paginate() function being called)

How do I efficiently count the number of rows returned in query ?

I've tried the api with pagination_data_required True and Falseand the results remain the same.
I've tried to call .dicts() on the original query and take the count of items but it gives the same response time.

CodePudding user response:

The only way to count the number of rows returned by a query is to execute the query and count the results. I don't know how your ORM implements pagination, but I assume that it will append a LIMIT clause at the end of the query. That can speed up execution, because only the first few rows of the result set have to be calculated. But calculating the count will take much longer for a large result set.

So there is no good solution for this problem other than not showing an exact result count. See my article for a discussion of the problem and potential workarounds.

CodePudding user response:

This one's a classic.

It is usually not possible to count the rows returned by a query without actually running the query. If it includes things that don't change the count, like left joins, sorts, joins on foreign keys that don't add or remove rows, etc, then you could remove them and get a bit of a speedup, but you will still be running the query. But if it is using a LIMIT'ed index scan for efficient searching of the most recent rows (for exampel) then that optimization won't work with a count. Also reading such a large amount of useless data will trash your cache. If the count query is run often, all the data it uses will fill your cache, and evict data that other more useful queries need, this will make these queries slow. Or you will have to upgrade your RAM.

In some cases, like a forum, displaying a topic always uses the same search criteria. It is simply "where topic_id=... order by post_id". In this case, counting the posts is very wasteful, always doing the exact same query all over again, and paginating results with (LIMIT OFFSET) is also slow as it discards all the selected rows before the requested offset. Since the most often requested page is the last one, the worst case is the most common.

However, with such fixed search and ordering criteria, the row number of any row in the result set is always the same, so it is possible to cache it as "post number in topic" in the posts table. Then, to get one specific page, it is simply a matter of "post_number BETWEEN ... AND ...", and to count the posts in a topic, just select the post_number of the last one. In this case it is possible to get the exact count without actually counting, and to paginate without using OFFSET, which is much faster.

For a generic search query that can use many criteria, it is not possible to store the row number in such a simple way. However, knowing the exact count is usually not necessary. When the GUI displays:

Page: 1 2 3 4 .... 50000 50001

Will a user ever navigate to page 837? Probably not. What users do in this case is use sort to get the result they want on top, or refine their search criteria to reduce the number of results to something manageable. So the time spent in this huge count() query is almost always wasted. Basically, the information that is relevant to the user is: are there few pages, so it's possible to scan them by eye, or are there a lot, so he should refine his search criteria?

This does not need an accurate count, so the easiest way to fix this is to limit the counted results to something that would fill a number of pages like 5 or 10. Instead of:

SELECT count(*) FROM ...

use:

SELECT count(*) FROM (subquery ORDER BY ... LIMIT ...) AS foo

The next step is to realize selecting a few pages of results will quite often be almost as fast as selecting one page, so this is a good opportunity to cache the results for at least the first few pages when the first page is requested. This allows getting rid of the count, as you retrieve more results than necessary.

It is also possible to return the first few pages to the client and paginate on the client side using javascript, which means side queries.

Quite often the user will click on the last page instead of reversing the order, in this case you should flip the ORDER BY direction to keep a small LIMIT, not count all the rows and use a huge OFFSET to skip all pages except the last. When using the correct ORDER BY direction depending on which page is requested, the most common ones (first and last page) are fastest, with the worst case being in the middle, which is rarely clicked.

Another option is to cache the counts. The largest counts will most likely be for queries involving few search criteria, perhaps with common values, which results in a few combinations that can be cached beforehand. In addition, if the user clicks on page 2, reuse the cached count from the previous page. Of course the counts won't be exact, but that doesn't matter. It would only matter if the pagination logic was done wrong, ie not flipping the ORDER BY for pages close to the last one are requested.

I need this total count due to a requirement from the frontend team.

It's not possible, so the frontend team needs to read the answers to your question and act accordingly.

  • Related