I've been working with ArangoDB for some time now, and development with it and Python is very straightforward and quick.
However, I've recently been facing some issues on the production side, which is a bare-metal server with these specs:
- Intel Xeon E-2386G 6c/12t @ 3.50GHz
- 32 GB ECC 3200 MHz
- 2x 512 GB NVMe SSD in RAID 1
The OS and ArangoDB versions are:
LOCAL:
- Fedora 36
- arangodb3 3.8.7
PRODUCTION:
- Rocky Linux 9
- arangodb3 3.9.3
The issue is the long runtime of quite simple queries: locally they take a matter of milliseconds, while in production they take whole seconds. I should also point out that the two collections used in the query are bigger in production (searches has 41k docs, results has 350k docs) than locally (searches has 1k docs, results has 27k docs); however, I noticed the slow behavior in production even with smaller numbers.
Here's the production profiled query:
Query String (415 chars, cacheable: false):
FOR entry in searches
FILTER entry.session_key == '21150583'
LET counters = (
FOR result in results_linkers
FILTER result.search_key == entry._key AND result.validation_automatic.status == 'completed'
COLLECT action = result.validation_automatic.action WITH COUNT INTO action_counter
RETURN { 'action': action, 'count': action_counter }
)
LIMIT 20
RETURN { 'search': entry, 'counters': counters }
Execution plan:
Id NodeType Calls Items Runtime [s] Comment
1 SingletonNode 1 1 0.00000 * ROOT
2 EnumerateCollectionNode 1 20 0.00737 - FOR entry IN searches /* full collection scan */ FILTER (entry.`session_key` == "21150583") /* early pruning */
14 LimitNode 1 20 0.00001 - LIMIT 0, 20
18 SubqueryStartNode 1 40 0.00001 - LET counters = ( /* subquery begin */
6 EnumerateCollectionNode 7136 7135580 3.33341 - FOR result IN results_linkers /* full collection scan, projections: `search_key`, `validation_automatic` */
7 CalculationNode 7136 7135580 1.36629 - LET #9 = ((result.`search_key` == entry.`_key`) && (result.`validation_automatic`.`status` == "completed")) /* simple expression */ /* collections used: result : results_linkers, entry : searches */
8 FilterNode 2 1854 0.12298 - FILTER #9
9 CalculationNode 2 1854 0.00024 - LET #11 = result.`validation_automatic`.`action` /* attribute expression */ /* collections used: result : results_linkers */
10 CollectNode 1 56 0.00013 - COLLECT action = #11 AGGREGATE action_counter = LENGTH() /* hash */
17 SortNode 1 56 0.00001 - SORT action ASC /* sorting strategy: standard */
11 CalculationNode 1 56 0.00002 - LET #13 = { "action" : action, "count" : action_counter } /* simple expression */
19 SubqueryEndNode 1 20 0.00001 - RETURN #13 ) /* subquery end */
15 CalculationNode 1 20 0.00001 - LET #15 = { "search" : entry, "counters" : counters } /* simple expression */ /* collections used: entry : searches */
16 ReturnNode 1 20 0.00000 - RETURN #15
Indexes used:
none
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 move-calculations-down
6 reduce-extraction-to-projection
7 move-filters-into-enumerate
8 splice-subqueries
Query Statistics:
Writes Exec Writes Ign Scan Full Scan Index Filtered Peak Mem [b] Exec Time [s]
0 0 7176820 0 7174966 655360 4.83091
Query Profile:
Query Stage Duration [s]
initializing 0.00000
parsing 0.00004
optimizing ast 0.00001
loading collections 0.00000
instantiating plan 0.00003
optimizing plan 0.00031
executing 4.83051
finalizing 0.00002
Here's the local one:
Query String (414 chars, cacheable: false):
FOR entry in searches
FILTER entry.session_key == '3307542'
LET counters = (
FOR result in results_linkers
FILTER result.search_key == entry._key AND result.validation_automatic.status == 'completed'
COLLECT action = result.validation_automatic.action WITH COUNT INTO action_counter
RETURN { 'action': action, 'count': action_counter }
)
LIMIT 20
RETURN { 'search': entry, 'counters': counters }
Execution plan:
Id NodeType Calls Items Runtime [s] Comment
1 SingletonNode 1 1 0.00001 * ROOT
2 EnumerateCollectionNode 1 1 0.00097 - FOR entry IN searches /* full collection scan */ FILTER (entry.`session_key` == "3307542") /* early pruning */
14 LimitNode 1 1 0.00001 - LIMIT 0, 20
18 SubqueryStartNode 1 2 0.00001 - LET counters = ( /* subquery begin */
6 EnumerateCollectionNode 28 27775 0.02148 - FOR result IN results_linkers /* full collection scan, projections: `search_key`, `validation_automatic` */
7 CalculationNode 28 27775 0.00866 - LET #9 = ((result.`search_key` == entry.`_key`) && (result.`validation_automatic`.`status` == "completed")) /* simple expression */ /* collections used: result : results_linkers, entry : searches */
8 FilterNode 1 97 0.00169 - FILTER #9
9 CalculationNode 1 97 0.00002 - LET #11 = result.`validation_automatic`.`action` /* attribute expression */ /* collections used: result : results_linkers */
10 CollectNode 1 3 0.00002 - COLLECT action = #11 AGGREGATE action_counter = LENGTH() /* hash */
17 SortNode 1 3 0.00001 - SORT action ASC /* sorting strategy: standard */
11 CalculationNode 1 3 0.00001 - LET #13 = { "action" : action, "count" : action_counter } /* simple expression */
19 SubqueryEndNode 1 1 0.00001 - RETURN #13 ) /* subquery end */
15 CalculationNode 1 1 0.00001 - LET #15 = { "search" : entry, "counters" : counters } /* simple expression */ /* collections used: entry : searches */
16 ReturnNode 1 1 0.00001 - RETURN #15
Indexes used:
none
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 move-calculations-down
6 reduce-extraction-to-projection
7 move-filters-into-enumerate
8 splice-subqueries
Query Statistics:
Writes Exec Writes Ign Scan Full Scan Index Filtered Peak Mem [b] Exec Time [s]
0 0 28901 0 28804 360448 0.03434
Query Profile:
Query Stage Duration [s]
initializing 0.00002
parsing 0.00012
optimizing ast 0.00002
loading collections 0.00001
instantiating plan 0.00007
optimizing plan 0.00117
executing 0.03293
finalizing 0.00003
Another detail to add to the story: I'm using the database with single big collections and retrieving data by looking up keys, like an old-fashioned relational database. Is that also the wrong approach here? The collections are going to grow over time, so I'm a bit concerned about this.
Thanks to whoever helps me out with this!
CodePudding user response:
Is your internet connection slow? I would try using traceroute to determine whether the network is the cause of the slow queries. In my experience, differences between local and production query speeds usually come down to resource usage or network speed. You can also check resource usage while the query is running with the top command.
Those two should help a lot in determining whether the cause is network- or resource-related. If neither of them reveals the bottleneck, it could be a configuration issue, unless your local dev and production environments are configured identically.
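Since you're already on Python, another way to separate network time from server time is to run the query with profiling enabled and compare the client-side wall time against the server-reported stage durations. A rough sketch with python-arango (the host, database name, and credentials are placeholders):

import time
from arango import ArangoClient

# Placeholder connection details -- substitute your own.
client = ArangoClient(hosts='http://production-host:8529')
db = client.db('mydb', username='root', password='secret')

QUERY = """
FOR entry IN searches
    FILTER entry.session_key == @session_key
    LET counters = (
        FOR result IN results_linkers
            FILTER result.search_key == entry._key
                AND result.validation_automatic.status == 'completed'
            COLLECT action = result.validation_automatic.action
                WITH COUNT INTO action_counter
            RETURN { 'action': action, 'count': action_counter }
    )
    LIMIT 20
    RETURN { 'search': entry, 'counters': counters }
"""

start = time.perf_counter()
cursor = db.aql.execute(QUERY, bind_vars={'session_key': '21150583'}, profile=True)
results = list(cursor)
wall_time = time.perf_counter() - start

# cursor.profile() returns the per-stage durations measured on the
# server (parsing, optimizing plan, executing, ...); their sum is the
# time spent server-side rather than on the network.
server_time = sum(cursor.profile().values())

print(f'client wall time : {wall_time:.4f} s')
print(f'server-side time : {server_time:.4f} s')
print(f'difference (network/driver): {wall_time - server_time:.4f} s')

If the client wall time tracks the server-side time closely, as your profiles suggest, the bottleneck is on the server rather than the network.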
Let me know what you discover using those.
CodePudding user response:
I'd recommend you look at adding indexes to your collections.
The first index I would add is a persistent index on the 'searches' collection, specifically on the key 'session_key'.
The second index I would add is a persistent index on the 'results_linkers' collection, on the keys 'search_key' and 'validation_automatic.status' (the subquery filters on both with equality, so one combined index can serve the whole filter). With that index in place, also test appending 'validation_automatic.action' as a third key; it's possible this will help the COLLECT.
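As a rough sketch, both indexes could be created with python-arango like this (connection details are placeholders; the web UI or arangosh work just as well):

from arango import ArangoClient

# Placeholder connection details -- substitute your own.
client = ArangoClient(hosts='http://production-host:8529')
db = client.db('mydb', username='root', password='secret')

# Serves the top-level FILTER on entry.session_key.
db.collection('searches').add_persistent_index(fields=['session_key'])

# Serves the subquery's equality filter on both attributes; append
# 'validation_automatic.action' as a third field to test whether it
# helps the COLLECT.
db.collection('results_linkers').add_persistent_index(
    fields=['search_key', 'validation_automatic.status']
)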
After adding the indexes, run the same query analysis and confirm your indexes show up in the 'Indexes used' section. If they aren't listed there, they provide no advantage, and you need to change the keys they index on until the optimizer picks them up.
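You can also check this programmatically: python-arango's explain returns the execution plan, and an IndexNode among the plan's nodes means an index was picked up, while an EnumerateCollectionNode means a full collection scan. A sketch, again with placeholder connection details:

from arango import ArangoClient

client = ArangoClient(hosts='http://production-host:8529')  # placeholder
db = client.db('mydb', username='root', password='secret')  # placeholder

plan = db.aql.explain(
    "FOR entry IN searches FILTER entry.session_key == @key LIMIT 20 RETURN entry",
    bind_vars={'key': '21150583'},
)

# Print the node types the optimizer chose; look for IndexNode.
for node in plan['nodes']:
    print(node['type'])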
You don't need to restart the server after adding an index; the optimizer considers the available indexes each time it plans an AQL query.
In my experience, indexes can improve query performance from 10x to 1000x depending on your query.
See if that helps.
When developing locally with a small data set, it's very easy to think your queries perform well, only to find out in production that they run slowly.
The best way around this is to have a 'PVT' (Production Verification Testing) environment where you can test new code against a database restored from production data, if you're able to do that.
See if these indexes help; tune them if they don't, or drop an index that isn't assisting.