I've been working with ArangoDB for some time now, and development with it and Python is very straightforward and quick.
However, I've recently been facing some issues on the production side, which is a bare-metal server with these specs:
- Intel Xeon E-2386G 6c/12t @ 3.50GHz
- 32 GB ECC 3200 MHz
- 2x 512 GB NVMe SSD in RAID 1
The OS and ArangoDB versions are:
LOCAL:
- Fedora 36
- arangodb3 3.8.7
PRODUCTION:
- Rocky Linux 9
- arangodb3 3.9.3
The issue is the long runtime of quite simple queries: locally they take a matter of milliseconds, while in production they take whole seconds. I should also point out that the two collections used in the query are bigger in production (searches has 41k docs, results has 350k docs) than locally (searches has 1k docs, results has 27k docs); however, I noticed the slow behavior in production even with smaller numbers.
Here's the production profiled query:
Query String (415 chars, cacheable: false):
FOR entry in searches
FILTER entry.session_key == '21150583'
LET counters = (
FOR result in results_linkers
FILTER result.search_key == entry._key AND result.validation_automatic.status == 'completed'
COLLECT action = result.validation_automatic.action WITH COUNT INTO action_counter
RETURN { 'action': action, 'count': action_counter }
)
LIMIT 20
RETURN { 'search': entry, 'counters': counters }
Execution plan:
Id NodeType Calls Items Runtime [s] Comment
1 SingletonNode 1 1 0.00000 * ROOT
2 EnumerateCollectionNode 1 20 0.00737 - FOR entry IN searches /* full collection scan */ FILTER (entry.`session_key` == "21150583") /* early pruning */
14 LimitNode 1 20 0.00001 - LIMIT 0, 20
18 SubqueryStartNode 1 40 0.00001 - LET counters = ( /* subquery begin */
6 EnumerateCollectionNode 7136 7135580 3.33341 - FOR result IN results_linkers /* full collection scan, projections: `search_key`, `validation_automatic` */
7 CalculationNode 7136 7135580 1.36629 - LET #9 = ((result.`search_key` == entry.`_key`) && (result.`validation_automatic`.`status` == "completed")) /* simple expression */ /* collections used: result : results_linkers, entry : searches */
8 FilterNode 2 1854 0.12298 - FILTER #9
9 CalculationNode 2 1854 0.00024 - LET #11 = result.`validation_automatic`.`action` /* attribute expression */ /* collections used: result : results_linkers */
10 CollectNode 1 56 0.00013 - COLLECT action = #11 AGGREGATE action_counter = LENGTH() /* hash */
17 SortNode 1 56 0.00001 - SORT action ASC /* sorting strategy: standard */
11 CalculationNode 1 56 0.00002 - LET #13 = { "action" : action, "count" : action_counter } /* simple expression */
19 SubqueryEndNode 1 20 0.00001 - RETURN #13 ) /* subquery end */
15 CalculationNode 1 20 0.00001 - LET #15 = { "search" : entry, "counters" : counters } /* simple expression */ /* collections used: entry : searches */
16 ReturnNode 1 20 0.00000 - RETURN #15
Indexes used:
none
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 move-calculations-down
6 reduce-extraction-to-projection
7 move-filters-into-enumerate
8 splice-subqueries
Query Statistics:
Writes Exec Writes Ign Scan Full Scan Index Filtered Peak Mem [b] Exec Time [s]
0 0 7176820 0 7174966 655360 4.83091
Query Profile:
Query Stage Duration [s]
initializing 0.00000
parsing 0.00004
optimizing ast 0.00001
loading collections 0.00000
instantiating plan 0.00003
optimizing plan 0.00031
executing 4.83051
finalizing 0.00002
Here's the local one:
Query String (414 chars, cacheable: false):
FOR entry in searches
FILTER entry.session_key == '3307542'
LET counters = (
FOR result in results_linkers
FILTER result.search_key == entry._key AND result.validation_automatic.status == 'completed'
COLLECT action = result.validation_automatic.action WITH COUNT INTO action_counter
RETURN { 'action': action, 'count': action_counter }
)
LIMIT 20
RETURN { 'search': entry, 'counters': counters }
Execution plan:
Id NodeType Calls Items Runtime [s] Comment
1 SingletonNode 1 1 0.00001 * ROOT
2 EnumerateCollectionNode 1 1 0.00097 - FOR entry IN searches /* full collection scan */ FILTER (entry.`session_key` == "3307542") /* early pruning */
14 LimitNode 1 1 0.00001 - LIMIT 0, 20
18 SubqueryStartNode 1 2 0.00001 - LET counters = ( /* subquery begin */
6 EnumerateCollectionNode 28 27775 0.02148 - FOR result IN results_linkers /* full collection scan, projections: `search_key`, `validation_automatic` */
7 CalculationNode 28 27775 0.00866 - LET #9 = ((result.`search_key` == entry.`_key`) && (result.`validation_automatic`.`status` == "completed")) /* simple expression */ /* collections used: result : results_linkers, entry : searches */
8 FilterNode 1 97 0.00169 - FILTER #9
9 CalculationNode 1 97 0.00002 - LET #11 = result.`validation_automatic`.`action` /* attribute expression */ /* collections used: result : results_linkers */
10 CollectNode 1 3 0.00002 - COLLECT action = #11 AGGREGATE action_counter = LENGTH() /* hash */
17 SortNode 1 3 0.00001 - SORT action ASC /* sorting strategy: standard */
11 CalculationNode 1 3 0.00001 - LET #13 = { "action" : action, "count" : action_counter } /* simple expression */
19 SubqueryEndNode 1 1 0.00001 - RETURN #13 ) /* subquery end */
15 CalculationNode 1 1 0.00001 - LET #15 = { "search" : entry, "counters" : counters } /* simple expression */ /* collections used: entry : searches */
16 ReturnNode 1 1 0.00001 - RETURN #15
Indexes used:
none
Optimization rules applied:
Id RuleName
1 move-calculations-up
2 move-filters-up
3 move-calculations-up-2
4 move-filters-up-2
5 move-calculations-down
6 reduce-extraction-to-projection
7 move-filters-into-enumerate
8 splice-subqueries
Query Statistics:
Writes Exec Writes Ign Scan Full Scan Index Filtered Peak Mem [b] Exec Time [s]
0 0 28901 0 28804 360448 0.03434
Query Profile:
Query Stage Duration [s]
initializing 0.00002
parsing 0.00012
optimizing ast 0.00002
loading collections 0.00001
instantiating plan 0.00007
optimizing plan 0.00117
executing 0.03293
finalizing 0.00003
Another detail to add to the story: I'm using the database with single big collections and retrieving data by looking up keys, like an old-fashioned relational database. Is that also the wrong approach here? The collections are going to grow over time, so I'm a bit concerned about this.
Thanks to whoever helps me out with this!
CodePudding user response:
Is your internet connection slow? I would try using traceroute to determine whether the network is the cause of the slow queries. In my experience, differences between local and production query speeds usually come down to resource usage or network speed. You can also check resource usage while the query is running with the top command.
Those two should help a lot in determining whether the cause is network- or resource-related. If neither of them reveals the bottleneck, it could be a configuration issue, unless your local dev and production environments are configured identically.
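Since you're already on Python, another way to separate network time from server time is to run the query with profiling enabled and compare the client-side wall time against the server-reported stage durations. A rough sketch with python-arango (the host, database name, and credentials are placeholders):

import time
from arango import ArangoClient

# Placeholder connection details -- substitute your own.
client = ArangoClient(hosts='http://production-host:8529')
db = client.db('mydb', username='root', password='secret')

QUERY = """
FOR entry IN searches
    FILTER entry.session_key == @session_key
    LET counters = (
        FOR result IN results_linkers
            FILTER result.search_key == entry._key
                AND result.validation_automatic.status == 'completed'
            COLLECT action = result.validation_automatic.action
                WITH COUNT INTO action_counter
            RETURN { 'action': action, 'count': action_counter }
    )
    LIMIT 20
    RETURN { 'search': entry, 'counters': counters }
"""

start = time.perf_counter()
cursor = db.aql.execute(QUERY, bind_vars={'session_key': '21150583'}, profile=True)
results = list(cursor)
wall_time = time.perf_counter() - start

# cursor.profile() returns the per-stage durations measured on the
# server (parsing, optimizing plan, executing, ...); their sum is the
# time spent server-side rather than on the network.
server_time = sum(cursor.profile().values())

print(f'client wall time : {wall_time:.4f} s')
print(f'server-side time : {server_time:.4f} s')
print(f'difference (network/driver): {wall_time - server_time:.4f} s')

If the client wall time tracks the server-side time closely, as your profiles suggest, the bottleneck is on the server rather than the network.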
Let me know what you discover using those.
CodePudding user response:
I'd recommend you look at adding indexes to your collections.
The first index I would add is a persistent index on the 'searches' collection, specifically on the key 'session_key'.
The second index I would add is a persistent index on the 'results_linkers' collection, on the keys 'search_key' and 'validation_automatic.status' (the subquery filters on both with equality, so one combined index can serve the whole filter). With that index in place, also test appending 'validation_automatic.action' as a third key; it's possible this will help the COLLECT.
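As a rough sketch, both indexes could be created with python-arango like this (connection details are placeholders; the web UI or arangosh work just as well):

from arango import ArangoClient

# Placeholder connection details -- substitute your own.
client = ArangoClient(hosts='http://production-host:8529')
db = client.db('mydb', username='root', password='secret')

# Serves the top-level FILTER on entry.session_key.
db.collection('searches').add_persistent_index(fields=['session_key'])

# Serves the subquery's equality filter on both attributes; append
# 'validation_automatic.action' as a third field to test whether it
# helps the COLLECT.
db.collection('results_linkers').add_persistent_index(
    fields=['search_key', 'validation_automatic.status']
)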
After adding the indexes, run the same query analysis and confirm your indexes show up in the 'Indexes used' section. If they aren't listed there, they provide no advantage, and you need to change the keys they index on until the optimizer picks them up.
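You can also check this programmatically: python-arango's explain returns the execution plan, and an IndexNode among the plan's nodes means an index was picked up, while an EnumerateCollectionNode means a full collection scan. A sketch, again with placeholder connection details:

from arango import ArangoClient

client = ArangoClient(hosts='http://production-host:8529')  # placeholder
db = client.db('mydb', username='root', password='secret')  # placeholder

plan = db.aql.explain(
    "FOR entry IN searches FILTER entry.session_key == @key LIMIT 20 RETURN entry",
    bind_vars={'key': '21150583'},
)

# Print the node types the optimizer chose; look for IndexNode.
for node in plan['nodes']:
    print(node['type'])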
You don't need to restart the server after adding an index; the optimizer considers the available indexes each time it plans an AQL query.
In my experience, indexes can improve query performance from 10x to 1000x depending on your query.
See if that helps.
When developing locally with a small data set, it's very easy to think your queries perform well, only to find out in production that they run slowly.
The best way around this is to have a 'PVT' (Production Verification Testing) environment where you can test new code against a database restored from production data, if you're able to do that.
See if these indexes help; tune them if they don't, or drop an index that isn't assisting.