Timeout while calling a service from MarkLogic


I have built a service in MarkLogic that is consumed (GET method) by a downstream application. The REST endpoint takes four parameters: startDate, endDate, seqStart and seqLength.

The total number of records to be sent through the REST endpoint is 1.5M, and we are sending them in batches of 25,000.

I have noticed that the elapsed time differs between two executions that have the same batch size but a different sequence start:

1) seqStart=1, seqLength=25000, elapsed time = 40s
2) seqStart=100000, seqLength=25000, elapsed time = 70s

Why am I getting different elapsed times for REST calls that have the same seqLength but a different seqStart?

I am using fn:subsequence in my CTS query. Is this normal behavior, or do I need to make changes in the service?

CodePudding user response:

This is actually a common issue, not only in MarkLogic but in many other DBMS and search-engine systems too.

You can run a query like this locally to verify it:

fn:subsequence(cts:search(fn:doc(), cts:true-query()), 1, 10)

And then compare the elapsed time to a query such as:

fn:subsequence(cts:search(fn:doc(), cts:true-query()), 1000000, 10)
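
To compare the two yourself in Query Console, a minimal timing sketch using the built-in xdmp:elapsed-time is:

(: force the page to materialize, then report how long the query has run :)
let $page := fn:subsequence(cts:search(fn:doc(), cts:true-query()), 1000000, 10)
return (fn:count($page), xdmp:elapsed-time())

The deep-start run should take noticeably longer than the same query with a start of 1.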

Essentially, the problem is that MarkLogic has to resolve the entire query and then generate and skip past every result before your page, one by one, until it reaches the page/batch you asked for. The deeper the page, the more results it has to skip, which is why seqStart=100000 takes longer than seqStart=1.

The only way to speed this up is to calculate the total number of pages/batches and, once you seek a page past the mid-point of the result set, iterate through the results in reverse order instead. With 1.5M results and a page size of 25,000, that is 60 pages, so pages 31-60 would be read from the end of the result set.

But pages toward the center of the result set will always be the slowest, since they are far from both ends.

Something like the following should work to return pages closer to the end of the result set faster by building the pages in reverse order:

let $pageNumber := 1000
let $pageSize := 25000
let $resultCount := xdmp:estimate(cts:search(fn:doc(), cts:true-query()))
let $totalPages := xs:integer(fn:ceiling($resultCount div $pageSize))
let $middlePage := $totalPages idiv 2
let $reverseOrder := $pageNumber gt $middlePage
(: past the mid-point, fetch the page counted from the end instead :)
let $effectivePage := if ($reverseOrder) then $totalPages - $pageNumber + 1 else $pageNumber
let $start := ($effectivePage - 1) * $pageSize + 1
(: cts:document-order (MarkLogic 9+) flips the scan direction; note that items in a
   reversed page come back in reverse order, and that page arithmetic from the end
   is only exact when $resultCount is a multiple of $pageSize :)
let $direction := if ($reverseOrder) then "descending" else "ascending"
return fn:subsequence(
  cts:search(fn:doc(), cts:true-query(), cts:document-order($direction)),
  $start, $pageSize)

There are some other novel tricks you can use as well to build your export faster.

If the data set you are exporting is completely static or unsorted, you could put a unique incremental ID into each document, such as 1, 2, etc. Then you would just need to run:

(: assumes an element range index on the id element :)
cts:search(fn:doc(),
  cts:and-query((
    cts:element-range-query(xs:QName("id"), ">=", $start),
    cts:element-range-query(xs:QName("id"), "<=", $end)
  ))
)

That would return just the results that belong to the page.
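
For the stamping itself, a rough one-off sketch could look like the following; the batching advice in the comment is an assumption on my part, since 1.5M updates should not run in a single transaction:

(: stamp an incremental <id> element onto the root of every matching document;
   in practice run this in batches (e.g. via xdmp:spawn or a tool like CoRB)
   rather than as one huge transaction :)
for $doc at $i in cts:search(fn:doc(), cts:true-query())
return xdmp:node-insert-child($doc/element(), <id>{$i}</id>)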

If the set is unsorted, or rather if order doesn't matter, then this approach is valid for a bulk export. If order does matter, then the time and difficulty of keeping the IDs up to date is probably not worth it unless changes to the data set are highly infrequent.

Another approach you can look at is using smaller batches, running multiple workers/exporters in parallel, and then stitching the export back together yourself after it completes. It sounds like you're doing something similar to this already; I'm just suggesting you continue to scale it out with more parallel workers. The problem you may run into is an incomplete export if the data set changes before the export finishes.
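
If you want to drive that fan-out from inside MarkLogic rather than from the client, a rough sketch with xdmp:spawn-function could look like this (local:fetch-batch and the /export/ URIs are hypothetical placeholders, not part of your service):

declare function local:fetch-batch($start as xs:integer, $length as xs:integer)
{
  (: hypothetical batch writer: fetch one page and persist it for later stitching;
     each task still pays the deep fn:subsequence skip cost, but tasks run concurrently :)
  let $page := fn:subsequence(cts:search(fn:doc(), cts:true-query()), $start, $length)
  return xdmp:document-insert("/export/batch-" || $start || ".xml",
    <batch start="{$start}">{$page}</batch>)
};

let $pageSize := 25000
let $total := xdmp:estimate(cts:search(fn:doc(), cts:true-query()))
let $batches := xs:integer(fn:ceiling($total div $pageSize))
for $batch in 1 to $batches
let $start := ($batch - 1) * $pageSize + 1
(: each task runs independently on the task server :)
return xdmp:spawn-function(function() { local:fetch-batch($start, $pageSize) })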
