Azure CosmosDB Bulk Insert with Batching


I'm using Azure Cosmos DB with an Azure Function. In the function, I'm trying to save around 27 million records.

The AddBulkAsync method looks like this:

public async Task AddBulkAsync(List<TEntity> documents)
{
    // Queue one CreateItemAsync call per document; the tasks run
    // concurrently and complete together when Task.WhenAll finishes.
    List<Task> concurrentTasks = new();
    foreach (var item in documents)
    {
        concurrentTasks.Add(_container.CreateItemAsync<TEntity>(
            item,
            new PartitionKey(Convert.ToString(_PartitionKey?.GetValue(item)))));
    }
    await Task.WhenAll(concurrentTasks);
}

I have around 27 million records, categorized by a property called batch; each batch holds around 82-85K records.

Currently, I'm creating a single list of all 27 million records and sending that list to AddBulkAsync. That causes problems with both memory and throughput. Instead, I'm planning to send the data in batches:

  1. Get the distinct batchIds from all the documents and build a batch list.
  2. Loop over the batch list and fetch the documents (around 85K records) for the current batch.
  3. Call AddBulk for that batch's documents.
  4. Continue the loop for the next batchId, and so on (see the sketch after this list).
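
In code, the plan looks roughly like this, where GetDistinctBatchIdsAsync and GetDocumentsByBatchAsync are placeholders for my data access, not real methods:

List<string> batchIds = await source.GetDistinctBatchIdsAsync();
foreach (string batchId in batchIds)
{
    // Only one batch (~85K documents) is materialized in memory at a time.
    List<TEntity> batchDocs = await source.GetDocumentsByBatchAsync(batchId);
    await AddBulkAsync(batchDocs);
}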

Here is what I wanted to understand: this approach calls AddBulk thousands of times, and each call in turn awaits Task.WhenAll(concurrentTasks) internally.

Will this cause a problem? Is there a better approach to achieve this scenario? Please guide.

CodePudding user response:

Some suggestions for scenarios like this.

When doing bulk operations like these, be sure to enable bulk execution (AllowBulkExecution = true in the CosmosClientOptions) when instantiating the Cosmos client. Bulk mode works by queuing up operations, divided into groups based upon the partition key range of a physical partition within the container. (Logical partition key values, when hashed, fall within a range of hashes that map to the physical partition where they reside.) As a queue fills up, the Cosmos client dispatches all the queued-up items for that partition; if it does not fill up, it dispatches what it has, full or not.
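
With the .NET SDK v3 that looks roughly like the sketch below (the configuration setting name is just a placeholder):

using System;
using Microsoft.Azure.Cosmos;

string connectionString = Environment.GetEnvironmentVariable("COSMOS_CONNECTION"); // placeholder setting name

CosmosClient client = new(
    connectionString,
    new CosmosClientOptions
    {
        // Lets the SDK queue operations and dispatch them in batches
        // grouped per physical partition.
        AllowBulkExecution = true
    });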

One way you can help with this is to batch your writes so that the items you send are roughly evenly spread across the logical partition key values within your container. This allows the Cosmos client to more fully saturate the available throughput, with batches that are as full as possible and sent in parallel. When data is not evenly distributed across logical partition key values, ingestion is slower because dispatches only go out as the queues fill up; it is less efficient and will not fully utilize every ounce of provisioned throughput.
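
One simple way to get that spread, assuming the documents for a batch can be grouped by their partition key value in memory, is to round-robin across the key groups before handing the list to your bulk method; a sketch:

using System;
using System.Collections.Generic;
using System.Linq;

// Interleaves documents across partition key values so the SDK's
// per-partition queues fill at roughly the same rate.
static IEnumerable<T> InterleaveByPartitionKey<T>(
    IEnumerable<T> docs, Func<T, string> partitionKeyOf)
{
    var cursors = docs
        .GroupBy(partitionKeyOf)
        .Select(g => g.GetEnumerator())
        .ToList();

    while (cursors.Count > 0)
    {
        // Walk backwards so exhausted groups can be removed in place.
        for (int i = cursors.Count - 1; i >= 0; i--)
        {
            if (cursors[i].MoveNext())
                yield return cursors[i].Current;
            else
                cursors.RemoveAt(i);
        }
    }
}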

One other suggestion: if at all possible, stream your data rather than batch it. Cosmos DB measures provisioned throughput per second (RU/s); you get only the amount of throughput you've provisioned, on a per-second basis. Anything over that and you are either rate limited (i.e. 429s) or operations run slower than they otherwise could. This is because throughput is evenly spread across all physical partitions: if you provision 60K RU/s you'll end up with ~6 physical partitions, and each of them gets 10K RU/s. There is no sharing of throughput between partitions. As a result, operations like this are less efficient than they could be.
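
When you do get rate limited, the SDK surfaces it as a CosmosException with status 429 and a server-suggested wait time; a minimal back-off sketch (the retry loop is my own illustration, not something the SDK requires):

using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Retries a single create when the request is rate limited (429).
static async Task CreateWithBackoffAsync<T>(Container container, T item, PartitionKey pk)
{
    while (true)
    {
        try
        {
            await container.CreateItemAsync(item, pk);
            return;
        }
        catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.TooManyRequests)
        {
            // Honor the server-suggested interval before retrying.
            await Task.Delay(ex.RetryAfter ?? TimeSpan.FromSeconds(1));
        }
    }
}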

One effective way to deal with all of this is to amortize your throughput over a longer period of time, and streaming is the perfect way to accomplish that. Streaming data generally requires less overall provisioned throughput because you utilize it over a longer window. It lets you work with a nominal amount of throughput, rather than having to scale up and back down, and is overall more efficient. If streaming from the source is not an option, you can accomplish the same thing with queue-based load leveling: first send the data to a queue, then stream from there into Cosmos DB.
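
A rough sketch of that pattern with an in-process Azure Functions queue trigger (the queue name, database/container names, and document shape are all placeholders, and the CosmosClient is assumed to be registered for dependency injection):

using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.WebJobs;

public record QueuedDocument(string id, string BatchId); // placeholder shape

public class IngestFromQueue
{
    private readonly Container _container;

    public IngestFromQueue(CosmosClient client) =>
        _container = client.GetContainer("my-db", "my-container"); // placeholder names

    // One queue message per document: ingestion is amortized over time,
    // so a nominal RU/s provision can absorb the whole load.
    [FunctionName("IngestFromQueue")]
    public Task Run([QueueTrigger("ingest-queue")] QueuedDocument doc) =>
        _container.CreateItemAsync(doc, new PartitionKey(doc.BatchId));
}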

Either way you go, these docs can help squeeze the most performance out of Azure Functions hosting bulk ingestion workloads into Cosmos DB.
