What is the best approach to synchronizing large database tables across EF Core contexts?


My Scenario

I have three warehouse databases (Firebird) numbered 1, 2 and 3, each sharing the same schema and the same DbContext class. The following is the model of the Products table:

public class Product
{
    public string Sku { get; set; }
    public string Barcode { get; set; }
    public int Quantity { get; set; }
}

I also have a local "Warehouse Cache" database (MySQL) into which I want to periodically download the contents of all three warehouses for caching purposes. The data model of a cached product is similar, with the addition of a field denoting the source warehouse. This table should contain all product information from all three warehouses. If a product appears in both warehouses 1 and 3 (same Sku), then I want two entries in the local cache table, each with the corresponding warehouse ID:

public class CachedProduct
{
    public int WarehouseId { get; set; } // Can be either 1, 2 or 3
    public string Sku { get; set; }
    public string Barcode { get; set; }
    public int Quantity { get; set; }
}

There are multiple possible solutions to this problem, but given the size of my datasets (~20k entries per warehouse), none of them seem viable or efficient, and I'm hoping that someone could give me a better solution.

The problem

If the local cache database is empty, then it's easy: just download all products from all three warehouses and dump them into the cache DB. However, on subsequent synchronizations the cache DB will no longer be empty. In this case, I don't want to add all 60k products again, because that would be a tremendous waste of storage space. Instead, I want to "upsert" the incoming data into the cache: new products are inserted normally, but if a product already exists in the cache (matching Sku and WarehouseId), then I only want to update the corresponding record (e.g. the Quantity could have changed in one of the warehouses since the last sync). This way, the number of records in the cache DB will always be exactly the sum of the three warehouses; never more and never less.

Things I've tried so far

The greedy method: This one is probably the simplest. For each product in each warehouse, check if a matching record exists in the cache table. If it does then update, otherwise insert. The obvious problem is that there is no way to batch/optimize this, and it would result in tens of thousands of select, insert and update calls being executed on each synchronization.
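
In code, the greedy method looks roughly like this (a sketch; cacheContext and the CachedProducts DbSet are assumed names, and it needs using Microsoft.EntityFrameworkCore;):

// Greedy sync for one warehouse: one SELECT per product, then an INSERT or
// UPDATE. At ~60k products this means tens of thousands of round trips.
foreach (var p in warehouseProducts)
{
    var existing = await cacheContext.CachedProducts
        .FirstOrDefaultAsync(c => c.Sku == p.Sku && c.WarehouseId == warehouseId);

    if (existing == null)
    {
        cacheContext.CachedProducts.Add(new CachedProduct
        {
            WarehouseId = warehouseId,
            Sku = p.Sku,
            Barcode = p.Barcode,
            Quantity = p.Quantity
        });
    }
    else
    {
        existing.Barcode = p.Barcode;
        existing.Quantity = p.Quantity;
    }
}
await cacheContext.SaveChangesAsync();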

Clearing the cache: Clear the local cache DB before every synchronization, and re-download all the data. My problem with this one is that it leaves a small window of time when no cache data is available, which might cause problems in other parts of the application.

Using an EF Core "upsert" library: This one seemed the most promising, with the FlexLabs.Upsert library, since it appears to support batched operations. Unfortunately the library seems to be broken: I couldn't even get its own minimal example to work properly, as a new row is inserted on every "upsert", regardless of the matching rule.
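
For reference, the batched call I was attempting looked roughly like this, based on the library's documented API (incoming is the list of CachedProduct rows built from one warehouse download):

// FlexLabs.Upsert is supposed to translate this into a single batched
// "INSERT ... ON DUPLICATE KEY UPDATE" statement on MySQL.
await cacheContext.CachedProducts
    .UpsertRange(incoming)
    .On(c => new { c.Sku, c.WarehouseId }) // the match rule
    .RunAsync();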

Avoiding EF Core completely: I found a library called Dotmim.Sync that does DB-to-DB synchronization. The main issue is that the warehouses run Firebird, which doesn't seem to be supported by this library. I'm also not sure whether it allows data transformation, since I have to add the WarehouseId value before a row is written to the cache DB.


Is there a way to do this efficiently in EF Core?

CodePudding user response:

There are a couple of options here. Which ones are viable depends on your staleness constraints for the cache: must the cache always reflect the warehouse state exactly, or can it drift out of sync for a period of time?

First, you absolutely should not use EF Core for this, except possibly as a client library for running raw SQL. EF Core is optimized for many small transactions; it doesn't do well with batch workloads.

The 'best' option is probably an event-based system. Firebird supports emitting events to an event listener, which would then update the cache as changes happen. The risk is that if event processing fails, the cache could drift out of sync. You could mitigate that by putting an event bus of some sort (RabbitMQ, Kafka) in the middle, but the Firebird event handling itself would remain the weak link.
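
A sketch of what the listener side could look like with the FirebirdSql.Data.FirebirdClient provider, assuming a trigger in the warehouse database does POST_EVENT 'product_changed' (verify the exact API against your provider version):

using FirebirdSql.Data.FirebirdClient;

// Subscribe to events posted by triggers in the warehouse database.
var listener = new FbRemoteEvent(warehouseConnectionString);
listener.RemoteEventCounts += (sender, e) =>
{
    // e.Name identifies the event, e.Counts how often it fired since the
    // last notification; re-read the changed rows and update the cache here.
};
listener.Open();
listener.QueueEvents("product_changed");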

If the cache can tolerate some inconsistency, you could attach an expiry timestamp to each cache entry. Your application hits the cache, and if the expiry has passed, it re-checks the warehouse DBs. Depending on the business processes that update the source-of-truth databases, you may also be able to bust cache entries directly (e.g. if there's an order management system, it can bust the cache for a line item when someone places an order).
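
A minimal sketch of that read path, assuming a hypothetical ExpiresAt column is added to CachedProduct:

// Read-through with expiry: serve from cache unless the entry is stale.
var cached = await cacheContext.CachedProducts
    .FirstOrDefaultAsync(c => c.Sku == sku && c.WarehouseId == warehouseId);

if (cached == null || cached.ExpiresAt < DateTime.UtcNow)
{
    // Stale or missing: re-read this product from the warehouse DB,
    // upsert the cache row, and push ExpiresAt forward by the TTL.
}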

If you have to batch sync, use a swap table. Set up one table with the live cache data, a separate table you load the new cache data into, and a flag in your application that says which one to read from. You read from table A while you load into B; when the load is done, you flip the flag and read from table B.
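
Since the cache lives in MySQL, you can also let the database do the flip instead of an application flag: RENAME TABLE swaps several tables atomically, so readers never see an empty table. A sketch with hypothetical table names:

// Bulk-load the fresh snapshot into cached_products_staging first, then:
await cacheContext.Database.ExecuteSqlRawAsync(
    @"RENAME TABLE cached_products TO cached_products_old,
                   cached_products_staging TO cached_products");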

CodePudding user response:

For now I ended up going with a simple yet effective solution that stays fully within EF Core.

For each cache entry, I also maintain a SyncIndex column. During synchronization, I download all products from all three warehouses, set their SyncIndex to max(cache.SyncIndex) + 1, and dump them into the cache database. Then I delete all entries from the cache with an older SyncIndex. This way I always have some cache data available, I don't waste a lot of space, and the speed is quite acceptable too.
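
Roughly, each sync run looks like this (a sketch; freshRows is the combined download from all three warehouses, and ExecuteDeleteAsync needs EF Core 7+, older versions can issue the DELETE as raw SQL instead):

// 1. Pick the next sync generation.
int next = (await cacheContext.CachedProducts
    .MaxAsync(c => (int?)c.SyncIndex) ?? 0) + 1;

// 2. Insert the fresh snapshot, tagged with the new generation.
cacheContext.CachedProducts.AddRange(freshRows.Select(r => new CachedProduct
{
    WarehouseId = r.WarehouseId,
    Sku = r.Sku,
    Barcode = r.Barcode,
    Quantity = r.Quantity,
    SyncIndex = next
}));
await cacheContext.SaveChangesAsync();

// 3. Delete everything from older generations in a single statement.
await cacheContext.CachedProducts
    .Where(c => c.SyncIndex < next)
    .ExecuteDeleteAsync();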
