How to update bulk data using Entity Framework?


I have a table in my Oracle database named profile-list. This table has 30K rows. Users want to download this data as an Excel file and make changes. Once the changes are complete, they want to upload the Excel file through a form and have the table updated.

How can I do this using Entity Framework?

var bulkData = ReadFromExcel(filePath);

using(var context = new DataCtx())
{
    var dbData = context.ProfileData.ToList();

    foreach(var item in dbData)
    {
        if (bulkData.Any(x => x.id == item.id))
        {
             // update
        }
        else
        {
             // insert
        }
    }
}

But this takes a very long time. How can I do this faster?

CodePudding user response:

You do not. Or - you do not with EF. EF is an object/relational mapper and its bulk operation capabilities are extremely limited.

If that is a "one of" (i.e. one specific use case, not for every table, but also one that happens more often) operation, I would likely define a table values parameter, and use a SP with that as input to bulk push updates if I need performance. It is quite trivial (couple of hours work) to define a mapper that maps that, generically.

For MORE data, things would get more complicated. But then we are talking about regularly doing 50k+ updates across many threads in parallel - way above your use case.

But literally, EF has a very specific use case. It does NOT handle bulk operations. Here is a hint, though - it is FAR from useless even so. It takes 95% of the work out of 80% of the scenarios for handling data, including all the extremely flexible queries. That leaves you the time to find and implement elegant solutions for the rest. And your case is exactly one of those EF does not handle efficiently.

Now, if you want to speed up your code, there is a fundamental error in it, and if there are only a few changes, fixing that alone may cut your processing time considerably:

var dbData = context.ProfileData.ToList();

That is a list. You then use

if (bulkData.Any(x => x.id == item.id))

in a loop to find each row. That is VERY slow - on average, every ID has to run through half the list.

So, instead of doing a ToList, use ToDictionary with the id as the key. In the loop you can then look up the individual items a LOT more efficiently. A lot. Because the lookup is then not scanning half the list for every item.

ToDictionary(x => x.id)

Then at the end do ONE SaveChanges - but that is where you may POSSIBLY still end up spending a long time. Now, if there are not that many changes, this may be a lot more efficient and already put you into "ok, usable" territory. It still has to track changes (which takes time), but at least you no longer have that really slow scanning of the list.
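A minimal sketch of that change, assuming the Excel rows carry the same id and properties as the entity (the Name property and the ProfileData class name are assumptions); note the loop now walks the Excel rows rather than the table, which is what an insert-or-update pass needs:

var bulkData = ReadFromExcel(filePath);

using (var context = new DataCtx())
{
    // Dictionary keyed by id: each lookup is O(1) instead of scanning the list.
    var dbData = context.ProfileData.ToDictionary(x => x.id);

    foreach (var item in bulkData)
    {
        if (dbData.TryGetValue(item.id, out var existing))
        {
            // update: copy the changed fields onto the tracked entity
            existing.Name = item.Name;   // hypothetical property
        }
        else
        {
            context.ProfileData.Add(new ProfileData { id = item.id, Name = item.Name });
        }
    }

    context.SaveChanges();   // ONE SaveChanges at the end, as described above
}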

CodePudding user response:

EF is not that well suited to bulk operations, and in essence bulk operations aren't something you would generally want exposed as an action on a web app, or anything a user might "kick off": it will take a while, tie up resources on the server, and lock rows, potentially leading to deadlocks.

Still, if these risks are acceptable and this action is something that will be done infrequently by authorized, responsible users, then you can implement something with EF.

The biggest problem with this approach is loading everything at once. Exporting and re-importing data isn't an ideal way to offer data editing, because you have no way of optimizing for what the users actually changed. If they edited 10 rows out of the 30k, you're still processing 30k rows. It is far, far better to give them the ability to query the data and make changes through the application than to export & re-import.

Processing 30k rows in one DbContext instance, whether you use one SaveChanges() at the end or call SaveChanges() on each row update, is going to be a noticeable performance drag. Rather than loading the entire table:

  1. Read your Excel data into memory.
  2. Fetch the data 100 rows at a time.
  3. Update the rows, being sure to validate each row.
  4. SaveChanges() on that batch.
  5. Handle exceptions.
  6. Flush (dispose and recreate) the DbContext instance between batches.

So something more like:

var bulkData = ReadFromExcel(filePath); // (#1)

for(int count = 0; count < bulkData.Count; count += 100)
{
    var sourceData = bulkData.Skip(count).Take(100); // (#2)
    using (var context = new DataCtx())
    {
        try
        {
            var ids = sourceData.Select(x => x.id).ToList(); // Get IDs to load.
            var data = context.ProfileData.Where(x => ids.Contains(x.id)).ToList();
            foreach(var source in sourceData)
            {
                var dest = data.SingleOrDefault(x => x.id == source.id);
                if (dest == null)
                {
                    // add.
                }
                else
                {
                    // validate and update dest with details from source (#3)
                }
             }
             context.SaveChanges(); // will commit this batch of 100. (#4)
        }
        catch(Exception e)
        { // What should happen if something slips through validation? these ~100 items won't be saved. (#5)
        }
    } // Context is disposed with each batch of 100 (#6)

} // Loop to next 100.

You may want to consider loading only a subset of the Excel data into memory at once, for instance 10k rows at a time or fewer, to avoid too big a memory hit. It depends on when and how often this might actually occur.
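If the Excel reader can stream rows instead of materializing everything, a small chunking helper also avoids re-walking the list with Skip/Take on every pass of the loop above. This is only a sketch (on .NET 6+, Enumerable.Chunk does the same thing out of the box):

using System.Collections.Generic;

// Sketch: yield fixed-size batches from any sequence in a single pass,
// rather than re-enumerating it per batch the way Skip/Take does.
static IEnumerable<List<T>> Chunk<T>(IEnumerable<T> source, int size)
{
    var batch = new List<T>(size);
    foreach (var item in source)
    {
        batch.Add(item);
        if (batch.Count == size)
        {
            yield return batch;
            batch = new List<T>(size);
        }
    }
    if (batch.Count > 0)
        yield return batch;
}

// Usage with the loop above:
// foreach (var sourceData in Chunk(bulkData, 100)) { /* new DataCtx() per batch */ }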

The biggest risk of something like batch processing is allowing it to be triggered by a user. What is to stop 100 users from uploading copies of the data at the same time, or uploading data while other users are busy reading/updating it, tripping row locks and deadlocks? At a minimum, if this operation can be user-triggered, the act of uploading an Excel document for an update should save the file, queue a request to process it, and return a token that the user can submit in a polling request to check on the status of their upload. A single background worker can ensure that only one upload is ever processed at a time, or even leave it queued until a scheduled time of day to run without impacting users.
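One possible shape of that "upload, queue, poll" flow in ASP.NET Core is a channel drained by a single BackgroundService. This is an illustrative sketch only; UploadQueue, the status strings, and the tuple layout are made-up names, not a prescribed design:

using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// The upload controller saves the file, calls Enqueue, and returns the token;
// a status endpoint reads Status[token] for polling requests.
public class UploadQueue
{
    private readonly Channel<(Guid Token, string FilePath)> _channel =
        Channel.CreateUnbounded<(Guid Token, string FilePath)>();

    public readonly ConcurrentDictionary<Guid, string> Status = new();

    public Guid Enqueue(string filePath)
    {
        var token = Guid.NewGuid();
        Status[token] = "Queued";
        _channel.Writer.TryWrite((token, filePath));
        return token;
    }

    public ChannelReader<(Guid Token, string FilePath)> Reader => _channel.Reader;
}

public class UploadWorker : BackgroundService
{
    private readonly UploadQueue _queue;
    public UploadWorker(UploadQueue queue) => _queue = queue;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Single consumer: only one upload is ever processed at a time.
        await foreach (var (token, filePath) in _queue.Reader.ReadAllAsync(stoppingToken))
        {
            _queue.Status[token] = "Processing";
            // ... run the batched import from the example above against filePath ...
            _queue.Status[token] = "Done";
        }
    }
}

Wiring it up would be the usual services.AddSingleton<UploadQueue>() plus services.AddHostedService<UploadWorker>() at startup.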
