Slow process to update 14 million rows


I need to update the values and location/other information for 14 million records, which I have in interval_list. On each iteration I use the MeterID from the Interval object to look up the correct location and multiplier, then update the location and adjust the value based on the multiplier. This code takes 4 hours to run. Is there a better way to do this? Thanks.

foreach (Interval interval in interval_list)
{       
    var result = from m in meterList
                 where m.MeterID.Equals(interval.MeterID)
                 where m.StartDate < (interval.UTCDateTime.ToLocalTime())
                 where m.FinalDate > (interval.UTCDateTime.ToLocalTime())
                 select m;

    if (result.Count() == 1) //Ignore if > 1
    {
        int mult1 = result.ElementAt(0).Mult1;
        int mult2 = result.ElementAt(0).Mult2;
        interval.ServiceID = result.ElementAt(0).Locserv;
                                       
        // With updating check to see if value is already adjusted by mult

        if (interval.Mult1 > 1)
        {
            interval.Value = interval.Value / interval.Mult1;
            interval.Value = interval.Value * mult1;
        }

        interval.Mult1 = mult1; //MULT1
        interval.Mult2 = mult2; //MULT2
    }
}

public class Interval
{
    public string MeterID { get; set; }
    public DateTime UTCDateTime { get; set; }
    public float Value { get; set; }
    public DateTime LocalDateTime { get; set; }
    public string ServiceID { get; set; }
    public string Account { get; set; }
    public string Rate { get; set; }
    public int Mult1 { get; set; }
    public int Mult2 { get; set; }
}

public class Meters
{
    public string MeterID { get; set; }
    public string Locserv { get; set; }
    public DateTime StartDate { get; set; }
    public DateTime FinalDate { get; set; }
    public int Mult1 { get; set; }
    public int Mult2 { get; set; }
}

CodePudding user response:

You probably have a CPU bottleneck: roughly speaking, all of your code runs on a single CPU core, so if you have more cores, you can put them to work with Parallel.ForEach. (But if the bottleneck is not the CPU, this won't help.)

Next, you evaluate interval.UTCDateTime.ToLocalTime() twice per iteration - evaluate it once, before the LINQ query, instead.

Parallel.ForEach(
    interval_list,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    interval =>
{
    // Convert to local time once per interval instead of twice.
    var intervalUtcDateTimeLocalTime = interval.UTCDateTime.ToLocalTime();

    // Materialize the query once so Count()/indexing don't re-run it.
    var result = (from m in meterList
                  where m.MeterID.Equals(interval.MeterID)
                  where m.StartDate < intervalUtcDateTimeLocalTime
                  where m.FinalDate > intervalUtcDateTimeLocalTime
                  select new { m.Mult1, m.Mult2, m.Locserv }).ToList();

    if (result.Count == 1) // Ignore if > 1
    {
        var match = result[0];
        interval.ServiceID = match.Locserv;

        // When updating, check whether the value was already adjusted by a multiplier.
        if (interval.Mult1 > 1)
        {
            interval.Value = interval.Value / interval.Mult1;
            interval.Value = interval.Value * match.Mult1;
        }

        interval.Mult1 = match.Mult1; // MULT1
        interval.Mult2 = match.Mult2; // MULT2
    }
});

(Unfortunately, I haven't done this in a long time, so please correct me if I missed something.)

Thanks to @sebastian-siemens' comment:

It is also more efficient to project only the values you need directly in the LINQ query, e.g. select new { m.Mult1, m.Mult2, m.Locserv }, instead of selecting whole Meters objects.

CodePudding user response:

The basic problem is that, given the size of interval_list mentioned in the question, you are running 14 million individual queries.

Even if your round-trip time to the database server is only 1 ms, I would not expect this to complete in less than about 3.9 hours (14,000,000 / 1000 / 60 / 60 ≈ 3.89).

The best way to improve the overall speed is to perform batch retrieval (select the data for multiple intervals in a single query). Create a configuration option so you can test out different batch sizes such as 10, 100, 1000, etc., and update the query to obtain all records for a batch in one round trip. You might find the ActionBlock class useful for creating the pipeline and controlling the level of concurrency of the database queries; a rough sketch follows.
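A minimal sketch of that pipeline, reusing the Interval and Meters classes from the question. The loadMetersForBatch delegate is a hypothetical placeholder for your own data access (one query returning every meter row for a batch of MeterIDs), batchSize and maxParallelBatches are the configuration knobs to experiment with, Enumerable.Chunk requires .NET 6+, and ActionBlock comes from the System.Threading.Tasks.Dataflow package:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet package: System.Threading.Tasks.Dataflow

public static class IntervalBatchUpdater
{
    // loadMetersForBatch is a hypothetical placeholder: it should run ONE query
    // that returns every Meters row whose MeterID appears in the given batch.
    public static async Task ProcessAsync(
        IEnumerable<Interval> intervals,
        Func<IReadOnlyCollection<string>, Task<List<Meters>>> loadMetersForBatch,
        int batchSize = 1000,
        int maxParallelBatches = 4)
    {
        var worker = new ActionBlock<Interval[]>(async batch =>
        {
            // One round trip per batch instead of one per interval.
            var meterIds = batch.Select(i => i.MeterID).Distinct().ToList();
            List<Meters> meters = await loadMetersForBatch(meterIds);

            foreach (Interval interval in batch)
            {
                DateTime localTime = interval.UTCDateTime.ToLocalTime();

                var matches = meters
                    .Where(m => m.MeterID == interval.MeterID
                             && m.StartDate < localTime
                             && m.FinalDate > localTime)
                    .Take(2) // enough to distinguish "exactly one" from "more than one"
                    .ToList();

                if (matches.Count == 1) // ignore ambiguous matches, as in the original code
                {
                    Meters meter = matches[0];
                    interval.ServiceID = meter.Locserv;

                    // Undo a previously applied multiplier before applying the new one.
                    if (interval.Mult1 > 1)
                        interval.Value = interval.Value / interval.Mult1 * meter.Mult1;

                    interval.Mult1 = meter.Mult1;
                    interval.Mult2 = meter.Mult2;
                }
            }
        },
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxParallelBatches });

        foreach (Interval[] batch in intervals.Chunk(batchSize)) // Chunk: .NET 6+
            worker.Post(batch);

        worker.Complete();
        await worker.Completion;
    }
}

Start with a modest MaxDegreeOfParallelism so you do not overwhelm the database with concurrent batch queries, and measure different batch sizes to find the sweet spot.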
