I need to update the values and location/other information for 14 million records. I have the 14 million records in interval_list. On each iteration I use the MeterID from the Interval object to get the correct location and multiplier. I then update the location and update the value based on the multiplier. This code takes 4 hours to run. Is there a better way to do this? Thanks.
foreach (Interval interval in interval_list)
{
    var result = from m in meterList
                 where m.MeterID.Equals(interval.MeterID)
                 where m.StartDate < (interval.UTCDateTime.ToLocalTime())
                 where m.FinalDate > (interval.UTCDateTime.ToLocalTime())
                 select m;

    if (result.Count() == 1) // Ignore if > 1
    {
        int mult1 = result.ElementAt(0).Mult1;
        int mult2 = result.ElementAt(0).Mult2;
        interval.ServiceID = result.ElementAt(0).Locserv;

        // When updating, check whether the value is already adjusted by the multiplier
        if (interval.Mult1 > 1)
        {
            interval.Value = interval.Value / interval.Mult1;
            interval.Value = interval.Value * mult1;
        }

        interval.Mult1 = mult1; // MULT1
        interval.Mult2 = mult2; // MULT2
    }
}
public class Interval
{
    public string MeterID { get; set; }
    public DateTime UTCDateTime { get; set; }
    public float Value { get; set; }
    public DateTime LocalDateTime { get; set; }
    public string ServiceID { get; set; }
    public string Account { get; set; }
    public string Rate { get; set; }
    public int Mult1 { get; set; }
    public int Mult2 { get; set; }
}

public class Meters
{
    public string MeterID { get; set; }
    public string Locserv { get; set; }
    public DateTime StartDate { get; set; }
    public DateTime FinalDate { get; set; }
    public int Mult1 { get; set; }
    public int Mult2 { get; set; }
}
CodePudding user response:
Probably you have a CPU bottleneck: roughly speaking, all of your code runs on a single CPU core, so if you have more cores, let's utilize them with Parallel.ForEach. (But if the bottleneck is not the CPU, this won't help much.)
Next thing: you evaluate interval.UTCDateTime.ToLocalTime() twice for every meter the query examines, so compute that value once, before the LINQ query.
Parallel.ForEach(
    interval_list,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    interval =>
    {
        var intervalUtcDateTimeLocalTime = interval.UTCDateTime.ToLocalTime();
        var result = from m in meterList
                     where m.MeterID.Equals(interval.MeterID)
                     where m.StartDate < intervalUtcDateTimeLocalTime
                     where m.FinalDate > intervalUtcDateTimeLocalTime
                     select new { m.Mult1, m.Mult2, m.Locserv };

        if (result.Count() == 1) // Ignore if > 1
        {
            int mult1 = result.ElementAt(0).Mult1;
            int mult2 = result.ElementAt(0).Mult2;
            interval.ServiceID = result.ElementAt(0).Locserv;

            // When updating, check whether the value is already adjusted by the multiplier
            if (interval.Mult1 > 1)
            {
                interval.Value = interval.Value / interval.Mult1;
                interval.Value = interval.Value * mult1;
            }

            interval.Mult1 = mult1; // MULT1
            interval.Mult2 = mult2; // MULT2
        }
    });
(Unfortunately, I haven't done this in a long time; please correct me if I missed something.)
Thanks to @sebastian-siemens's comment: it is also more efficient to select only the desired values in the LINQ query itself, i.e. select new { m.Mult1, m.Mult2, m.Locserv } as done above, rather than projecting whole Meters objects.
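Note also that, because LINQ-to-Objects queries are lazily re-evaluated, the code above runs the query up to four times: once for Count() and once for each ElementAt(0). A minimal sketch of materializing it once instead (same behavior as above assumed):

var matches = (from m in meterList
               where m.MeterID.Equals(interval.MeterID)
                  && m.StartDate < intervalUtcDateTimeLocalTime
                  && m.FinalDate > intervalUtcDateTimeLocalTime
               select new { m.Mult1, m.Mult2, m.Locserv })
              .Take(2)   // enough to decide whether exactly one meter matched
              .ToList(); // run the query once

if (matches.Count == 1) // Ignore if > 1
{
    var meter = matches[0];
    interval.ServiceID = meter.Locserv;
    // ... same multiplier handling as above, using meter.Mult1 and meter.Mult2
}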
CodePudding user response:
The basic problem is that you are running 14 million queries, one per entry in interval_list. Even if your round-trip time to the database server is only 1 ms, the job cannot complete in less than about 3.9 hours (14,000,000 / 1000 / 60 / 60 ≈ 3.89).
The best way to improve the overall speed is to perform batch retrieval: select multiple intervals in a single query. Create a configuration option so you can test out different batch sizes such as 10, 100, and 1000, and update the query to obtain all records for a batch in one round trip. At a batch size of 1000, for example, the 14 million queries become 14,000, so the same 1 ms round trip costs about 14 seconds in total rather than hours. You might find the ActionBlock class (TPL Dataflow) useful for creating the pipeline and controlling the level of concurrency for the database queries; a sketch follows.
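A minimal sketch of that pipeline, assuming .NET 6+ (for Enumerable.Chunk), the System.Threading.Tasks.Dataflow NuGet package, and a hypothetical loadMetersAsync delegate that fetches every meter for a batch of IDs in one query; batchSize and maxParallelQueries are tuning knobs, not values from the question:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet package: System.Threading.Tasks.Dataflow

public static class BatchUpdater
{
    public static async Task ProcessAsync(
        List<Interval> intervalList,
        // Hypothetical: one database query returning every meter for a set of IDs.
        Func<IReadOnlyList<string>, Task<List<Meters>>> loadMetersAsync,
        int batchSize = 1000,
        int maxParallelQueries = 4)
    {
        var block = new ActionBlock<Interval[]>(
            async batch =>
            {
                // One round trip for the whole batch instead of one per interval.
                var ids = batch.Select(i => i.MeterID).Distinct().ToList();
                var meters = await loadMetersAsync(ids);

                // Index by MeterID so each per-interval lookup is cheap.
                var byId = meters.ToLookup(m => m.MeterID);

                foreach (var interval in batch)
                {
                    var local = interval.UTCDateTime.ToLocalTime();
                    var matches = byId[interval.MeterID]
                        .Where(m => m.StartDate < local && m.FinalDate > local)
                        .Take(2)
                        .ToList();
                    if (matches.Count != 1) continue; // Ignore if > 1, as in the question

                    var meter = matches[0];
                    interval.ServiceID = meter.Locserv;
                    if (interval.Mult1 > 1)
                        interval.Value = interval.Value / interval.Mult1 * meter.Mult1;
                    interval.Mult1 = meter.Mult1;
                    interval.Mult2 = meter.Mult2;
                }
            },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxParallelQueries });

        foreach (var batch in intervalList.Chunk(batchSize)) // Chunk: .NET 6+
            block.Post(batch);

        block.Complete();
        await block.Completion;
    }
}

Here MaxDegreeOfParallelism bounds how many batch queries are in flight at once, so batch size and concurrency can be tuned independently until the database, not the loop, becomes the limiting factor.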