Copy LINQ query results into existing DataTable with faster performance


I am looking for a way to improve the performance of some C# code and would appreciate any help.

There are two tables, Table_1 and Table_2, and I want to collect data from Table_2 and save it in Table_1 in the following form:

Table_1

Date Some_Stat
2021-06-01 23.7
2021-06-02 12.6
2021-06-03 47.9

Table_2

Date ID A B C
2021-06-02 4 21 23 13
2021-06-02 3 67 31 25
2021-06-01 3 45 54 33
2021-06-03 3 71 28 51
2021-06-03 4 26 83 24

My goal: Table_1 after join

Date Some_Stat A_3 B_3 C_3 A_4 B_4 C_4
2021-06-01 23.7 45 54 33 0 0 0
2021-06-02 12.6 67 31 25 21 23 13
2021-06-03 47.9 71 28 51 26 83 24
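
For reference, this is roughly how the sample tables can be set up (a sketch; I am assuming string Dates and double stat columns, matching the join code below):

```csharp
using System.Data;

var Table_1 = new DataTable();
Table_1.Columns.Add("Date", typeof(string));
Table_1.Columns.Add("Some_Stat", typeof(double));
Table_1.Rows.Add("2021-06-01", 23.7);
Table_1.Rows.Add("2021-06-02", 12.6);
Table_1.Rows.Add("2021-06-03", 47.9);

var Table_2 = new DataTable();
Table_2.Columns.Add("Date", typeof(string));
Table_2.Columns.Add("ID", typeof(int));
Table_2.Columns.Add("A", typeof(double));
Table_2.Columns.Add("B", typeof(double));
Table_2.Columns.Add("C", typeof(double));
Table_2.Rows.Add("2021-06-02", 4, 21.0, 23.0, 13.0);
Table_2.Rows.Add("2021-06-02", 3, 67.0, 31.0, 25.0);
Table_2.Rows.Add("2021-06-01", 3, 45.0, 54.0, 33.0);
Table_2.Rows.Add("2021-06-03", 3, 71.0, 28.0, 51.0);
Table_2.Rows.Add("2021-06-03", 4, 26.0, 83.0, 24.0);
```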

In order to achieve this I used the code below, which works but is very slow, especially when there are thousands of IDs. The code first performs the required join and then transfers the LINQ join results into the existing Table_1 (copying the columns). I have measured the timings, and the LINQ query itself is always very fast (time: 0 ms); the data transfer is the problem.

List<int> ids = Table_2.AsEnumerable().Select(s => s.Field<int>("ID")).Distinct().ToList();

for (int i = 0; i < ids.Count; i++)
{
    Table_1.Columns.Add($"A_{ids[i]}", typeof(double));
    Table_1.Columns.Add($"B_{ids[i]}", typeof(double));
    Table_1.Columns.Add($"C_{ids[i]}", typeof(double));
}

for (int i = 0; i < ids.Count; i++)
{
   // LINQ join (fast)
   var joinedTables = from T1 in Table_1.AsEnumerable()
            join T2 in Table_2.Select($"ID = {ids[i]}").AsEnumerable()
            on (String)T1["Date"] equals (String)T2["Date"]
            into T1_and_T2
            from TT in T1_and_T2.DefaultIfEmpty()
            select new
              {
                 Date = (String)T1["Date"],
                 A = TT != null ? (double)TT["A"] : 0.0,
                 B = TT != null ? (double)TT["B"] : 0.0,
                 C = TT != null ? (double)TT["C"] : 0.0,
              };
   // data transfer (very slow)
   for (int day = 0; day < joinedTables.Count(); day++)
   {
     Table_1.Rows[day][$"A_{ids[i]}"] = joinedTables.ElementAt(day).A;
     Table_1.Rows[day][$"B_{ids[i]}"] = joinedTables.ElementAt(day).B;
     Table_1.Rows[day][$"C_{ids[i]}"] = joinedTables.ElementAt(day).C;
   }
}
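
One thing I noticed while profiling: joinedTables is a deferred LINQ query, so every call to Count() and ElementAt(day) re-executes the whole join. Materializing the result once should at least avoid that repeated enumeration (same transfer loop, sketched against the query above):

```csharp
// Materialize the deferred query once, so the join runs a single time
var rows = joinedTables.ToList();
for (int day = 0; day < rows.Count; day++)
{
    Table_1.Rows[day][$"A_{ids[i]}"] = rows[day].A;
    Table_1.Rows[day][$"B_{ids[i]}"] = rows[day].B;
    Table_1.Rows[day][$"C_{ids[i]}"] = rows[day].C;
}
```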

Instead of the data transfer version above I have also tried another way, but it is just as slow:

int day = 0;
foreach (var row in joinedTables)
{
  Table_1.Rows[day][$"A_{ids[i]}"] = row.A;
  Table_1.Rows[day][$"B_{ids[i]}"] = row.B;
  Table_1.Rows[day][$"C_{ids[i]}"] = row.C;
  day++;
}

Note: I am also open to new approaches for collecting data from Table_2 into Table_1. Perhaps there is a way to use built-in functions (implemented in native C or C++) that avoid looping through rows entirely, for example a function that copies whole columns from one table to another, as in Python.

CodePudding user response:

My recommendation is to use a different approach. A DataTable is not a particularly fast object, and looking up columns by name to set values is slow. Creating a new DataTable to replace Table_1 can be much faster, since you can use the DataRowCollection.Add() method to add rows quickly.

Using a Dictionary to index Table_2 also allows for much faster lookups than ElementAt.

var joinDict = (from T2 in Table_2.AsEnumerable()
                select new {
                    Date = T2.Field<string>("Date"),
                    ID = T2.Field<int>("ID"),
                    A = T2.Field<double>("A"),
                    B = T2.Field<double>("B"),
                    C = T2.Field<double>("C"),
                })
                .ToDictionary(t2 => (t2.Date, t2.ID));

List<int> ids = Table_2.AsEnumerable().Select(s => s.Field<int>("ID")).Distinct().OrderBy(x => x).ToList();

var ans = Table_1.Clone();
for (int i = 0; i < ids.Count; i++) {
    ans.Columns.Add($"A_{ids[i]}", typeof(double));
    ans.Columns.Add($"B_{ids[i]}", typeof(double));
    ans.Columns.Add($"C_{ids[i]}", typeof(double));
}

foreach (DataRow row in Table_1.Rows) {
    var newRow = new List<object> { row.Field<string>("Date") };
    foreach (var id in ids) {
        if (joinDict.TryGetValue((row.Field<string>("Date"), id), out var t2))
            newRow.AddRange(new object[] { t2.A, t2.B, t2.C });
        else
            newRow.AddRange(new object[] { 0.0, 0.0, 0.0 });
    }
    ans.Rows.Add(newRow.ToArray());
}
Table_1 = ans;

In testing with 100 days in Table_1 and 500 rows per day in Table_2 (75% populated), I get about a 128x speedup.
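
(For anyone reproducing the comparison, I timed both versions with System.Diagnostics.Stopwatch along these lines; the exact numbers will of course vary by machine:)

```csharp
using System.Diagnostics;

var sw = Stopwatch.StartNew();
// ... run the rebuild shown above ...
sw.Stop();
Console.WriteLine($"Rebuild took {sw.ElapsedMilliseconds} ms");
```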

CodePudding user response:

You could use the ToLookup LINQ operator to create an efficient Lookup<DateTime, DataRow> structure from the contents of Table_2. This read-only structure contains one IGrouping<DateTime, DataRow> for each unique date, and each grouping contains all the DataRows associated with that date:

var lookup = Table_2.AsEnumerable().ToLookup(r => r.Field<DateTime>("Date"));

Then, for every row of Table_1, you can quickly find all the associated rows of Table_2:

foreach (DataRow row1 in Table_1.Rows)
{
    DateTime date = row1.Field<DateTime>("Date");
    IEnumerable<DataRow> group = lookup[date];
    if (group == null) continue;
    foreach (DataRow row2 in group)
    {
        int id = row2.Field<int>("ID");
        row1[$"A_{id}"] = row2.Field<double>("A");
        row1[$"B_{id}"] = row2.Field<double>("B");
        row1[$"C_{id}"] = row2.Field<double>("C");
    }
}

Update: It seems that your performance problem is not caused by joining the two DataTables, but by updating an extremely wide DataTable that contains hundreds or even thousands of DataColumns. Apparently DataTables are not optimized for this scenario: the complexity of updating the nth column of a DataRow is O(n²). To work around this you can export all the values stored in a DataRow through the ItemArray property, manipulate those values, and finally import them back through the same property. Performance can be improved even further by using the BeginLoadData and EndLoadData methods.

Table_1.BeginLoadData();
foreach (DataRow row1 in Table_1.Rows)
{
    DateTime date = row1.Field<DateTime>("Date");
    IEnumerable<DataRow> group = lookup[date];
    if (group == null) continue;
    object[] values = row1.ItemArray; // Export the raw data
    foreach (DataRow row2 in group)
    {
        int id = row2.Field<int>("ID");
        values[Table_1.Columns[$"A_{id}"].Ordinal] = row2.Field<double>("A");
        values[Table_1.Columns[$"B_{id}"].Ordinal] = row2.Field<double>("B");
        values[Table_1.Columns[$"C_{id}"].Ordinal] = row2.Field<double>("C");
    }
    row1.ItemArray = values; // Import the updated data
}
Table_1.EndLoadData();