I am looking for a way to improve the performance of some C# code and would appreciate any help.
There are two tables, Table_1 and Table_2, and I want to collect data from Table_2 and save it in Table_1 in the following form:
Table_1
Date | Some_Stat |
---|---|
2021-06-01 | 23.7 |
2021-06-02 | 12.6 |
2021-06-03 | 47.9 |
Table_2
Date | ID | A | B | C |
---|---|---|---|---|
2021-06-02 | 4 | 21 | 23 | 13 |
2021-06-02 | 3 | 67 | 31 | 25 |
2021-06-01 | 3 | 45 | 54 | 33 |
2021-06-03 | 3 | 71 | 28 | 51 |
2021-06-03 | 4 | 26 | 83 | 24 |
My goal: Table_1 after join
Date | Some_Stat | A_3 | B_3 | C_3 | A_4 | B_4 | C_4 |
---|---|---|---|---|---|---|---|
2021-06-01 | 23.7 | 45 | 54 | 33 | 0 | 0 | 0 |
2021-06-02 | 12.6 | 67 | 31 | 25 | 21 | 23 | 13 |
2021-06-03 | 47.9 | 71 | 28 | 51 | 26 | 83 | 24 |
To achieve this I used the code below, which works but is very slow, especially when there are thousands of IDs. The code first performs the required join, then transfers the LINQ join results into the existing Table_1 (copies the columns). I have measured the timings: the LINQ query itself is always very fast (0 ms), so the data transfer is the problem.
List<int> ids = Table_2.AsEnumerable().Select(s => s.Field<int>("ID")).Distinct().ToList();
for (int i = 0; i < ids.Count; i++)
{
    Table_1.Columns.Add($"A_{ids[i]}", typeof(double));
    Table_1.Columns.Add($"B_{ids[i]}", typeof(double));
    Table_1.Columns.Add($"C_{ids[i]}", typeof(double));
}
for (int i = 0; i < ids.Count; i++)
{
    // LINQ join (fast)
    var joinedTables = from T1 in Table_1.AsEnumerable()
                       join T2 in Table_2.Select($"ID = {ids[i]}").AsEnumerable()
                       on (String)T1["Date"] equals (String)T2["Date"]
                       into T1_and_T2
                       from TT in T1_and_T2.DefaultIfEmpty()
                       select new
                       {
                           Date = (String)T1["Date"],
                           A = TT != null ? (double)TT["A"] : 0.0,
                           B = TT != null ? (double)TT["B"] : 0.0,
                           C = TT != null ? (double)TT["C"] : 0.0,
                       };
    // data transfer (very slow)
    for (int day = 0; day < joinedTables.Count(); day++)
    {
        Table_1.Rows[day][$"A_{ids[i]}"] = joinedTables.ElementAt(day).A;
        Table_1.Rows[day][$"B_{ids[i]}"] = joinedTables.ElementAt(day).B;
        Table_1.Rows[day][$"C_{ids[i]}"] = joinedTables.ElementAt(day).C;
    }
}
Instead of the data-transfer version above I have also tried another way, but it is just as slow:
int day = 0;
foreach (var row in joinedTables)
{
    Table_1.Rows[day][$"A_{ids[i]}"] = row.A;
    Table_1.Rows[day][$"B_{ids[i]}"] = row.B;
    Table_1.Rows[day][$"C_{ids[i]}"] = row.C;
    day++;
}
Note: I am also open to new approaches for collecting data from Table_2 into Table_1. There could be a way to use built-in functions (written in C or C++) that run as native code (for example, a function that copies whole columns from one table to another, like in Python) in order to avoid looping through rows.
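To make the Python analogy concrete, here is a minimal stdlib-only sketch of the column-copying pivot I have in mind, using the sample data above (the variable names are just for illustration):

```python
# Sketch of the desired pivot: Table_2 keyed by (Date, ID), spread into
# per-ID columns next to Table_1's existing data.
table_1 = [("2021-06-01", 23.7), ("2021-06-02", 12.6), ("2021-06-03", 47.9)]
table_2 = [
    ("2021-06-02", 4, 21, 23, 13),
    ("2021-06-02", 3, 67, 31, 25),
    ("2021-06-01", 3, 45, 54, 33),
    ("2021-06-03", 3, 71, 28, 51),
    ("2021-06-03", 4, 26, 83, 24),
]

ids = sorted({row[1] for row in table_2})
# Index Table_2 by (Date, ID) for O(1) lookup instead of per-ID scans.
by_key = {(d, i): (a, b, c) for d, i, a, b, c in table_2}

result = []
for date, stat in table_1:
    row = {"Date": date, "Some_Stat": stat}
    for i in ids:
        a, b, c = by_key.get((date, i), (0, 0, 0))  # 0-fill missing (Date, ID)
        row[f"A_{i}"], row[f"B_{i}"], row[f"C_{i}"] = a, b, c
    result.append(row)
```

This reproduces the "Table_1 after join" example above, including the zero-filled cells for ID 4 on 2021-06-01.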
CodePudding user response:
My recommendation is to use a different approach. A `DataTable` is not a particularly fast object, and looking up columns to set values is slow. Creating a new `DataTable` to replace `Table_1` can be much faster, since you can use the `DataRowCollection.Add()` method to quickly add rows. Using a `Dictionary` to convert `Table_2` also allows much faster lookups than `ElementAt`.
var joinDict = (from T2 in Table_2.AsEnumerable()
select new {
Date = T2.Field<string>("Date"),
ID = T2.Field<int>("ID"),
A = T2.Field<double>("A"),
B = T2.Field<double>("B"),
C = T2.Field<double>("C"),
})
.ToDictionary(t2 => (t2.Date, t2.ID));
List<int> ids = Table_2.AsEnumerable().Select(s => s.Field<int>("ID")).Distinct().OrderBy(x => x).ToList();
var ans = Table_1.Clone();
for (int i = 0; i < ids.Count; i++) {
    ans.Columns.Add($"A_{ids[i]}", typeof(double));
    ans.Columns.Add($"B_{ids[i]}", typeof(double));
    ans.Columns.Add($"C_{ids[i]}", typeof(double));
}
foreach (DataRow row in Table_1.Rows) {
    var date = row.Field<string>("Date");
    var newRow = new List<object> { date, row["Some_Stat"] };
    foreach (var id in ids) {
        if (joinDict.TryGetValue((date, id), out var t2))
            newRow.AddRange(new object[] { t2.A, t2.B, t2.C });
        else
            newRow.AddRange(new object[] { 0.0, 0.0, 0.0 });
    }
    ans.Rows.Add(newRow.ToArray());
}
Table_1 = ans;
In testing with 100 days in `Table_1` and 500 rows per day in `Table_2` (75% populated), I get about a 128x speedup.
CodePudding user response:
You could use the `ToLookup` LINQ operator to create an efficient `Lookup<DateTime, DataRow>` structure from the contents of `Table_2`. This read-only structure will contain one `IGrouping<DateTime, DataRow>` for each unique date, and each grouping will contain all the `DataRow`s associated with that date:
var lookup = Table_2.AsEnumerable().ToLookup(r => r.Field<DateTime>("Date"));
Then for every row of `Table_1` you can quickly find all the associated rows of `Table_2`:
foreach (DataRow row1 in Table_1.Rows)
{
DateTime date = row1.Field<DateTime>("Date");
IEnumerable<DataRow> group = lookup[date];
if (group == null) continue;
foreach (DataRow row2 in group)
{
int id = row2.Field<int>("ID");
row1[$"A_{id}"] = row2.Field<double>("A");
row1[$"B_{id}"] = row2.Field<double>("B");
row1[$"C_{id}"] = row2.Field<double>("C");
}
}
Update: It seems that your performance problem is not associated with joining the two `DataTable`s, but with updating an extremely wide `DataTable` that contains hundreds or even thousands of `DataColumn`s. Apparently `DataTable`s are not optimized for scenarios like this: the complexity of updating the Nth column of a `DataRow` is O(n²). To overcome this problem you can export all the values stored in a `DataRow` through the `ItemArray` property, manipulate these values, and finally import them back through the same property. Performance can be improved even further by using the `BeginLoadData` and `EndLoadData` methods.
Table_1.BeginLoadData();
foreach (DataRow row1 in Table_1.Rows)
{
DateTime date = row1.Field<DateTime>("Date");
IEnumerable<DataRow> group = lookup[date];
if (group == null) continue;
object[] values = row1.ItemArray; // Export the raw data
foreach (DataRow row2 in group)
{
int id = row2.Field<int>("ID");
values[Table_1.Columns[$"A_{id}"].Ordinal] = row2.Field<double>("A");
values[Table_1.Columns[$"B_{id}"].Ordinal] = row2.Field<double>("B");
values[Table_1.Columns[$"C_{id}"].Ordinal] = row2.Field<double>("C");
}
row1.ItemArray = values; // Import the updated data
}
Table_1.EndLoadData();