Importing a 1.3GB CSV file into sqlite via EF Core


The CSV file

I have a CSV file that's about 1.3 GB in size:

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a----         10/4/2021   1:23 PM     1397998768 XBTUSD.csv

It contains the complete trade history for Bitcoin (XBT/USD) on the Kraken exchange.

Here's what the data in the CSV looks like (the columns are Unix timestamp in seconds, price, and volume):

> Get-Content .\XBTUSD.csv | Select-Object -First 10
1381095255,122.00000,0.10000000
1381179030,123.61000,0.10000000
1381201115,123.91000,1.00000000
1381201115,123.90000,0.99160000
1381210004,124.19000,1.00000000
1381210004,124.18000,1.00000000
1381311039,124.01687,1.00000000
1381311093,124.01687,1.00000000
1381311094,123.84000,0.82300000
1381431835,125.85000,1.00000000

More information about the file is available here:

https://support.kraken.com/hc/en-us/articles/360047543791-Downloadable-historical-market-data-time-and-sales

The file can be downloaded from here:

https://drive.google.com/drive/folders/1jI3mZvrPbInNAEaIOoMbWvFfgRDZ44TT

See the file XBT.zip. Inside that archive is XBTUSD.csv.

Baseline test - importing directly into sqlite

If I create the following table in sqlite:

CREATE TABLE CsvTrades (
    "TimeStamp" TEXT NOT NULL,
    "Price"     TEXT NOT NULL,
    "Volume"    TEXT NOT NULL
);

and run the following to import the CSV (as well as time how long it takes):

$a = Get-Date

sqlite3.exe .\kraken-trades.db -cmd '.mode csv' '.import C:/Users/dharm/XBTUSD.csv CsvTrades'

$b = Get-Date

($b - $a).TotalMinutes

I get the following:

1.56595191666667

1.5 minutes. Not bad!

Using EF Core

In the code below, I'm using the CsvHelper package:

https://joshclose.github.io/CsvHelper/getting-started/

Here's a class for the CSV file rows:

public class CsvRow
{
    [CsvHelper.Configuration.Attributes.Index(0)]
    public long TimeStamp { get; set; }

    [CsvHelper.Configuration.Attributes.Index(1)]
    public decimal Price { get; set; }

    [CsvHelper.Configuration.Attributes.Index(2)]
    public decimal Quantity { get; set; }
}

Here's a class for the Trade entity (the [Index] attribute is Microsoft.EntityFrameworkCore.IndexAttribute, available since EF Core 5):

[Index(nameof(TimeStamp))]
public class Trade
{
    public int Id { get; set; }
    public decimal Price { get; set; }
    public decimal Quantity { get; set; }
    public DateTime TimeStamp { get; set; }
}

The DbContext is straightforward:

public class AppContext : DbContext
{
    public DbSet<Trade> Trades { get; set; }

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
    {
        var folder = Environment.SpecialFolder.LocalApplicationData;

        var path = Environment.GetFolderPath(folder);

        var db_path = System.IO.Path.Combine(path, "kraken-trades.db");
                
        optionsBuilder.UseSqlite($"Data Source={db_path}");
    }
}
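
One thing the snippet above doesn't show is schema creation: the Trades table has to exist before the import runs. The linked project takes care of this; as a minimal self-contained sketch (assuming migrations aren't in play), something like this works:

using (var db = new AppContext())
{
    // Creates kraken-trades.db with the Trades table (and its TimeStamp index)
    // if they don't already exist.
    db.Database.EnsureCreated();
}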

And finally, the function that performs the import:

void initialize_from_csv()
{
    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        HasHeaderRecord = false
    };

    using (var reader = new StreamReader(@"C:\Users\dharm\XBTUSD.csv"))
    using (var csv = new CsvReader(reader, config))
    {
        // GetRecords streams rows lazily from the shared CsvReader, so each
        // Take() call in the loop below continues from the current position in the file.
        var records = csv.GetRecords<CsvRow>().Select(row => new Trade()
        {
            Price = row.Price,
            Quantity = row.Quantity,
            TimeStamp = DateTimeOffset.FromUnixTimeSeconds(row.TimeStamp).UtcDateTime
        });

        // Note: one AppContext instance is reused for every batch, so tracked
        // entities accumulate in the change tracker as the import proceeds.
        using (var db = new AppContext())
        {
            Console.WriteLine(DateTime.Now);
                        
            while (true)
            {
                //var items = records.Take(10_000).ToList();

                var items = records.Take(100_000).ToList();

                if (!items.Any()) break;

                Console.WriteLine("{0:yyyy-MM-dd}", items[0].TimeStamp);

                db.AddRange(items);
                db.SaveChanges();
            }

            Console.WriteLine(DateTime.Now);
        }
    }
}

The question

When I let this run, it does indeed add the items to the database. However, it's pretty slow; I haven't timed it to completion, but it looked like it could take over an hour.

Is there a good way to make this faster while still using EF Core?

Notes

The code referenced above is available here in a single file:

https://github.com/dharmatech/kraken-trades-database/blob/003-minimal/KrakenTradesDatabase/Program.cs

It's a .NET 6 project. Let me know if you have any issues building and running it.

Timing

I added some code to time the batch adds. It looks like it's taking around 7 seconds per 100,000 records.

Starting batch at 2013-10-06. Batch took 00:00:08.7689932.
Starting batch at 2015-12-08. Batch took 00:00:06.7453421.
Starting batch at 2016-04-19. Batch took 00:00:06.7833506.
Starting batch at 2016-06-25. Batch took 00:00:06.7083806.
Starting batch at 2016-08-22. Batch took 00:00:06.7826717.
Starting batch at 2016-11-20. Batch took 00:00:06.4212123.
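
For reference, the timing code itself isn't shown above; it's essentially a Stopwatch wrapped around each batch, along these lines (a sketch; the actual code is in the linked repo):

while (true)
{
    var items = records.Take(100_000).ToList();

    if (!items.Any()) break;

    var sw = System.Diagnostics.Stopwatch.StartNew();

    Console.Write("Starting batch at {0:yyyy-MM-dd}. ", items[0].TimeStamp);

    db.AddRange(items);
    db.SaveChanges();

    Console.WriteLine("Batch took {0}.", sw.Elapsed);
}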

wc says there are 41,695,261 lines:

$ wc -l XBTUSD.csv
41695261 XBTUSD.csv

So at that rate (41,695,261 rows / 100,000 per batch ≈ 417 batches, at ~7 seconds each ≈ 2,900 seconds), it would take around 48 minutes.

Why EF Core?

Some folks have asked, why use EF Core for this? Why not just a direct import?

The example above was intentionally simplified to focus on the speed of the import.

I have more elaborate versions where there are relationships with other entities. In that case:

  • Setting up the other tables and foreign key properties is more straightforward with EF Core.

  • I can more easily switch between database backends (SQL Server, PostgreSQL, sqlite).

See, for example, this branch, where multiple symbols are imported; there's a relationship between Trade and Symbol there, and there could be other relationships as well. A rough sketch of that shape follows the link below.

https://github.com/dharmatech/kraken-trades-database/blob/006/KrakenTradesDatabase/Program.cs
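
For a rough idea of the shape in that branch, the entities look something like this (a sketch; names may differ from the actual code):

public class Symbol
{
    public int Id { get; set; }
    public string Name { get; set; }            // e.g. "XBTUSD"
    public List<Trade> Trades { get; set; }
}

public class Trade
{
    public int Id { get; set; }
    public int SymbolId { get; set; }           // foreign key; EF Core wires it up by convention
    public Symbol Symbol { get; set; }
    public decimal Price { get; set; }
    public decimal Quantity { get; set; }
    public DateTime TimeStamp { get; set; }
}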

CodePudding user response:

EFCore.BulkExtensions

Using the following:

https://github.com/borisdj/EFCore.BulkExtensions

and then changing this line:

db.AddRange(items);

to:

db.BulkInsert(items);

makes the import go from 48 minutes to 5.7 minutes.
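
In context, the batch loop from the question is unchanged apart from that one line (a sketch, assuming the same 100,000-row batches):

while (true)
{
    var items = records.Take(100_000).ToList();

    if (!items.Any()) break;

    // BulkInsert bypasses the EF Core change tracker, which is where most of the
    // AddRange/SaveChanges time goes.
    db.BulkInsert(items);
}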

This version of the project is available here:

https://github.com/dharmatech/kraken-trades-database/blob/004-bulk-extensions/KrakenTradesDatabase/Program.cs

Thanks

Thanks to Caius Jard, who suggested EFCore.BulkExtensions in a comment on the question.

CodePudding user response:

Here is a complete C# (10.0) program that inserts the CSV data faster than the sqlite3 tool. It uses my Sylvan.Data.Csv library, which is the fastest CSV parser for .NET.

On my machine sqlite3 will insert the data in 1:07.6, and my code inserts in 1:02.9.

While this doesn't meet your requirement of "still using EF Core", I think the difference in performance speaks for itself.

Packages:

<PackageReference Include="Sylvan.Data.Csv" Version="1.1.9" />
<PackageReference Include="System.Data.SQLite" Version="1.0.115.5" />

Code:

using System.Collections.ObjectModel;
using System.Data.Common;
using System.Data.SQLite;
using System.Diagnostics;
using Sylvan.Data.Csv;

var sw = Stopwatch.StartNew();

var conn = new SQLiteConnection("Data Source=test.db");
conn.Open();

var data = CsvDataReader.Create("xbtusd.csv", new CsvDataReaderOptions { HasHeaders = false });


// create the target table
{
    using var cmd = conn.CreateCommand();
    var tbl = "create table CsvTrades (TimeStamp TEXT NOT NULL, Price TEXT NOT NULL, Volume TEXT NOT NULL)";
    cmd.CommandText = tbl;
    cmd.ExecuteNonQuery();
}

// get the schema for the target table.
ReadOnlyCollection<DbColumn> ss;
{
    using var cmd = conn.CreateCommand();
    cmd.CommandText = "select * from CsvTrades limit 0;";
    var r = cmd.ExecuteReader();
    ss = r.GetColumnSchema();
}

// create the parameterized insert command
var cmdW = new StringWriter();
cmdW.Write("insert into CsvTrades values(");
int i = 0;
foreach (var c in ss)
{
    if (i > 0)
        cmdW.Write(",");
    cmdW.Write("$p"   i  );
}

cmdW.Write(");");
var cmdt = cmdW.ToString();

// insert CSV data.
using (var tx = conn.BeginTransaction())
{
    var cmd = conn.CreateCommand();
    cmd.CommandText = cmdt;
    for (i = 0; i < data.FieldCount; i++)
    {
        var p = cmd.CreateParameter();
        p.ParameterName = "$p" + i;
        cmd.Parameters.Add(p);
    }
    cmd.Prepare();
    while (data.Read())
    {
        for (i = 0; i < data.FieldCount; i++)
        {
            cmd.Parameters[i].Value = data.GetValue(i);
        }
        cmd.ExecuteNonQuery();
    }

    tx.Commit();
}

sw.Stop();
Console.WriteLine($"Inserted {data.RowNumber} records in {sw.Elapsed}");

UPDATE: I realized that my code isn't doing everything it needs to: it wasn't converting the Unix seconds to a DateTime. Modifying the insert loop as follows:

    while (data.Read())
    {
        cmd.Parameters[0].Value = DateTime.UnixEpoch.AddSeconds(data.GetInt64(0));
        cmd.Parameters[1].Value = data.GetString(1);
        cmd.Parameters[2].Value = data.GetString(2);
        cmd.ExecuteNonQuery();
    }

This slows things down to 1:17.5, marginally slower than sqlite3. But then, the sqlite3 import isn't doing the data conversion, so you'd end up with the original integer (long) value in the TimeStamp column.
