Is it worth opening multiple database connections for SQL insertion?


I am writing a project that does mass data fetching. Currently I am using .NET Framework 4.8 and the MySQL package to open a connection and insert data into the database server.

I am going to insert around 400,000 rows/second. I am concerned that the SQL connection may become the bottleneck of my program. I would like to know: if I opened multiple connections on multiple threads and inserted the data through a consumer queue, would it be faster, and is it worth it (pros and cons)?

My intuition says it would be faster, but I am not sure how much performance it would actually gain relative to the overhead of the threads. I am not a SQL expert, so it would be nice if someone could explain the pros and cons of opening multiple connections to a SQL server on multiple threads.

CodePudding user response:

Rumors, opinions, hearsay, facts, version-dependent benchmarks, some personal experience, etc...

Multiple threads will improve throughput, but with limits:

  • Throughput tops out at about half of the theoretical linear scaling; N threads do not deliver N times the single-thread rate. (This is based on benchmarks of a multi-threaded loading package; I forget the name; it was a decade ago.)
  • Multiple threads will compete with each other over Mutexes and other necessary locking mechanisms.
  • As of about MySQL 5.7, 64 threads was the practical ceiling for multi-threading; above that, throughput stalled or even dropped, while per-thread latency went through the roof. (Source: many Oracle benchmarks bragging about how much better one version was than the previous.)
  • Each thread should batch the data if possible; a sketch of that per-thread consumer-queue pattern follows this list.
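
Here is a minimal sketch of that pattern in C#, assuming the MySql.Data connector and a hypothetical table readings(sensor_id INT, value DOUBLE); the table name, batch size, and worker count are all illustrative, not a definitive implementation. Each worker owns its own connection, since a MySqlConnection must not be shared across threads:

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading;
    using MySql.Data.MySqlClient;

    class IngestWorkers
    {
        const int WorkerCount = 8;     // tune; note the ~64-thread ceiling above
        const int BatchSize   = 500;   // "hundreds" of rows per INSERT (see Batching)

        // Bounded queue: Add() blocks when full, giving back-pressure on the producer.
        static readonly BlockingCollection<(int SensorId, double Value)> RowQueue =
            new BlockingCollection<(int SensorId, double Value)>(boundedCapacity: 100_000);

        static void Main()
        {
            var workers = new List<Thread>();
            for (int i = 0; i < WorkerCount; i++)
            {
                var t = new Thread(ConsumeLoop);
                t.Start();
                workers.Add(t);
            }

            // Producer side: whatever fetches the mass data calls RowQueue.Add(row).
            // When the feed ends, let the workers drain the queue and exit:
            RowQueue.CompleteAdding();
            foreach (var w in workers) w.Join();
        }

        static void ConsumeLoop()
        {
            // One connection per thread; connections are not thread-safe.
            using (var conn = new MySqlConnection("...connection string..."))
            {
                conn.Open();
                var batch = new List<(int SensorId, double Value)>(BatchSize);
                foreach (var row in RowQueue.GetConsumingEnumerable())
                {
                    batch.Add(row);
                    if (batch.Count >= BatchSize)
                    {
                        FlushBatch(conn, batch);
                        batch.Clear();
                    }
                }
                if (batch.Count > 0) FlushBatch(conn, batch);  // final partial batch
            }
        }

        static void FlushBatch(MySqlConnection conn, List<(int SensorId, double Value)> batch)
        {
            // One multi-row INSERT per batch; sketched under "Batching" below.
        }
    }

The bounded queue matters at 400K rows/second: if the workers fall behind, back-pressure blocks the producer instead of exhausting memory.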

Batching:

  • LOAD DATA is the fastest way to INSERT lots of rows from a single thread in a single statement. But if you include the cost of writing the file to LOAD, it may be effectively slower than batched inserting. (A MySqlBulkLoader sketch appears after the "Source of data" list below.)
  • Batched INSERT is a close second. But it caps out at "hundreds" of rows per statement, when it hits either some limit (such as max_allowed_packet) or diminishing returns.
  • Batched inserting is about 10 times as fast as inserting one row per INSERT statement, so it (or LOAD DATA) is worth using for high-speed ingestion. (Source: many different timed tests.) A sketch of building such a statement follows this list.
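
As a rough sketch of what a batched INSERT looks like from the .NET side (same hypothetical readings table as above; not the only way to build it), the whole batch goes to the server as one multi-row statement:

    using System.Collections.Generic;
    using System.Text;
    using MySql.Data.MySqlClient;

    static class BatchInsert
    {
        // Sends one multi-row INSERT for the whole batch. Keep batches in the
        // "hundreds" so the statement stays under max_allowed_packet.
        public static void FlushBatch(MySqlConnection conn,
                                      List<(int SensorId, double Value)> batch)
        {
            var sql = new StringBuilder("INSERT INTO readings (sensor_id, value) VALUES ");
            using (var cmd = new MySqlCommand { Connection = conn })
            {
                for (int i = 0; i < batch.Count; i++)
                {
                    if (i > 0) sql.Append(',');
                    sql.Append("(@s").Append(i).Append(",@v").Append(i).Append(')');
                    cmd.Parameters.AddWithValue("@s" + i, batch[i].SensorId);
                    cmd.Parameters.AddWithValue("@v" + i, batch[i].Value);
                }
                cmd.CommandText = sql.ToString();
                cmd.ExecuteNonQuery();   // one statement, one round trip, N rows
            }
        }
    }

Parameters keep the values properly escaped; the per-statement cost (parse, round trip, commit) is amortized over the whole batch, which is where the ~10x comes from.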

Source of data:

  • Some data sources necessarily deliver only one row at a time (e.g., sensor data from vehicles every N seconds). This calls for some intermediate layer to batch the data.
  • A discussion of gathering data: http://mysql.rjweb.org/doc.php/staging_table (a LOAD DATA sketch follows this list).
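
If you go the file-plus-LOAD DATA route, MySql.Data exposes it as MySqlBulkLoader. A hedged sketch, assuming a hypothetical readings_staging table, a matching CSV layout, and a server configured to permit local_infile:

    using MySql.Data.MySqlClient;

    static class StagingLoad
    {
        // Loads a CSV file into a staging table via LOAD DATA LOCAL INFILE.
        // Requires local_infile enabled on the server and, with recent connector
        // versions, AllowLoadLocalInfile=true in the connection string.
        public static int LoadCsvIntoStaging(MySqlConnection conn, string csvPath)
        {
            var loader = new MySqlBulkLoader(conn)
            {
                TableName       = "readings_staging",  // hypothetical staging table
                FileName        = csvPath,
                FieldTerminator = ",",
                LineTerminator  = "\n",
                Local           = true
            };
            return loader.Load();   // returns the number of rows loaded
        }
    }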

What happens after loading the data? Surely this is not a write-only-never-read table.

  • Normalization is useful for shrinking the disk footprint; this is best done in batches. See Normalization
  • PARTITIONing is rarely useful, except for eventual purging of old data. See Partition
  • A huge 'Fact' table is hard to search; consider building Summary data as you ingest the data (see Summary Tables, and the sketch after this list).
  • It may even be practical to do the above processing, then toss the raw data. It sounds like you might be acquiring a terabyte of data per day.
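
For the summary-table idea, here is a hedged sketch of one way to roll raw rows up in batches; the readings_hourly table, its unique key on (sensor_id, hr), and the created_at column on the raw table are all assumptions for illustration:

    using System;
    using MySql.Data.MySqlClient;

    static class Summaries
    {
        // Rolls one hour of raw rows up into a summary table. Assumes a
        // hypothetical readings_hourly table with a unique key on (sensor_id, hr)
        // and a created_at column on the raw readings table.
        public static void SummarizeHour(MySqlConnection conn, DateTime hourStart)
        {
            const string sql = @"
                INSERT INTO readings_hourly (sensor_id, hr, cnt, sum_value)
                SELECT sensor_id, @hr, COUNT(*), SUM(value)
                FROM readings
                WHERE created_at >= @hr
                  AND created_at <  @hr + INTERVAL 1 HOUR
                GROUP BY sensor_id
                ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt),
                                        sum_value = sum_value + VALUES(sum_value)";
            using (var cmd = new MySqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@hr", hourStart);
                cmd.ExecuteNonQuery();
            }
        }
    }

Once the summary rows exist, the huge raw table can be purged (or dropped a PARTITION at a time) without losing the ability to report on the data.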