I am writing a project related to mass data fetching. Currently I am using .NET Framework 4.8 and the MySql package to open a connection and insert data into the database server.
I am going to insert around 400,000 lines/second. I am concerned that the SQL connection may become the bottleneck of my program. I would like to know: if I create multithreaded connections to the SQL server and insert the data using a consumer queue, would it be faster, and is it worth it (pros and cons)?
My intuition says it would be faster, but I am not sure how much performance it provides relative to the overhead of the threads. I am not a SQL expert, so it would be nice if someone could explain the pros and cons of opening multiple connections to a SQL server on multiple threads.
CodePudding user response:
Rumors, opinions, hearsay, facts, version-dependent benchmarks, some personal experience, etc...
Multiple threads will improve throughput, but with limits:
- Throughput is capped at about half the theoretical limit. (Your "certain percentage") (This is based on benchmarks from a multi-threaded package; I forget the name; it was a decade ago.)
- Multiple threads will compete with each other over Mutexes and other necessary locking mechanisms.
- As of about MySQL 5.7, 64 threads was the practical limit for multi-threaded writes; above that, throughput stalled or even dropped. (Source: Many Oracle benchmarks bragging about how much better one version was than the previous.) (Meanwhile, latency for each thread went through the roof.)
- Each thread should batch the data if possible (a rough sketch of such a pipeline follows this list).
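To make that concrete, here is a minimal sketch of a bounded producer/consumer queue feeding a small pool of worker threads, each holding its own MySqlConnection and flushing rows in batches. Everything specific in it (the Row type, table/column names, thread count, batch size, connection string) is invented for illustration; only the overall shape matters, and the body of the batched insert itself is sketched further down.

```csharp
// Rough sketch only: bounded queue -> fixed pool of workers -> per-thread
// MySqlConnection -> batched flushes. All names/sizes here are invented.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using MySql.Data.MySqlClient;

class Row { public DateTime Ts; public int SensorId; public double Value; }

static class Ingestor
{
    // Bounded so a slow database pushes back on the fetcher instead of
    // letting the queue eat all available memory.
    static readonly BlockingCollection<Row> Queue = new BlockingCollection<Row>(100000);

    static void Main()
    {
        const string connStr = "server=...;database=...;uid=...;pwd=...";
        const int workerCount = 8;            // keep this well below ~64 (see above)

        var workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
            workers[i] = Task.Run(() => Consume(connStr));

        // Demo producer; in the real program the data-fetching code calls Queue.Add.
        for (int n = 0; n < 5000; n++)
            Queue.Add(new Row { Ts = DateTime.UtcNow, SensorId = n % 100, Value = n });

        Queue.CompleteAdding();               // lets the consumers drain and exit
        Task.WaitAll(workers);
    }

    static void Consume(string connStr)
    {
        const int batchSize = 500;            // "hundreds" per the notes above
        var batch = new List<Row>(batchSize);
        using (var conn = new MySqlConnection(connStr))
        {
            conn.Open();
            foreach (var row in Queue.GetConsumingEnumerable())
            {
                batch.Add(row);
                if (batch.Count >= batchSize) { Flush(conn, batch); batch.Clear(); }
            }
            if (batch.Count > 0) Flush(conn, batch);   // drain the tail
        }
    }

    // Placeholder: one multi-row INSERT per call; see the batching sketch below.
    static void Flush(MySqlConnection conn, List<Row> batch) { }
}
```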
Batching:
- LOAD DATA is the fastest way to INSERT lots of rows from a single thread at one time. But if you include the cost of writing the file to LOAD, that may make it effectively slower than batched inserting.
- Batched INSERT is a close second. But it caps out at "hundreds" of rows, when it hits either some limit or "diminishing returns".
- Batched inserting is 10 times as fast as inserting one row per INSERT query. So it (or LOAD DATA) is worth using for high-speed ingestion. (Source: many different timed tests.)
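To make "batched INSERT" concrete, here is a rough sketch of what the Flush placeholder from the earlier sketch might look like: one statement, many VALUES groups, all parameterized, one round trip per batch. Table and column names are invented, and it reuses the Row type from the sketch above. (If you go the LOAD DATA route instead, you would write the batch to a file and run LOAD DATA LOCAL INFILE, which has to be enabled on both the client and the server.)

```csharp
// Rough sketch of a batched INSERT: one statement, many VALUES groups, all
// parameterized. Table/column names are invented; tune batchSize by measuring.
using System.Collections.Generic;
using System.Text;
using MySql.Data.MySqlClient;

static class BatchWriter
{
    public static void Flush(MySqlConnection conn, List<Row> batch)
    {
        if (batch.Count == 0) return;

        var sql = new StringBuilder("INSERT INTO readings (ts, sensor_id, value) VALUES ");
        using (var cmd = new MySqlCommand { Connection = conn })
        {
            for (int i = 0; i < batch.Count; i++)
            {
                if (i > 0) sql.Append(',');
                sql.AppendFormat("(@t{0},@s{0},@v{0})", i);
                cmd.Parameters.AddWithValue("@t" + i, batch[i].Ts);
                cmd.Parameters.AddWithValue("@s" + i, batch[i].SensorId);
                cmd.Parameters.AddWithValue("@v" + i, batch[i].Value);
            }
            cmd.CommandText = sql.ToString();
            cmd.ExecuteNonQuery();            // one round trip for the whole batch
        }
    }
}
```

Note that the whole statement must fit in max_allowed_packet, which is one of the limits the "hundreds of rows" cap alludes to.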
Source of data:
- Some data sources necessarily deliver only one row at a time (e.g., sensor data from vehicles every N seconds). This begs for some intermediate layer that batches the data (a rough sketch of one approach follows this list).
- A discussion of gathering data: http://mysql.rjweb.org/doc.php/staging_table
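I have not reproduced the linked article here, but the gist (as I read it) is a "staging" table: fast single-row writers insert into a small staging table, and a background job periodically swaps it for an empty one and moves the drained rows onward in one bulk statement. A very rough sketch, with invented table names; see the article for the full recipe:

```csharp
// Very rough version of the staging-table idea: ingest lands in a small
// `staging` table; periodically swap it for an empty one (RENAME TABLE is
// atomic) and bulk-copy the drained rows into the main table.
using MySql.Data.MySqlClient;

static class StagingFlipper
{
    public static void FlipAndDrain(MySqlConnection conn)
    {
        Exec(conn, "CREATE TABLE staging_next LIKE staging");
        // Atomic swap: new writes now land in the fresh, empty `staging`.
        Exec(conn, "RENAME TABLE staging TO staging_drain, staging_next TO staging");
        // Move the drained rows in one bulk statement, then discard the holder.
        Exec(conn, "INSERT INTO readings SELECT * FROM staging_drain");
        Exec(conn, "DROP TABLE staging_drain");
    }

    static void Exec(MySqlConnection conn, string sql)
    {
        using (var cmd = new MySqlCommand(sql, conn)) cmd.ExecuteNonQuery();
    }
}
```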
What happens after loading the data? Surely this is not a write-only-never-read table.
- Normalization is useful for shrinking the disk footprint; this is best done in batches. See Normalization
- PARTITIONing is rarely useful, except for eventual purging of old data. See Partition
- A huge 'Fact' table is hard to search; consider building Summary data as you ingest the data: Summary Tables (a sketch of one way to do this follows this list)
- It may even be practical to do the above processing, then toss the raw data. At 400K rows/second you are looking at roughly 35 billion rows per day; even at a few dozen bytes per row, that is on the order of a terabyte of data per day.
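For the summary-table idea, one common MySQL pattern is an upsert into a small pre-aggregated table as each batch goes in, so reports never have to scan the giant fact table. A sketch, assuming an hourly rollup per sensor with a UNIQUE key on (sensor_id, hr); all names are invented:

```csharp
// Illustrative only: bump an hourly summary row per sensor while ingesting.
// Assumes readings_hourly has a UNIQUE key on (sensor_id, hr).
using System;
using MySql.Data.MySqlClient;

static class SummaryWriter
{
    public static void Bump(MySqlConnection conn, int sensorId, DateTime ts, double value)
    {
        const string sql =
            "INSERT INTO readings_hourly (sensor_id, hr, reading_count, reading_sum) " +
            "VALUES (@sensor, @hr, 1, @value) " +
            "ON DUPLICATE KEY UPDATE " +
            "  reading_count = reading_count + 1, " +
            "  reading_sum   = reading_sum + @value";
        using (var cmd = new MySqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@sensor", sensorId);
            cmd.Parameters.AddWithValue("@hr", new DateTime(ts.Year, ts.Month, ts.Day, ts.Hour, 0, 0));
            cmd.Parameters.AddWithValue("@value", value);
            cmd.ExecuteNonQuery();
        }
    }
}
```

At this volume you would not call this once per row; aggregate each batch in memory first and do one upsert per (sensor, hour) group.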