I am trying to write code that inserts records into a SQL database.
The data I'll be inserting arrives in real time (at least 10,000 and up to 80,000 records per minute), and one crucial requirement is that the exact time each record was received is also recorded.
I understand that a single thread handling that much data might overload the core, so I'm trying to figure out a better way to write it.
From searching the web, this is what I came up with.
- Have a main thread receive the data and record the time of receipt.
- Divide the received data among multiple threads.
- Have those threads insert into a single DB in batches.
As I am not familiar with multithreaded code, I am not sure how to implement this, or whether it is even OK. If it is not, what would be a better way? If it is, what methods or classes should I use to execute each step?
CodePudding user response:
This is what queues can be used for. Keep an in-memory queue that you add to, and have a service that processes the entries in the order they are received to ensure the sequence is correct.
The service can then process them on a separate thread. You can drive this from a timed event; since you receive records per minute, you could process them every minute if you want.
If you need a basic example I suppose I could make one, but this is one option.
UPDATE: here is an example from an auction app that does something similar. It is a basic example from a small-scale demo auction application. It too has bottlenecks, but it ensures all records are processed as they are received, allowing the true winning bid to always go through.
public Queue<AuctionBidQueue> AuctionBidQueueList = new Queue<AuctionBidQueue>(); // AuctionBidQueue is a class describing the incoming data

public void PopulateAuctionBidQueue()
{
    // Only add bids from the AuctionBidQueue table if the queue is not already processing.
    if (QueueProcessing == false)
    {
        // Lock the queue - this is done to prevent another thread or process from
        // processing the queue. In my scenario this could happen, maybe not for you.
        LockAuctionBidQueue();

        // Collect the pending bids to add to AuctionBidQueueList.
        List<AuctionBidQueue> auctionBidQueue = new List<AuctionBidQueue>();
        try
        {
            // I call the DB here to get bids, but this is where you'd take your 10k-80k records and put them into a list.
            auctionBidQueue = _context.AuctionBidQueue
                .Where(x => x.Active == true && x.IsProcessed == false)
                .OrderBy(x => x.CreateTimeInTicks)
                .ToList();
        }
        catch (Exception error)
        {
            Console.WriteLine(error);
            UnlockAuctionBidQueue(); // release the lock before bailing out
            return;
        }

        if (auctionBidQueue.Count != 0)
        {
            foreach (AuctionBidQueue abq in auctionBidQueue)
            {
                AuctionBidQueueList.Enqueue(abq); // add bids to the in-memory queue
            }
            // Process the bid queue.
            ProcessAuctionBidQueue();
            // Clear the auction bid queue list.
            ClearAuctionBidQueue();
        }

        // Unlock the queue so it can be processed again.
        UnlockAuctionBidQueue();
    }
}
You can then have a task that processes the list. In my example bids are processed every 200 ms, so since you only get records every minute, you could increase the timer interval so it doesn't run as frequently.
// Fires every 200 ms; each tick resolves a fresh DbContext from the scope factory.
AuctionEngineTimer = new Timer(o =>
{
    using (var scope = _scopeFactory.CreateScope())
    {
        var dbContext = scope.ServiceProvider.GetRequiredService<ApplicationDbContext>();
        AuctionEngine auctionEngine = new AuctionEngine(_hostEnv, _scopeFactory, _hubContext, dbContext);
        auctionEngine.PopulateAuctionBidQueue();
    }
},
null,
TimeSpan.Zero,
TimeSpan.FromMilliseconds(200));
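Applied back to the original question, here is a minimal sketch of the same pattern (assuming SQL Server with Microsoft.Data.SqlClient; IncomingRecord, dbo.IncomingRecords, and the batch size of 5,000 are all hypothetical placeholders): the receiving thread stamps each item with the arrival time and enqueues it, while a background consumer drains the queue and writes batches with SqlBulkCopy.

using System;
using System.Collections.Concurrent;
using System.Data;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public class IncomingRecord // hypothetical shape of one received item
{
    public string Payload { get; set; }
    public DateTime ReceivedUtc { get; set; }
}

public class RecordWriter
{
    private readonly BlockingCollection<IncomingRecord> _queue = new BlockingCollection<IncomingRecord>();
    private readonly string _connectionString;
    private readonly Task _consumer;

    public RecordWriter(string connectionString)
    {
        _connectionString = connectionString;
        _consumer = Task.Run(ConsumeAsync); // background consumer
    }

    // Called on the receiving (main) thread: stamp the time first, then enqueue.
    public void Receive(string payload) =>
        _queue.Add(new IncomingRecord { Payload = payload, ReceivedUtc = DateTime.UtcNow });

    // Call on shutdown so the consumer flushes the remainder and exits.
    public Task CompleteAsync()
    {
        _queue.CompleteAdding();
        return _consumer;
    }

    private async Task ConsumeAsync()
    {
        var batch = new DataTable();
        batch.Columns.Add("Payload", typeof(string));
        batch.Columns.Add("ReceivedUtc", typeof(DateTime));

        // Blocks until items arrive; the loop ends when CompleteAdding() is called.
        foreach (var record in _queue.GetConsumingEnumerable())
        {
            batch.Rows.Add(record.Payload, record.ReceivedUtc);
            if (batch.Rows.Count >= 5000) // tune the batch size to your workload
            {
                await FlushAsync(batch);
                batch.Clear();
            }
        }
        if (batch.Rows.Count > 0) await FlushAsync(batch); // flush the remainder
    }

    private async Task FlushAsync(DataTable batch)
    {
        using (var connection = new SqlConnection(_connectionString))
        {
            await connection.OpenAsync();
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.IncomingRecords"; // hypothetical table
                bulk.ColumnMappings.Add("Payload", "Payload");
                bulk.ColumnMappings.Add("ReceivedUtc", "ReceivedUtc");
                await bulk.WriteToServerAsync(batch);
            }
        }
    }
}

BlockingCollection makes the producer/consumer handoff thread-safe without manual locking, and SqlBulkCopy keeps the per-row insert cost far below issuing one INSERT statement per record.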
CodePudding user response:
If your constraints (i.e. RDBMS, real-time inserts) are taken out, you have a better set of choices:
- A NoSQL DB geared: a. towards heavy writes (and no/few updates) -- Cassandra; b. more read-intensive than write-intensive -- HBase; c. MongoDB / DynamoDB etc...
- Dropping the real-time-insert requirement -- queue/bulk ingestion for performance (see the sketch after this list). Also, if you do not need to maintain read consistency or isolation, NoSQL becomes an easy choice.
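To make the queue/bulk-ingestion point concrete in .NET terms (a hypothetical sketch, not tied to any particular RDBMS; the capacity and batch size are placeholders): a bounded channel gives producers backpressure when the writer falls behind, and a consumer drains it in batches to hand off to whatever bulk-insert routine you use.

using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class IngestionPipeline
{
    // Bounded capacity: producers wait instead of exhausting memory
    // if the database cannot keep up.
    private static readonly Channel<string> _channel =
        Channel.CreateBounded<string>(new BoundedChannelOptions(100_000)
        {
            FullMode = BoundedChannelFullMode.Wait
        });

    public static ValueTask ProduceAsync(string record) =>
        _channel.Writer.WriteAsync(record);

    // flush is the caller's bulk-insert routine (e.g. SqlBulkCopy, COPY, etc.).
    public static async Task ConsumeAsync(Func<IReadOnlyList<string>, Task> flush)
    {
        var batch = new List<string>(10_000);
        await foreach (var record in _channel.Reader.ReadAllAsync())
        {
            batch.Add(record);
            if (batch.Count >= 10_000)
            {
                await flush(batch);
                batch.Clear();
            }
        }
        if (batch.Count > 0) await flush(batch); // remainder after the writer completes
    }
}

With FullMode = Wait, WriteAsync simply waits while the channel is full, which caps memory use if the database cannot keep up with the incoming stream.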
If you have to choose an RDBMS, you are looking at multiple areas to take care of:
- Concurrency at the insert points: partition the table and spread the partitions within the same instance, then distribute those across shards.
- Connections to the database and total sessions: choice and configuration of the load balancer.
- Type of inserts, long or short (how much data per row): table design.
- Storage subsystem / scalability: the storage subsystem will determine your latency and throughput, and data retention will determine your growth and scalability.
- Memory: each session to the database has a memory footprint, so you have to provision your database and hosts accordingly.
- Type of workload: inserts only, or a mixture of inserts and updates. SELECT statements (reads) will need to be offloaded to a replica / read-only standby.
- Durability: RDBMSs have ACID properties, of which D is for durability, implemented via write-ahead logging (WAL). This generates transaction logs / redo logs / WAL files / binlogs (different names depending on the RDBMS you choose). These are disk-hungry files; managing them is critical for recovery and is not trivial. Heavy inserts will create heavy transaction logs.
- Maintainability: expecting 24x7 high-frequency transactions will limit how easily you can plan schema changes, DDL operations like creating indexes, and other maintenance work on the table taking those transactions, yet database changes become imperative to launch new features. Most modern RDBMSs support JSON and other formats for storing data, like XML and hstore, so you might like to implement a schemaless transactional system. Something like https://eng.uber.com/schemaless-part-one-mysql-datastore/ or https://backchannel.org/blog/friendfeed-schemaless-mysql (see also http://gotocon.com/dl/goto-aar-2014/slides/MartyWeiner_ScalingPinterest.pdf).
- Fault-tolerance features on the load balancers for the shards.
Downstream teams, like analytics / data warehouse, will require the data at some interval, say hourly / daily. The data might be provided by various methods:
- downstream pull;
- publish, with downstream subscribe;
- CDC (change data capture) through WAL / redo / bin / transaction logs.