Limit before sharding or partitioning a table


I am new to database system design. After reading many articles, I am confused about the limit up to which we should keep a single table rather than shard or partition it. I know it is really hard to provide a generic answer and that things depend on factors like

  • size of row
  • kind of data (strings, blobs, etc)
  • number of active queries
  • what kind of queries
  • indexes
  • read heavy/write heavy
  • the latency expected

But when someone asks:

  • What will you do if you have 1 billion rows and a million rows getting added every day? The latency needs to be less than 5 ms for 4 read, 1 write, and 2 update queries over such a big database, etc.
  • What will your choice be if you have only 10 million rows but the updates and reads are heavy? The number of new rows added is not significant. High consistency and low latency are the requirements.

If there are fewer than a million rows and the table grows by only thousands of rows, then the choice is simple. But it gets trickier when the choice involves millions or billions of rows.

Note: I have not mentioned a latency number in my question. Please answer according to whatever latency number is acceptable to you. Also, we are talking about structured data.

I am not sure but I can add 3 specific questions:

  • Let's say that you choose an SQL database for Amazon or any e-commerce order management system. The order count is increasing every day by a million. There are already 1 billion records. Assume that there is no archival of data. The read load is heavy, more than a thousand queries per second, and there are writes as well. The read:write ratio is 100:1.
  • Let's take an example with smaller numbers now. Let's say that you choose an SQL database for abc or any e-commerce order management system. The order count is increasing every day by thousands. There are already 10 million records. Assume that there is no archival of data. The read load is heavy, more than ten thousand queries per second, and there are writes as well. The read:write ratio is 10:1.
  • 3rd example: free goodies distribution. We have 10 million goodies to be distributed, 1 goodie per user. High consistency and low latency are the aim. Let's assume that 20 million users are already waiting for this free distribution, and once the time starts, all of them will try to get the free goodies.

Note: In the whole question, the assumption is that we will go with SQL solutions. Also, please ignore it if a provided use case doesn't make sense logically. The aim is to get the knowledge in terms of numbers.

Can someone please help with what the benchmarks are? Any practical numbers from the project you are currently working on which show that, for such a big database with this many queries, this is the latency observed. Anything which can help me justify the choice of the number of tables for a certain number of queries at a particular latency.

CodePudding user response:

Some answers for MySQL. Since all databases are limited by disk space, network latency, etc., other engines may be similar.

  • A "point query" (fetching one row using a suitable index) takes milliseconds regardless of the number of rows.
  • It is possible to write a SELECT that will take hours, maybe even days, to run. So you need to understand whether the queries are pathological like this. (I assume this is an example of high "latency".)
  • "Sharding" is needed when you cannot sustain the number of writes needed on a single server.
  • Heavy reads can be scaled 'infinitely' by using replication and sending the reads to Replicas. (A replica-setup sketch follows this list.)
  • PARTITIONing (especially in MySQL) has very few uses. More details: Partition
  • INDEXes are very important for performance.
  • For Data Warehouse apps, building and maintaining "Summary tables" is vital for performance at scale. (Some other engines have built-in tools for such; a summary-table sketch follows this list.)
  • INSERTing one million rows per day is not a problem. (Of course, there are schema designs that could make this a problem.) Rules of Thumb: 100/second is probably not a problem; 1000/sec is probably possible; it gets harder after that. (A batched-insert sketch follows this list.) More on high speed ingestion
  • Network latency is mostly determined by how close the client and server are. It takes over 200ms to reach the other side of the earth. On the other hand, if the client and server are in the same building, latency is under 1ms. If, instead, you are referring to how long it takes to run a query, then here are a couple of Rules of Thumb: 10ms for a simple query that needs to hit an HDD; 1ms for SSD.
  • UUIDs and hashes are very bad for performance if the data is too big to be cached in RAM.
  • I have not said anything about read:write ratio because I prefer to judge reads and writes independently.
  • "Ten thousand reads per second" is hard to achieve; I suggest that very few apps really need such. Or they can find better ways to achieve the same goals. How fast can one user issue a query? Maybe one per second? How many users can be connected and active at the same time? Hundreds.
  • (my opinion) Most benchmarks are useless. Some benchmarks can show that one system is twice as fast as another. So what? Some benchmarks say that when you have more than a few hundred active connections, throughput stagnates and latency heads toward infinity. So what? After you have an app running for some time, capturing the actual queries is perhaps the best benchmark. But even that has limited uses.
  • Almost always a single table is better than splitting up the table (multiple tables; PARTITIONing; sharding). If you have a concrete example, we can discuss the pros and cons of the table design.
  • Size of row and kinds of data -- Large columns (TEXT/BLOB/JSON) are stored "off-record", thereby leading to [potentially] an extra disk hit. Disk hits are the most costly part of any query.
  • Active queries -- After a few dozen, the queries stumble over each other. (Think about a grocery store with lots of shoppers pushing carts -- with "too many" shoppers, each takes a long time to finish.)
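
To make the "point query" and INDEX bullets concrete, here is a minimal sketch. The orders table, its columns, and the literal values are hypothetical, not taken from the question.

    -- Hypothetical orders table; InnoDB clusters the data by the PRIMARY KEY.
    CREATE TABLE orders (
      order_id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      customer_id  BIGINT UNSIGNED NOT NULL,
      status       VARCHAR(20)     NOT NULL,
      created_at   DATETIME        NOT NULL,
      total_cents  BIGINT          NOT NULL,
      PRIMARY KEY (order_id),
      -- Composite secondary index so per-customer lookups touch only a few blocks.
      INDEX idx_customer_created (customer_id, created_at)
    ) ENGINE=InnoDB;

    -- Point query: one row via the PRIMARY KEY. A handful of B-tree block reads,
    -- hence a few milliseconds even with a billion rows in the table.
    SELECT *
      FROM orders
     WHERE order_id = 123456789;

    -- Bounded range query riding the composite index: it reads only this
    -- customer's recent rows, never the whole table.
    SELECT order_id, status, total_cents
      FROM orders
     WHERE customer_id = 42
       AND created_at >= NOW() - INTERVAL 30 DAY;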
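
For the read-scaling bullet, a rough sketch of pointing a replica at the primary (MySQL 8.0.23+ syntax; host, user, and password are placeholders). The application still has to route its SELECTs to the replicas.

    -- Run on each replica; SOURCE_AUTO_POSITION requires GTID-based replication.
    CHANGE REPLICATION SOURCE TO
      SOURCE_HOST = 'primary.example.internal',
      SOURCE_USER = 'repl',
      SOURCE_PASSWORD = '********',
      SOURCE_AUTO_POSITION = 1;

    START REPLICA;

    -- Writes keep going to the primary; read-only traffic is spread across
    -- however many replicas you add.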
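
For the Summary-table bullet, a hedged sketch of one way to maintain a daily rollup of the hypothetical orders table above; the granularity and column names are assumptions.

    -- Daily rollup keyed by date and status.
    CREATE TABLE orders_daily_summary (
      summary_date DATE         NOT NULL,
      status       VARCHAR(20)  NOT NULL,
      order_count  INT UNSIGNED NOT NULL,
      total_cents  BIGINT       NOT NULL,
      PRIMARY KEY (summary_date, status)
    ) ENGINE=InnoDB;

    -- Fold in yesterday's rows (run once per day, e.g. from cron).
    INSERT INTO orders_daily_summary (summary_date, status, order_count, total_cents)
    SELECT DATE(created_at), status, COUNT(*), SUM(total_cents)
      FROM orders
     WHERE created_at >= CURDATE() - INTERVAL 1 DAY
       AND created_at <  CURDATE()
     GROUP BY DATE(created_at), status
    ON DUPLICATE KEY UPDATE
      order_count = order_count + VALUES(order_count),
      total_cents = total_cents + VALUES(total_cents);

    -- Reports hit the small summary table instead of scanning the billion-row table.
    SELECT summary_date, SUM(order_count) AS orders, SUM(total_cents) AS revenue_cents
      FROM orders_daily_summary
     WHERE summary_date >= CURDATE() - INTERVAL 30 DAY
     GROUP BY summary_date;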
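
For the ingestion bullet, a sketch of the kind of batching that keeps insert rates in the thousands per second; the batch size, file path, and values are purely illustrative.

    -- Multi-row INSERTs amortize per-statement and network overhead: one statement
    -- carrying hundreds of rows is far cheaper than hundreds of single-row INSERTs.
    INSERT INTO orders (customer_id, status, created_at, total_cents) VALUES
      (42, 'NEW', NOW(), 1999),
      (43, 'NEW', NOW(), 2599),
      (44, 'NEW', NOW(), 1099);
      -- ...typically a few hundred to a few thousand rows per statement

    -- If the data arrives as a file, LOAD DATA is usually faster still.
    LOAD DATA LOCAL INFILE '/tmp/orders.csv'
      INTO TABLE orders
      FIELDS TERMINATED BY ','
      (customer_id, status, created_at, total_cents);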

When you get into large databases, they fall into a few different types; each with somewhat different characteristics.

  • Data Warehouse (sensors, logs, etc) -- appending to 'end' of the table; Summary Tables for efficient 'reports'; huge "Fact" table (optionally archived in chunks); certain "dimension tables".
  • Search (products, web pages, etc) -- EAV is problematical; FULLTEXT is often useful.
  • Banking, order processing -- This gets heavily into the ACID features and the need to craft transactions. (A transaction sketch follows this list.)
  • Media (images and videos) -- How to store the bulky objects while making searching (etc) reasonably fast.
  • 'Find nearest' -- Needs a 2D index, either SPATIAL or some of the techniques here. (A SPATIAL sketch also follows this list.)
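
To make the transaction bullet (and the free-goodies scenario from the question) concrete, here is a hedged sketch of handing out one item per user under contention. The goodie table, its columns, and the ids are hypothetical, and FOR UPDATE SKIP LOCKED needs MySQL 8.0+.

    -- One row per goodie; claimed_by stays NULL until someone takes it.
    -- The UNIQUE index allows many NULLs but at most one claimed row per user.
    CREATE TABLE goodie (
      goodie_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      claimed_by BIGINT UNSIGNED NULL,
      claimed_at DATETIME        NULL,
      PRIMARY KEY (goodie_id),
      UNIQUE KEY uq_claimed_by (claimed_by)
    ) ENGINE=InnoDB;

    START TRANSACTION;

    -- Lock one still-unclaimed row without waiting on rows that other sessions
    -- have already locked; an empty result means the stock has run out.
    SELECT goodie_id
      FROM goodie
     WHERE claimed_by IS NULL
     LIMIT 1
     FOR UPDATE SKIP LOCKED;

    -- Suppose the SELECT returned goodie_id = 42 and the current user is 123.
    UPDATE goodie
       SET claimed_by = 123,
           claimed_at = NOW()
     WHERE goodie_id = 42;

    COMMIT;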
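
And for the 'Find nearest' bullet, a sketch of the SPATIAL-index approach in MySQL 8.0; the place table and the coordinates are made up. A bounding box does the index-assisted filtering, and spherical distance orders the survivors.

    -- SRID 0 with X = longitude, Y = latitude; the SRID attribute lets the
    -- optimizer actually use the SPATIAL index.
    CREATE TABLE place (
      place_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      name     VARCHAR(100)    NOT NULL,
      location POINT           NOT NULL SRID 0,
      PRIMARY KEY (place_id),
      SPATIAL INDEX idx_location (location)
    ) ENGINE=InnoDB;

    -- Cut candidates down with a bounding box, then order by spherical distance.
    SELECT place_id, name,
           ST_Distance_Sphere(location, POINT(-73.99, 40.75)) AS meters
      FROM place
     WHERE MBRContains(
             ST_GeomFromText('POLYGON((-74.02 40.73, -73.96 40.73, -73.96 40.77, -74.02 40.77, -74.02 40.73))'),
             location)
     ORDER BY meters
     LIMIT 10;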