Do I need a NoSQL solution like Elasticsearch to query 2 billion rows in one table, or is MySQL enough?


I have an application that will go to production soon. It will insert approximately 10 million rows into one table with 16 columns in an Amazon Aurora MySQL database. The column types are bigint, int, bit, and datetime. The other tables in the database have fewer than 3 thousand rows. Our SQL queries will use a few inner joins, and the WHERE clause will only contain a datetime range over the last 6 months and the bigint value of one column on that large table. That means the large table will hold about 2 billion rows.
Data older than 6 months will be deleted from that large table. If I put an index on the date column, it will probably slow the INSERT command, and querying may still be slow. If I instead use Elasticsearch, create an application task that inserts rows into multiple shards grouped by date every 5 minutes, create MySQL database backups every 3 days, and delete rows older than 3 days, maybe querying the data will be faster. What do you think?
Is it better and more efficient to use Elasticsearch, or is MySQL enough?

CodePudding user response:

It depends on how you use your data (analysis and log handling are two completely different purposes) and how you access it.

What are you worried about?

  • Query times: indexing the column that is queried most will increase speed (see the sketch after this list). Designing this table with a snowflake schema may also help, and there are other options, such as a column-oriented DBMS, that could improve speeds. It also depends on the type of data you are storing; if you are working mostly with dates, Timescale could help. If this is all about logging, Elasticsearch handles it pretty nicely, but it sounds like you already know what data you will store.
  • Storage issues: if you are worried about really large tables, there are distributed database systems (e.g. VoltDB) that also provide redundancy.
  • The last thing I can think of is how it fits into your architecture. If you are already using MySQL, 10 million rows do not require you to change much; the points above apply only if the data keeps growing and you need a more specific solution.
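A minimal sketch of the first point (the table and column names here are hypothetical, since the actual schema was not posted): add a secondary index on the most-queried column and use EXPLAIN to confirm the optimizer actually picks it.

-- Hypothetical names; adjust to the real schema.
ALTER TABLE events ADD INDEX idx_events_account (account_id);

-- The "key" column of the EXPLAIN output should show idx_events_account.
EXPLAIN SELECT * FROM events WHERE account_id = 12345;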

CodePudding user response:

If MySQL only...

1 billion or 10 million -- MySQL can handle either. But there are limitations on what queries will run "fast".

I need to see a specific query.

Purging old data -- Plan on PARTITIONing by the datetime column; otherwise, the DELETE will be terribly slow. See Partition. For "6 months", I recommend 8 monthly or 29 weekly partitions. See the link for a discussion of the starting and "future" partitions.
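A minimal sketch of such a layout, with hypothetical table, column, and partition names (note that in MySQL the partition column must be part of every PRIMARY/UNIQUE key):

-- Hypothetical schema: monthly RANGE partitions on the datetime column.
CREATE TABLE events (
    id         BIGINT NOT NULL AUTO_INCREMENT,
    account_id BIGINT NOT NULL,
    created_at DATETIME NOT NULL,
    -- ... remaining columns ...
    PRIMARY KEY (id, created_at)   -- partition column must be in the PK
)
PARTITION BY RANGE COLUMNS (created_at) (
    PARTITION p2022_01 VALUES LESS THAN ('2022-02-01'),
    PARTITION p2022_02 VALUES LESS THAN ('2022-03-01'),
    -- ... one partition per month, plus a catch-all "future" partition ...
    PARTITION pfuture  VALUES LESS THAN (MAXVALUE)
);

-- Purging a month becomes a fast metadata operation instead of a huge DELETE:
ALTER TABLE events DROP PARTITION p2022_01;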

DATE / DATETIME

It is usually unwise to have an index starting with the datetime. I need to see your query to go into more detail on what the index should look like.
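As a hedged illustration only (assuming, as the question suggests, an equality test on a bigint column plus a datetime range), the usual pattern is to put the equality column first and the range column last:

-- Hypothetical names; equality column first, datetime (range) column last.
ALTER TABLE events ADD INDEX idx_account_created (account_id, created_at);
-- The reversed order (created_at, account_id) would start with the range scan
-- and make the second column far less useful for this query shape.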

A DATE column is equivalent to a DATETIME for midnight of that morning. I recommend the following pattern for testing against a date or datetime range. It avoids leap-year issues, midnight issues, etc. This correctly tests for a one-week range:

WHERE d >= '2022-04-24'
  AND d  < '2022-04-24' + INTERVAL 7 DAY

You want a 1 day range?

WHERE d >= '2022-04-24'
  AND d  < '2022-04-24' + INTERVAL 1 DAY

Or noon to noon the next day:

WHERE d >= '2022-04-24 12:00:00'
  AND d  < '2022-04-24 12:00:00' + INTERVAL 24 HOUR

Those work for DATE, DATETIME, TIMESTAMP -- with or without fractional seconds.
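As an aside, with the hypothetical partitioned table sketched above, a plain EXPLAIN (MySQL 5.7+/8.0 includes a partitions column in its output) lets you check that such a range prunes to only the relevant partitions:

EXPLAIN SELECT COUNT(*) FROM events
WHERE created_at >= '2022-04-24'
  AND created_at <  '2022-04-24' + INTERVAL 7 DAY;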

Extra indexes are not a big deal. By "10M rows", did you mean 10M/day? That is 120/second (or more during spikes). That is "moderate", assuming SSD drives are being used.

Is your INSERT application single-threaded? Can it batch, say, 100 rows at a time?
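A minimal sketch of batching (hypothetical names again): one multi-row INSERT is far cheaper than 100 single-row statements.

-- One round trip and one commit instead of 100.
INSERT INTO events (account_id, created_at, flag)
VALUES (1, '2022-04-24 12:00:00', 1),
       (2, '2022-04-24 12:00:01', 0),
       (3, '2022-04-24 12:00:02', 1);
-- ... extend to roughly 100 rows per statement ...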

If latitude/longitude are involved, say so; that is a different kettle of fish.

Will 2 billion rows slow down Inserts? I need to see the tentative list of indexes (PRIMARY, UNIQUE, SPATIAL, FULLTEXT, and others). Properly designed, I don't see a problem.

Normalization

You should normalize the Fact table to help with disk space (hence speed), but don't over-normalize. To advise further, I need a feel for the data, not just the datatypes. Do not normalize datetime columns or any other columns to be tested as a "range"; such values need to be in the Fact table.
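As a hedged example only (hypothetical names, and assuming the Fact table would otherwise repeat some descriptive value in every row): move that value into a small lookup table and keep only a short id in the Fact table, while the range-tested datetime column stays in the Fact table.

-- Small lookup table; 2 billion fact rows then carry a 2-byte id instead of a string.
CREATE TABLE devices (
    device_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(50) NOT NULL,
    UNIQUE KEY (name)
);
-- The Fact table keeps devices.device_id plus its own datetime column inline.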

What you have said so far does not indicate the need for sharding (in a MySQL-only implementation).

(I cannot address whether ElasticSearch would be better or worse than MySQL. NoSQL requires re-inventing much of what MySQL or ES do automatically.)
