We have data in the range of 100 terabytes. We need to run pre-defined reports on this data (not a lot of ad hoc reporting). For the pre-defined reports, the expected turnaround time is tens of seconds; it should not be minutes.
We are all-in on AWS serverless so far. The original data source is DynamoDB. What toolset should we use? Is Redshift Serverless overkill for this? I hear data should be at least petabyte scale before it's worth using.
The 100 terabytes of data are expected to accumulate over the next 5 years.
I have been leaning towards streaming the data to S3 and using Athena for querying, but the unpredictability of Athena's query performance is making us anxious.
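For context, a very rough sketch of the ingestion path we are picturing: a Lambda on the DynamoDB stream forwards change records into a Kinesis Data Firehose delivery stream that lands in S3 (the stream name below is just a placeholder):

```python
import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    """Lambda on the DynamoDB stream: forward new/updated items to Firehose,
    which buffers and writes them to S3 for Athena (or Spectrum) to query."""
    records = []
    for rec in event["Records"]:
        if rec["eventName"] in ("INSERT", "MODIFY"):
            item = rec["dynamodb"].get("NewImage", {})
            records.append({"Data": (json.dumps(item) + "\n").encode("utf-8")})
    if records:
        # Assumes small Lambda batch sizes; put_record_batch caps at 500 records.
        firehose.put_record_batch(
            DeliveryStreamName="reports-to-s3",  # placeholder delivery stream name
            Records=records,
        )
```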
Is Redshift / Redshift Serverless / Spectrum overkill for this use case?
Any better toolset that I should explore?
CodePudding user response:
Given the light query load but large data volume, and since the data will be in S3, I'd go with Redshift Serverless, with the bulk of the data kept in S3 and accessed via Spectrum. This gives you the full functionality of Redshift without paying for a cluster that's up full time.
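As a minimal sketch of that setup, assuming placeholder names for the workgroup, database, IAM role, and tables: register the S3 data (already in the Glue Data Catalog) as an external schema, then run the pre-defined reports against it through Spectrum using the Redshift Data API.

```python
import boto3

# The Redshift Data API can target a Serverless workgroup, so there is no
# cluster or persistent connection to manage from your Lambdas.
rsd = boto3.client("redshift-data", region_name="us-east-1")

def run_sql(sql: str) -> str:
    """Submit a statement to a Serverless workgroup and return its statement id."""
    resp = rsd.execute_statement(
        WorkgroupName="reports",  # placeholder Redshift Serverless workgroup
        Database="dev",           # placeholder database
        Sql=sql,
    )
    return resp["Id"]

# Expose the S3 data as an external schema so Spectrum can scan it in place.
run_sql("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS events_ext
    FROM DATA CATALOG
    DATABASE 'analytics'
    IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-access'  -- placeholder role
""")

# A pre-defined report: Spectrum scans only the partitions the query touches.
run_sql("""
    SELECT event_date, customer_id, COUNT(*) AS events
    FROM events_ext.orders              -- external table over S3 (placeholder)
    WHERE event_date BETWEEN '2023-01-01' AND '2023-01-31'
    GROUP BY 1, 2
""")
```

Whether frequently queried aggregates eventually live in Redshift-managed storage or everything stays in S3 behind Spectrum is mostly a cost/latency trade-off you can tune later.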
CodePudding user response:
If you can stream data to S3 and your pre-defined reports aggregate over a certain time window (like hour/day/week), I would suggest using a database that creates rollups or materialized views so your queries run fast (which is one of your requirements). With rollups or materialized views, you are querying a much smaller table, saving you money on storage and on compute resources at query time. At the same time, you might want to pay attention to the ease of use and cost of creating rollups or materialized views.
Granted, I don't know what your data looks like, but being able to handle things like nested JSON, which is common with DynamoDB, is probably another important requirement. Some systems can handle nested JSON out of the box, and some will require ETL tooling to 'flatten' the data.
Here are a few I know of that can do rollups and/or materialized views (a rough sketch of the pattern follows the list):
- Rockset
- Druid
- Elasticsearch
- Clickhouse
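To make the rollup/materialized-view idea concrete, here is a minimal sketch of the pattern in ClickHouse (just one of the options above; host, table, and column names are made up):

```python
import clickhouse_connect

# Assumes a reachable ClickHouse server; host and credentials are placeholders.
client = clickhouse_connect.get_client(host="localhost")

# Raw events as they stream in (e.g. flattened DynamoDB change records).
client.command("""
    CREATE TABLE IF NOT EXISTS events_raw (
        event_time  DateTime,
        customer_id String,
        amount      Float64
    )
    ENGINE = MergeTree
    ORDER BY (event_time, customer_id)
""")

# Daily rollup maintained automatically on insert: the pre-defined report
# reads this small table instead of scanning the raw event history.
client.command("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_daily
    ENGINE = SummingMergeTree
    ORDER BY (event_date, customer_id)
    AS
    SELECT
        toDate(event_time) AS event_date,
        customer_id,
        count()            AS events,
        sum(amount)        AS revenue
    FROM events_raw
    GROUP BY event_date, customer_id
""")

# The report query is a cheap scan over the rollup, re-aggregating any
# not-yet-merged partial rows.
rows = client.query(
    "SELECT event_date, sum(events), sum(revenue) "
    "FROM events_daily GROUP BY event_date ORDER BY event_date"
).result_rows
```

The report then scans the small daily table rather than the raw history, which is what keeps query times in the tens of seconds no matter how much raw data accumulates over the five years.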
Good luck!