What's the best database for aggregating large volumes (both in count and size) of complex JSON data in real time? For reference, a single JSON object may vary from 2 MB to 100 MB and include nested arrays with no known structure, and the table/collection size can easily reach 10 TB in a month.
Due to the randomized nature of the data, we need to aggregate it to extract meaningful values and relationships. On the surface, Elasticsearch was a good candidate; however, it lacked the aggregation capabilities we needed to change and transform the data's shape, as most of the time we have to flatten nested arrays to extract a value.
We also tried, and are currently running on, MongoDB, optimized to the bone. So far we haven't encountered any problem transforming the data, but due to the lack of normalization and the large document sizes, queries often take a long time, e.g. 1-3 minutes.
For example, the shape of the data as sent from clients looks like this:-
{
  "metadata": { .... },
  "input": { .... },
  "output": {
    "problems": [{ "id": 1 }, { "id": 2 }, ....],
    ... multi-level nested JSON (including arrays)
  }
}
To determine all problems for a given client, we need to (see the sketch after this list):-
- Find client documents based on metadata fields
- Flatten all problems from output
- Merge all problems
- Deduplicate problems
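A minimal sketch of that pipeline in MongoDB's aggregation framework, assuming a "reports" collection and a "metadata.clientId" field (both hypothetical placeholders):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["mydb"]  # placeholder database name

pipeline = [
    # 1. Find client documents based on metadata fields
    {"$match": {"metadata.clientId": "client-123"}},
    # 2. Flatten all problems from output
    {"$unwind": "$output.problems"},
    # 3 + 4. Merge and deduplicate problems by id
    {"$group": {"_id": "$output.problems.id"}},
]
problem_ids = [doc["_id"] for doc in db.reports.aggregate(pipeline, allowDiskUse=True)]

The $unwind over multi-megabyte nested arrays is likely where most of the query time goes.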
I hope it's clear that this is neither a scalable nor an efficient solution.
Notes about data:-
- Every document is self-contained and doesn't need any external joins
- All documents of a similar or identical type are in a separate collection
- Relationships are often extracted by combining a single field from all documents (see the sketch below)
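For instance, continuing the PyMongo sketch above, pulling one field's values across every document in a collection (the field name is a hypothetical placeholder):

# Collect a single field's values across all documents in the collection
client_ids = db.reports.distinct("metadata.clientId")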
Limitations:-
We will not be able to preprocess or normalize the data, as it needs to be available and computed in real time.
I certainly know there is no one-size-fits-all solution, especially with random and unstructured data. My aim is to optimize reading/querying time; writes aren't really that important. No crazy-high speeds needed: anything from 1-5 seconds is considered super optimal.
CodePudding user response:
This seems like a problem for big data. Have you looked into HDFS? There are a number of solutions in that ecosystem that fit your problem. I think the root of your issue is that you are trying to find an answer in traditional databases, where a big data framework could be the better fit.
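To make that concrete, here is a hedged sketch of the same flatten-and-deduplicate query with Spark over JSON files landed on HDFS (the path and field names are hypothetical placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("client-problems").getOrCreate()

# Read the raw JSON documents from HDFS; Spark infers the nested schema.
df = spark.read.json("hdfs:///data/reports/*.json")  # placeholder path

problem_ids = (
    df.filter(col("metadata.clientId") == "client-123")     # find client documents
      .select(explode("output.problems").alias("problem"))  # flatten the nested array
      .select("problem.id")
      .distinct()                                            # deduplicate
)
problem_ids.show()

Spark distributes the scan across the cluster, which is the kind of horizontal scaling that could help at the 10 TB/month scale described above.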