Design Consideration - Choosing MongoDB as a Blob Store

As I understand it, MongoDB supports blob storage. However, from a system design perspective, is MongoDB a good choice for storing blobs (such as JPEG, document, and video files)?

What could be the technical pros and cons of using MongoDB as a blob store?

CodePudding user response：

In an ideal world I would prefer to store the files/blobs in a filesystem and keep only a reference to them in the DB. If there is still a pressing need to store the objects in MongoDB itself :) then we can use something like GridFS

https://www.mongodb.com/docs/manual/core/gridfs/

which allows us to store objects larger than MongoDB's 16 MB BSON document size limit.
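To make the chunking idea concrete, here is a rough plain-Python sketch of how a large blob can be split into pieces that each stay well under the 16 MB document limit. It only imitates the shape of GridFS's two collections (one metadata document plus numbered chunk documents); the function names are my own, the hash-based id is a stand-in for a real ObjectId, and none of this is the driver's actual implementation.

```python
import hashlib
from datetime import datetime, timezone

# Default GridFS chunk size (255 KiB) and MongoDB's BSON document limit (16 MiB).
DEFAULT_CHUNK_SIZE = 255 * 1024
MAX_BSON_DOCUMENT_SIZE = 16 * 1024 * 1024


def split_into_gridfs_style_docs(data, filename, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return (file_doc, chunk_docs) shaped like GridFS's files/chunks collections."""
    # Hypothetical id: a real driver would generate an ObjectId instead.
    file_id = hashlib.sha256(data).hexdigest()[:24]
    file_doc = {
        "_id": file_id,
        "length": len(data),
        "chunkSize": chunk_size,
        "uploadDate": datetime.now(timezone.utc),
        "filename": filename,
    }
    # Each chunk document carries the parent file id, its sequence number n,
    # and a slice of the payload no larger than chunk_size.
    chunk_docs = [
        {"files_id": file_id, "n": n, "data": data[offset:offset + chunk_size]}
        for n, offset in enumerate(range(0, len(data), chunk_size))
    ]
    return file_doc, chunk_docs


def reassemble(chunk_docs):
    """Rebuild the original bytes by concatenating chunks in order of n."""
    ordered = sorted(chunk_docs, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)
```

With the default chunk size, a 20 MiB input ends up as 81 chunk documents, none of which comes anywhere near the 16 MB limit.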

This would still limit how I can spread the files across the filesystem. I haven't ever done it on a production-scale system, but I feel the maintenance, scaling, and sharding would all have to be managed by you (unless you are on Atlas). Pros could be quick retrieval and one less lookup, since you skip the hop from a DB reference to the filesystem/S3. Cons: it might just not be as scalable a solution when it comes to high volumes.
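For comparison, the filesystem-plus-reference approach I mentioned first could look roughly like the sketch below. It is only an illustration under assumptions: SQLite stands in for "the DB", the `blobs` table and the function names are made up, and the content-hash directory layout is just one way to spread files across directories.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def open_metadata_db():
    """Create an in-memory stand-in for the database that holds blob references."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE blobs (id TEXT PRIMARY KEY, name TEXT, size INTEGER, path TEXT)"
    )
    return db


def store_blob(db, root, name, data):
    """Write the bytes to disk and record only a reference to them in the DB."""
    blob_id = hashlib.sha256(data).hexdigest()
    # Fan out by hash prefix so no single directory grows too large.
    path = Path(root) / blob_id[:2] / blob_id[2:4] / blob_id
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    db.execute(
        "INSERT OR REPLACE INTO blobs (id, name, size, path) VALUES (?, ?, ?, ?)",
        (blob_id, name, len(data), str(path)),
    )
    db.commit()
    return blob_id


def load_blob(db, blob_id):
    """Look up the reference in the DB, then read the bytes from the filesystem."""
    row = db.execute("SELECT path FROM blobs WHERE id = ?", (blob_id,)).fetchone()
    if row is None:
        raise KeyError(blob_id)
    return Path(row[0]).read_bytes()
```

Note the two-step read in `load_blob` (DB lookup, then file read): that extra hop is the lookup that storing the blob directly in MongoDB would avoid.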

Also, I'm not too sure whether we can have a TTL on objects in GridFS; there is definitely nothing comparable to an S3 lifecycle policy.

CodePudding user response：

These days we have many possible storage solutions to choose from, depending on the use case. Deciding which one is actually better, though, comes down to the specifics.

Since you are asking from a design perspective: in many cases, though definitely not all, an object storage solution is the preferred place to store blobs (considering modern cloud and storage solutions; in my examples here I'll be referring to AWS and S3 specifically).
You can think of an object storage solution as the "blob version of a key-value database", where each object can also carry metadata.

A quick side-by-side comparison of the two on specific topics, with the possible advantages and disadvantages of each:

Object Storage (S3)

Pricing: Cheaper out of the box, especially for relatively small amounts of data, with competitive pricing that comes mainly from its native large scale. Details on S3 pricing: https://aws.amazon.com/s3/pricing/
Redundancy & resiliency: Resilient and redundant enough out of the box for many starting and growing use cases (S3 is designed for 99.999999999% durability), and it includes native CRR (Cross-Region Replication: https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html).
Scalability: Scales to big, possibly massive amounts of data with ease. Usually no action is needed, apart from performance tuning for exceptionally large workloads. S3 seems to win this one, including in terms of the attached costs.
Extended services: You can mount S3 as a file system using S3FS, and you can easily attach a CDN to S3 buckets using CloudFront as a native service (or an external CDN with relatively little effort). S3 seems to have the upper hand here.
Lifecycle Policies: S3 lifecycle rules can automatically transition objects between storage classes (listed here: https://aws.amazon.com/s3/storage-classes/) or expire them, which gives you more cost-optimization and customization options to choose from.
Compute & Memory resources: Scale automatically behind the scenes as needed, with no action on your side and no separate charge (S3's pricing includes these resources).
Indexing: There are no dedicated tools for indexing objects stored in S3. You will have to build or adopt an available solution.
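To give a feel for the lifecycle point above, here is a small sketch that builds a lifecycle configuration as a plain Python dict. The dict follows the shape that, to my knowledge, boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument; the helper name, the rule id, and the day counts are assumptions for illustration, and nothing here actually talks to AWS.

```python
import json


def build_lifecycle_configuration(rule_id, prefix, days_to_infrequent_access,
                                  days_to_glacier, days_to_expire):
    """Build one rule: tier objects down over time, then expire them."""
    # Only the ordering is checked here; S3 enforces its own minimum day
    # counts for some transitions, which this sketch does not validate.
    if not (days_to_infrequent_access < days_to_glacier < days_to_expire):
        raise ValueError("expected infrequent-access < glacier < expiration days")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_to_infrequent_access, "StorageClass": "STANDARD_IA"},
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": days_to_expire},
            }
        ]
    }


# Hypothetical values: objects under "uploads/" move to cheaper tiers after
# 30 and 90 days and are deleted after a year.
example_configuration = build_lifecycle_configuration(
    "tier-and-expire-uploads", "uploads/", 30, 90, 365
)
example_json = json.dumps(example_configuration, indent=2)
```

The point is that tiering and expiry are declared once as configuration and then carried out by the service, with no cleanup job of your own to run.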

MongoDB GridFS (dedicated to Blob storage)

Pricing: May get significantly more expensive quickly, and data replication or sharding across the cluster(s) can make that happen even sooner. Details on MongoDB Atlas (the managed MongoDB service) pricing: https://www.mongodb.com/pricing
Redundancy & resiliency: While MongoDB is great at handling scaling and specific scaling scenarios, you will still have to replicate your files, or add replication to the shards of the cluster, to maintain redundancy consistently. This adds to your costs and to cluster management and maintenance overhead (a managed MongoDB service eases the overhead but still adds to the costs), and the costs add up faster as the number of replicas grows.
Scalability: Without a managed service you will have to scale on your own (adding nodes, handling networking, setting up replica sets, sharding, etc.). Beyond the initial setup, cluster management will probably become more demanding as scale grows.
Extended services: Service integration relies on third-party integrations or on features developed by MongoDB. Currently S3 seems to have a broader toolset here.
Lifecycle Policies: With TTL indexes available for collections in MongoDB, you can define how long data should exist in your cluster, but currently not much more than that.
Compute & Memory resources: As the amount of data and the scaling needs grow, compute (CPU) and memory (RAM) needs might grow too, and you will have to add capacity, possibly at higher cost.
Indexing: Indexing comes as an easy-to-apply feature out of the box. You can add an index to the collection of your choice with the built-in tooling, which is a big advantage for MongoDB.
More details:
https://www.mongodb.com/docs/manual/indexes/
https://www.mongodb.com/docs/atlas/atlas-ui/indexes/
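On the lifecycle point for GridFS specifically: as far as I know, a TTL index works on a date field of a single collection, while GridFS keeps file metadata and chunk data in two separate collections, so expiring only the metadata documents would leave the chunks behind. Below is a plain-Python sketch of the cleanup such an expiry job would need to do; in-memory lists stand in for the two collections, and the function names are my own. With a real driver you would delete through the GridFS API instead (for example pymongo's `GridFSBucket.delete`), which removes both parts.

```python
from datetime import datetime, timedelta, timezone


def expired_file_ids(files, now, expire_after_seconds):
    """Ids of file documents whose uploadDate is older than the TTL window."""
    cutoff = now - timedelta(seconds=expire_after_seconds)
    return {doc["_id"] for doc in files if doc["uploadDate"] <= cutoff}


def purge_expired(files, chunks, now, expire_after_seconds):
    """Drop expired file documents together with all chunks that belong to them."""
    expired = expired_file_ids(files, now, expire_after_seconds)
    remaining_files = [doc for doc in files if doc["_id"] not in expired]
    # Removing only the file documents would leave these chunks orphaned.
    remaining_chunks = [doc for doc in chunks if doc["files_id"] not in expired]
    return remaining_files, remaining_chunks
```

Compared with an S3 lifecycle rule, this is logic you have to own and schedule yourself, and it only covers expiry, not tiering to cheaper storage.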

Lastly, no matter which solution you choose, keep in mind that serving blobs can consume a lot of bandwidth, and carefully examine the data transfer costs that might apply. More details about AWS's data transfer costs:
https://aws.amazon.com/ec2/pricing/on-demand/
https://aws.amazon.com/lambda/pricing/
https://aws.amazon.com/blogs/architecture/overview-of-data-transfer-costs-for-common-architectures/
https://aws.amazon.com/blogs/apn/aws-data-transfer-charges-for-server-and-serverless-architectures/

With all that said, and as mentioned at the start of this answer, each case is different.
For some use cases object storage will fit better (file system and block storage are options as well, of course), and in other cases MongoDB will be the better choice.
Add both to your system-design toolbox, and pick the right one for the given scenario after weighing the advantages and disadvantages, which may vary from case to case.

No one solution fits all.