Incremental Data Storage from MongoDB to Amazon S3 in Parquet Format


I have a database in MongoDB that is updated every minute from various user PCs. I want to store the data in an Amazon S3 bucket (preferably in Parquet, otherwise CSV). But I do not want to store the full MongoDB data in S3 every time. I only want to save the incremental data in S3.

I was thinking of using Kafka between MongoDB and S3, but there are two issues with that:
Issue 1: I do not know how to store the incremental data from MongoDB to S3 in Parquet/CSV format without any paid solution.
Issue 2: I do not know whether this is a good/practical solution.

Can anyone suggest a way to achieve this kind of job, please?

CodePudding user response:

Parquet is most advantageous when you save large amounts of data, say 10k+ rows at a time. Since you are talking about incremental records saved every minute, you will get at most 1-4k records per batch, so saving each batch as Parquet will not be helpful here. Instead:

  1. Use JSON -- the advantage being you don't have to worry about special characters/encoding, column placement, nested columns, etc.; a JSON parser such as Gson takes care of all of that. In other words, read from the Mongo CDC stream and write one JSON file at the end of every minute (writing at the end of the minute ensures you get one fat file instead of 60 small files). Keep in mind that S3 bills you per request: storing and reading 60 files is more costly than reading one single fat file. See the Python sketch after this list.

  2. Keep a snapshot in Parquet, and keep merging the JSON files into it with a periodic Spark job (see the Spark sketch further below).
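
For step 1, here is a minimal sketch assuming pymongo and boto3; the connection string, database/collection names, bucket, and key layout are hypothetical placeholders:

```python
# Sketch only: batch MongoDB change-stream events into one JSON Lines file
# per minute and upload it to S3. Requires MongoDB running as a replica set.
import json
import time

import boto3
from pymongo import MongoClient

s3 = boto3.client("s3")
client = MongoClient("mongodb://localhost:27017")  # assumed connection string
collection = client["mydb"]["events"]              # hypothetical database/collection

BUCKET = "my-incremental-bucket"                   # hypothetical bucket name

with collection.watch(full_document="updateLookup") as stream:
    batch = []
    window_start = time.time()
    for change in stream:
        # Deletes carry no fullDocument, so fall back to the document key.
        doc = change.get("fullDocument") or change["documentKey"]
        batch.append(json.dumps(doc, default=str))  # default=str handles ObjectId/datetime
        # Flush roughly once a minute; a real job would also flush on an idle timer.
        if time.time() - window_start >= 60:
            key = f"incremental/{int(window_start)}.jsonl"
            # One "fat" file per minute instead of 60 tiny ones.
            s3.put_object(Bucket=BUCKET, Key=key,
                          Body="\n".join(batch).encode("utf-8"))
            batch, window_start = [], time.time()
```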

You may alternatively consider Delta Lake (from Databricks) -- I myself have not used it. The advantage is that you can keep writing to the data store in Delta format, and Delta Lake takes care of compacting the data periodically and cleaning up old files (with the VACUUM command), so you always read the latest Parquet.
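
A minimal sketch of the periodic merge job from step 2, written against Delta Lake (which stores Parquet underneath); the S3 paths, the merge key `_id`, and the Spark/Delta package versions are assumptions:

```python
# Sketch only: fold the minute-level JSON files into a Delta table snapshot.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("merge-incremental")
         # needs the delta-spark package, e.g. --packages io.delta:delta-spark_2.12:3.1.0
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

updates = spark.read.json("s3a://my-incremental-bucket/incremental/")  # hypothetical path
target_path = "s3a://my-incremental-bucket/snapshot/"                  # hypothetical path

if DeltaTable.isDeltaTable(spark, target_path):
    (DeltaTable.forPath(spark, target_path).alias("t")
        .merge(updates.alias("u"), "t._id = u._id")   # upsert on the Mongo _id
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
else:
    updates.write.format("delta").save(target_path)   # first run: create the snapshot
```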

hope this helps

CodePudding user response:

store the incremental data from MongoDB (in Kafka)

Use Debezium or the official MongoDB Kafka source (CDC) connector.

do not want to store the full MongoDB

You can disable the initial Debezium snapshot, but then updates/deletes to documents that existed before the connector started won't be processed correctly downstream.
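
As a rough illustration, a Debezium MongoDB connector could be registered through the Kafka Connect REST API roughly like this; the Connect host, connection string, topic prefix, and collection list are placeholders, and exact property names vary between Debezium versions:

```python
# Sketch only: register a Debezium MongoDB source connector via Kafka Connect REST.
import json

import requests

config = {
    "name": "mongo-cdc",                              # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "mongodb.connection.string": "mongodb://mongo:27017/?replicaSet=rs0",
        "topic.prefix": "shop",                        # topics become shop.<db>.<collection>
        "collection.include.list": "mydb.events",
        # Skip the full initial copy; only changes after startup are captured, so
        # updates/deletes of pre-existing documents arrive without a baseline.
        "snapshot.mode": "never",
    },
}

resp = requests.post("http://connect:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(config))
resp.raise_for_status()
```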

to S3 in Parquet/CSV format

Mongo documents typically aren't tabular, so do not use CSV; Parquet handles nested data much better.

You can use the Confluent S3 Sink Connector for this (sample configuration below).
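
A hedged example of what the matching S3 sink configuration might look like when writing Parquet; the bucket, region, topic name, and partitioning settings are assumptions, and Parquet output requires records that carry a schema (e.g. Avro with Schema Registry):

```python
# Sketch only: register a Confluent S3 Sink Connector that writes the CDC topic as Parquet.
import json

import requests

config = {
    "name": "s3-parquet-sink",                         # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "shop.mydb.events",                  # topic produced by the source connector
        "s3.bucket.name": "my-incremental-bucket",     # hypothetical bucket
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "flush.size": "1000",                          # records per output file
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "partition.duration.ms": "60000",              # one partition per minute
        "path.format": "'dt'=YYYY-MM-dd/'hour'=HH",
        "locale": "en-US",
        "timezone": "UTC",
        "timestamp.extractor": "Record",
    },
}

# Posted to the same Kafka Connect REST endpoint as the Debezium connector above.
resp = requests.post("http://connect:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(config))
resp.raise_for_status()
```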

without any paid solution

All of the above are free / open-source tools. S3 itself costs money unless you use an alternative such as MinIO. But you're still paying for servers to run all of these tools, and it will probably cost more to maintain and recover from failures than using AWS S3 directly.
