I have a MongoDB database that is updated every minute from various user PCs. I now want to store the data in an Amazon S3 bucket (preferably in Parquet, otherwise CSV). But I do not want to store the full MongoDB data in S3 every time; I only want to save the incremental data in S3.
I was thinking of using Kafka between MongoDB and S3, but there are two issues with that:
Issue 1: I do not know how to store the incremental data from MongoDB to S3 in Parquet/CSV format without a paid solution.
Issue 2: I do not know whether this is a good/practical solution.
Can anyone suggest a solution to achieve this kind of job, please?
CodePudding user response:
Parquet is most advantageous when you save data in large batches, say 10k rows or more. Since you will be saving incremental records every minute, you will probably get at most 1-4k records per batch, so writing each batch directly as Parquet will not help much. Instead:
Use JSON for the per-minute batches. The advantage is that you do not have to worry about special characters/encoding, column placement, nested columns, etc.; a parser such as Gson will take care of all of that. In other words, read from the MongoDB CDC (change stream) and write one JSON file at the end of every minute. Writing at the end of the minute ensures you get 1 fat file instead of 60 continuous small files. Keep in mind that S3 bills you per request, so storing 60 files and reading 60 files will cost more than reading a single fat file. A sketch of this step is shown below.
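A minimal sketch of this step, assuming Python with pymongo and boto3 instead of Gson; the bucket name, database/collection names, and key prefix are placeholders:

```python
import json
import time
from datetime import datetime, timezone

import boto3
from pymongo import MongoClient

# Placeholders -- replace with your connection string, database, collection and bucket.
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["events"]
s3 = boto3.client("s3")
BUCKET = "my-incremental-bucket"


def flush(batch, minute_key):
    """Write one newline-delimited JSON file per minute to S3."""
    if not batch:
        return
    body = "\n".join(json.dumps(doc, default=str) for doc in batch)
    s3.put_object(
        Bucket=BUCKET,
        Key=f"mongo-cdc/{minute_key}.json",
        Body=body.encode("utf-8"),
    )


batch = []
current_minute = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M")

# Tail the MongoDB change stream (CDC) and flush one fat file per minute.
with collection.watch(full_document="updateLookup") as stream:
    while stream.alive:
        change = stream.try_next()  # non-blocking poll
        if change is not None:
            batch.append(change.get("fullDocument") or change["documentKey"])
        minute_key = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M")
        if minute_key != current_minute:
            flush(batch, current_minute)
            batch, current_minute = [], minute_key
        if change is None:
            time.sleep(1)  # avoid a busy loop while idle
```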
Keep a snapshot in Parquet, and periodically merge the JSON files into the Parquet snapshot using a Spark job, as sketched below.
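A minimal sketch of such a merge job, assuming PySpark and the same hypothetical bucket layout as above; the dedup-by-_id step is simplistic and keeps one arbitrary row per document unless you first order by an update timestamp:

```python
from pyspark.sql import SparkSession

# Hypothetical paths -- adjust to your bucket layout.
JSON_PATH = "s3a://my-incremental-bucket/mongo-cdc/*.json"
SNAPSHOT_PATH = "s3a://my-incremental-bucket/snapshot-parquet/"
NEW_SNAPSHOT_PATH = "s3a://my-incremental-bucket/snapshot-parquet-new/"

spark = SparkSession.builder.appName("merge-json-into-parquet").getOrCreate()

# Read the accumulated per-minute JSON batches.
incremental = spark.read.json(JSON_PATH)

# Union with the existing Parquet snapshot (if there is one yet).
try:
    snapshot = spark.read.parquet(SNAPSHOT_PATH)
    merged = snapshot.unionByName(incremental, allowMissingColumns=True)
except Exception:  # first run: no snapshot exists yet
    merged = incremental

# Keep one row per MongoDB document.
deduped = merged.dropDuplicates(["_id"])

# Write to a fresh path, then swap it in as the new snapshot
# (Spark cannot safely overwrite a path it is also reading from).
deduped.write.mode("overwrite").parquet(NEW_SNAPSHOT_PATH)
```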
You may alternatively consider Delta Lake (the Databricks-originated open format); I myself have not used it. The advantage is that you can keep appending to the data store in Delta format, and Delta Lake takes care of table maintenance: a periodic OPTIMIZE pass compacts the small files and VACUUM cleans up old unreferenced ones, so you always get an up-to-date Parquet-backed table. A sketch follows.
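A minimal sketch, assuming the open-source delta-spark package (Delta Lake 2.x) and the same hypothetical paths; Delta stores the data as Parquet files under the hood:

```python
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

JSON_PATH = "s3a://my-incremental-bucket/mongo-cdc/*.json"
DELTA_PATH = "s3a://my-incremental-bucket/delta-table/"

builder = (
    SparkSession.builder.appName("append-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Append the latest JSON batches to the Delta table.
spark.read.json(JSON_PATH).write.format("delta").mode("append").save(DELTA_PATH)

# Periodic maintenance: compact small files, then remove old unreferenced files.
table = DeltaTable.forPath(spark, DELTA_PATH)
table.optimize().executeCompaction()  # merges many small files into fewer large ones
table.vacuum(168)                     # deletes unreferenced files older than 168 hours
```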
hope this helps