I am fetching JSON data from several different APIs. I want to store the data in HDFS and then load it into MongoDB.
Do I need to convert it to Avro, SequenceFile, Parquet, etc., or can I simply store it as plain JSON and load it into the database later?
I know that if I convert it to another format it will be better distributed and compressed, but then how would I load an Avro file into MongoDB? MongoDB only accepts JSON. Would I need an extra step to read the Avro back and convert it to JSON?
CodePudding user response:
How large is the data you're fetching? If each file is less than 128 MB (compressed or not), HDFS probably isn't the right place for it: 128 MB is the default HDFS block size, and lots of small files put unnecessary load on the NameNode.
To answer the question, the format doesn't really matter. You can use Spark SQL to read any Hadoop format (or plain JSON) from HDFS and write it into Mongo via the MongoDB Spark Connector (and vice versa).
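A minimal PySpark sketch of that path, assuming the spark-avro package and the MongoDB Spark Connector (10.x option names) are on the classpath; the HDFS path, database, and collection names are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-to-mongo")
    # Placeholder connection string; swap in your own cluster and credentials.
    .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
    .getOrCreate()
)

# Reading Avro requires the spark-avro package; for plain JSON files
# you could just use spark.read.json("hdfs:///data/events/") instead.
df = spark.read.format("avro").load("hdfs:///data/events/")

(
    df.write
    .format("mongodb")               # connector 10.x; older 3.x releases use "mongo"
    .mode("append")
    .option("database", "mydb")      # placeholder database name
    .option("collection", "events")  # placeholder collection name
    .save()
)
```

Spark handles the Avro-to-document conversion for you, so there is no separate "convert back to JSON" step in your own code.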
Alternatively, you can write the data to Kafka first, then use Kafka Connect to sink it to both HDFS and MongoDB at the same time.
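A hedged sketch of that second option: register two sink connectors against the Kafka Connect REST API so one topic feeds both stores. The connector class names and option keys below follow the Confluent HDFS sink and the official MongoDB sink, but verify them against the connector versions you actually install; the topic, URLs, and database/collection names are placeholders.

```python
import requests

CONNECT_URL = "http://localhost:8083/connectors"  # placeholder Connect worker address

hdfs_sink = {
    "name": "hdfs-sink",
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": "api-events",              # placeholder topic name
        "hdfs.url": "hdfs://namenode:8020",  # placeholder NameNode address
        "flush.size": "10000",
    },
}

mongo_sink = {
    "name": "mongo-sink",
    "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "topics": "api-events",
        "connection.uri": "mongodb://localhost:27017",  # placeholder URI
        "database": "mydb",
        "collection": "events",
    },
}

# Register both connectors; each one consumes the same topic independently.
for connector in (hdfs_sink, mongo_sink):
    requests.post(CONNECT_URL, json=connector).raise_for_status()
```

The upside of this design is that HDFS and MongoDB stay decoupled: each sink keeps its own offsets, so a slow or failed load into one store doesn't block the other.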