How to manage JSON in Hadoop HDFS

Time:11-16

How does Hadoop HDFS manage JSON files?

Assuming that some JSON files are stored in HDFS and that each of these JSONs is different from the others, I would like to produce output created through a query across them, the way MongoDB does.

For example, I show you this pseudo-code:

FOR EACH json IN hdfs:
    name = json.NAME
    IF json HAS this_attribute:
        x = json.this_attribute.value

CREATE A CSV THAT CONTAINS ALL INFO REQUIRED (name, x)

RETURN CSV
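For clarity, the pseudo-code above can be sketched in plain Python. This is only an illustration of the intended logic, not an HDFS-aware program: it assumes the JSON files have been fetched to a local directory (e.g. via `hdfs dfs -get`), and the field names `name` and the attribute key are placeholders matching the pseudo-code.

```python
import csv
import json
from pathlib import Path


def jsons_to_csv(json_dir: str, csv_path: str, attribute: str) -> None:
    """Scan a directory of JSON files; for each one, take its 'name'
    and, if present, the requested attribute, then write all rows to CSV."""
    rows = []
    for path in Path(json_dir).glob("*.json"):
        doc = json.loads(path.read_text())
        name = doc.get("name")            # name = json.NAME
        if attribute in doc:              # IF json HAS this_attribute
            rows.append((name, doc[attribute]))

    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", attribute])
        writer.writerows(rows)            # CSV with all info required
```

The point is that nothing in HDFS itself does this loop for you; some process (a script, Spark, Hive, etc.) has to read each file and apply the filter.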

In MongoDB, producing this output is very easy, but I'm working on a project where using MongoDB is not possible.

CodePudding user response:

I think the easiest tool for you to use with HDFS is Spark. It gives you a rich set of tooling, including columnar file formats (such as Parquet and ORC) that perform much better than storing information as plain text, CSV, or JSON. I suggest that you investigate non-text file formats when working with Big Data.
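As a rough sketch of that approach, assuming PySpark is available and the files live under a hypothetical path like `hdfs:///data/json/` (the column names below mirror the pseudo-code and are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-csv").getOrCreate()

# Spark infers one unified schema across all the JSON files it reads;
# attributes missing from a given file simply become null in that row.
df = spark.read.json("hdfs:///data/json/")

# The pseudo-code's "IF json HAS this_attribute" becomes a null filter.
result = df.select("name", "this_attribute") \
           .where(df.this_attribute.isNotNull())
result.write.mode("overwrite").csv("hdfs:///data/output_csv")

# For repeated querying, converting once to a columnar format pays off:
df.write.mode("overwrite").parquet("hdfs:///data/json_parquet")
```

Note that `spark.read.json` expects line-delimited JSON by default; for pretty-printed multi-file JSON you would pass `multiLine=True`.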

CodePudding user response:

Hadoop/HDFS doesn't "manage" any files beyond placing their blocks.

If you want to run queries against the data, you need to use a tool like Hive, Presto, Drill, Spark, Flink, etc., or you can alter your file-upload procedure to write to HBase instead.

"each of these JSONs is different from the others"

Most query tools expect at least semi-structured data, so keeping the keys similar across your JSON files would be best. If you really need to store arbitrary, schema-free JSON objects, then a document database like Mongo would be preferable. (Hadoop is not an alternative to Mongo.)
