What is the recommended DefaultFS (File system) for Hadoop on ephemeral Dataproc clusters?

Time: 06-26

What is the recommended DefaultFS (file system) for Hadoop on Dataproc? Are there any benchmarks or considerations available around using GCS vs. HDFS as the default file system?

I was also testing this out and discovered that when I set the DefaultFS to a gs:// path, the Hive scratch files get created on both the HDFS and GCS paths. Is this happening synchronously and adding to latency, or does the write to GCS happen after the fact?

I would appreciate any guidance or references on this. Thank you.

PS: These are ephemeral Dataproc clusters that will use GCS for all persistent data.
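
For context, a minimal sketch of how the DefaultFS can be switched at cluster creation time; the cluster name, region, and bucket below are placeholders, not the actual ones:

    # Sketch: create a Dataproc cluster whose default filesystem is GCS.
    # "core:"-prefixed cluster properties are written into core-site.xml.
    import subprocess

    cluster = "ephemeral-etl-cluster"      # placeholder cluster name
    region = "us-central1"                 # placeholder region
    default_fs = "gs://my-dataproc-data"   # placeholder bucket

    subprocess.run(
        [
            "gcloud", "dataproc", "clusters", "create", cluster,
            "--region", region,
            "--properties", f"core:fs.defaultFS={default_fs}",
        ],
        check=True,
    )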

CodePudding user response:

HDFS is faster. There should already be public benchmarks for that, or it can simply be taken as a given, because GCS is networked object storage whereas HDFS sits on disks attached directly to the Dataproc VMs.

"Recommended" would be persistent storage, though, so GCS, but maybe only after finalizing the data in the applications. For example, you might not want Hive scratch files in GCS since they'll never be used outside of the current query session, but you would want Spark checkpoints if you're running periodic batch jobs that scale down the HDFS cluster in between executions

CodePudding user response:

I would say the default (HDFS) is the recommended choice. Typically, the input and output data of Dataproc jobs are persisted outside the cluster in GCS or BigQuery, and the cluster is used for compute and intermediate data. That intermediate data is stored either directly on local disks or through HDFS, which ultimately also lands on local disks. After the job is done, you can safely delete the cluster and pay only for the storage of the input and output data, which saves cost.
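
As a rough illustration of that lifecycle, sketched with gcloud calls (the cluster name, region, bucket, and job path are all placeholders):

    # Sketch of the ephemeral pattern: input/output live in GCS,
    # the cluster exists only for the duration of the job.
    import subprocess

    cluster = "ephemeral-etl-cluster"   # placeholder cluster name
    region = "us-central1"              # placeholder region
    bucket = "gs://my-bucket"           # placeholder bucket

    # 1. Create the cluster just for this run.
    subprocess.run(
        ["gcloud", "dataproc", "clusters", "create", cluster, "--region", region],
        check=True,
    )
    try:
        # 2. The job reads from and writes to GCS; only shuffle/intermediate
        #    data touches the cluster's local disks and HDFS.
        subprocess.run(
            [
                "gcloud", "dataproc", "jobs", "submit", "pyspark",
                f"{bucket}/jobs/etl.py",
                "--cluster", cluster, "--region", region,
                "--", f"{bucket}/raw/", f"{bucket}/out/",
            ],
            check=True,
        )
    finally:
        # 3. Delete the cluster; only the GCS input/output storage keeps costing.
        subprocess.run(
            ["gcloud", "dataproc", "clusters", "delete", cluster,
             "--region", region, "--quiet"],
            check=True,
        )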

Also, HDFS usually has lower latency for intermediate data, especially with lots of small files and metadata operations such as directory renames, while GCS offers better throughput for large files.

But when using HDFS, you need to provision sufficient disk space (at least 1 TB per node) and consider using local SSDs. See https://cloud.google.com/dataproc/docs/support/spark-job-tuning#optimize_disk_size for more details.
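
For example, disk capacity and local SSDs are set at cluster creation with the gcloud flags below; the sizes and names are just placeholders, so size them per the linked guide:

    # Sketch: provisioning worker disks and local SSDs for HDFS-heavy jobs.
    import subprocess

    subprocess.run(
        [
            "gcloud", "dataproc", "clusters", "create", "hdfs-heavy-cluster",  # placeholder name
            "--region", "us-central1",              # placeholder region
            "--worker-boot-disk-size", "1000GB",    # roughly 1 TB per worker for HDFS/shuffle
            "--num-worker-local-ssds", "2",         # local SSDs for faster shuffle and HDFS I/O
        ],
        check=True,
    )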
