I want to deploy a Spark cluster with YARN as the cluster manager. This Spark cluster needs to read data from an external HDFS filesystem belonging to an existing Hadoop ecosystem that also has its own YARN (however, I am not allowed to use that Hadoop cluster's YARN).
My questions are:
- Is it possible to run a Spark cluster using an independent YARN, while still reading data from an external HDFS filesystem?
- If yes, is there any downside or performance penalty to this approach?
- If no, can I run Spark as a standalone cluster instead, and will there be any performance issues?
Assume both the Spark cluster and the Hadoop cluster are running in the same data center.
CodePudding user response:
> using an independent YARN, while still reading data from an external HDFS filesystem
Yes. Configure `yarn-site.xml` to point at your independent YARN cluster (typically via `HADOOP_CONF_DIR`), and use the external NameNode's full FQDN in the URI when referring to external file locations, such as `hdfs://namenode-external:8020/file/path`.
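A minimal sketch of what this can look like from the Spark side, assuming the job is submitted with `--master yarn` while `HADOOP_CONF_DIR` points at the independent cluster's configs; the hostname `namenode-external` and port `8020` are placeholders for your external NameNode:

```scala
import org.apache.spark.sql.SparkSession

// Runs on the independent YARN cluster (resolved from the yarn-site.xml
// found under HADOOP_CONF_DIR at submit time), while reading from the
// external HDFS by fully qualifying the URI.
val spark = SparkSession.builder()
  .appName("external-hdfs-read")
  .getOrCreate() // master ("yarn") is supplied via spark-submit --master yarn

// Fully qualified path: scheme + external NameNode FQDN + port.
val df = spark.read.textFile("hdfs://namenode-external:8020/file/path")

println(df.count())
```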
> any downside or performance penalty to this approach
Yes. All reads will be remote rather than node-local, so you lose HDFS data locality: throughput is bounded by the cross-cluster network rather than by local disks. Within the same data center, however, this is effectively the same kind of performance degradation as reading from S3 or other remote storage.
> can I run Spark as a standalone cluster
You could, or you could use Kubernetes if that's available, but both are pointless, IMO, if there's already a YARN cluster (with enough resources) available. Reading from the external HDFS works the same way in either mode; see the sketch below.
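For completeness, a minimal sketch of the standalone variant, assuming a standalone master reachable at `spark-master:7077` (a placeholder hostname): the only change from the YARN version is the master URL; the fully qualified external HDFS read is unchanged.

```scala
import org.apache.spark.sql.SparkSession

// Standalone alternative: only the master URL changes; the external
// HDFS path is still fully qualified. "spark-master" and
// "namenode-external" are placeholder hostnames.
val spark = SparkSession.builder()
  .appName("standalone-external-hdfs")
  .master("spark://spark-master:7077")
  .getOrCreate()

val df = spark.read.textFile("hdfs://namenode-external:8020/file/path")
println(df.count())
```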