Alluxio & Spark

Time:05-31

I have a question about Alluxio working together with Spark. When Spark jobs are launched in a YARN cluster (without Alluxio), Spark executors run on the same nodes where the input data blocks reside, and this data locality is one of the reasons for Spark's high performance. I am not sure what added advantage Alluxio provides with Spark in a YARN cluster. From the Alluxio documentation, it looks to me as if Alluxio does the same thing: it caches the file blocks on a node, and the Spark executor is launched on that same node. Why should I use Alluxio with Spark and YARN? Can someone help me understand this concept better?

CodePudding user response:

Alluxio helps most when multiple Spark jobs work on the same data: instead of each job persisting its data to disk and reading it back, the jobs read from the Alluxio cache, which keeps hot data in memory on the worker nodes.
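As a minimal sketch of that sharing pattern (the paths, hostname, and port below are assumptions, not from the original post): load a dataset into Alluxio once, then have every subsequent Spark job read it through an `alluxio://` URI instead of re-reading it from the underlying store.

```shell
# Hypothetical paths/hostnames -- a sketch, not a verified setup.
# Pull the HDFS-backed file into Alluxio worker storage once:
./bin/alluxio fs load /datasets/events.parquet

# Each Spark job then reads the cached copy, e.g.:
#   spark.read.parquet("alluxio://alluxio-master:19998/datasets/events.parquet")
# so repeated jobs hit the Alluxio cache rather than the slower backing store.
```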

CodePudding user response:

Yes, you can definitely use Alluxio with Spark on YARN. In this case, you may need to run Alluxio outside of YARN. Alluxio will behave like HDFS in the sense that it reports to Spark the locations of the target data blocks stored on Alluxio workers, which influences Spark's data-locality-aware scheduling. You may refer to a presentation from a few years ago by an Alluxio PMC member: https://www.alluxio.io/resources/videos/community-office-hour-improving-data-locality-for-spark-jobs-on-kubernetes-using-alluxio/
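To make this concrete, a hedged sketch of wiring Spark on YARN to an externally running Alluxio cluster (the jar path, script name, hostname, and port are illustrative assumptions): put the Alluxio client jar on the driver and executor classpaths, then point the job at an `alluxio://` path so Spark can query block locations from Alluxio and schedule executors near the cached data.

```shell
# Assumed paths and hostnames -- adjust to your deployment.
spark-submit \
  --master yarn \
  --conf spark.driver.extraClassPath=/opt/alluxio/client/alluxio-client.jar \
  --conf spark.executor.extraClassPath=/opt/alluxio/client/alluxio-client.jar \
  my_job.py alluxio://alluxio-master:19998/datasets/events.parquet
```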
