EMR, Spark: proper place for a local shared cache

Time:08-10

In our Spark application, we store the local application cache in the /mnt/yarn/app-cache/ directory, which is shared between app containers on the same EC2 instance.

/mnt/... was chosen because it is on a fast NVMe SSD on r5d instances.

This approach worked well for several years on EMR 5.x: /mnt/yarn belongs to the yarn user, and app containers run as yarn, so they can create directories there.

In EMR 6.x things changed: containers now run as the hadoop user, which does not have write access to /mnt/yarn/.

The hadoop user can create directories in /mnt/, but yarn cannot, and I want to keep compatibility: the app should run successfully on both EMR 5.x and 6.x.
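To make the compatibility constraint concrete, here is a minimal sketch (in Python, for illustration only; the application itself may be on the JVM) of probing candidate directories and using the first one the current container user can create and write to. The function name `pick_cache_dir` and the second candidate path `/mnt/app-cache` are hypothetical and not part of the original setup.

```python
import os
import tempfile


def pick_cache_dir(candidates):
    """Return the first candidate directory that can be created and written to.

    Returns None if no candidate is usable by the current user.
    """
    for path in candidates:
        try:
            # Create the directory if it is missing; a no-op if it already exists.
            os.makedirs(path, exist_ok=True)
            # Writing a throwaway file verifies real write access for this user.
            fd, probe = tempfile.mkstemp(dir=path)
            os.close(fd)
            os.remove(probe)
            return path
        except OSError:
            # Not creatable or not writable for this user: try the next one.
            continue
    return None


# Hypothetical usage (paths are assumptions):
#   pick_cache_dir(["/mnt/yarn/app-cache", "/mnt/app-cache"])
# Given the permissions described above, yarn on EMR 5.x would get the first
# path and hadoop on EMR 6.x would fall through to the second.
```

Since all containers on a given cluster run as the same user, every container on an instance should resolve to the same directory under this scheme.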

java.io.tmpdir doesn't work either: it is different for each container.

What is the proper place to store the cache on the NVMe SSD (/mnt, /mnt1) so that it is accessible to all containers and works on both EMR 5.x and 6.x?

CodePudding user response:

On your EMR cluster, you can add the yarn user to the superuser group; by default, this group is called supergroup. You can confirm whether this is the right group by checking the dfs.permissions.superusergroup property in the hdfs-site.xml file.
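As a rough sketch of that check, the property can be read from the file programmatically. The helper name `read_hadoop_property` is hypothetical, and the file location in the comment is an assumption about where the cluster keeps its Hadoop configuration.

```python
import xml.etree.ElementTree as ET


def read_hadoop_property(xml_text, name):
    """Return the value of a property from Hadoop-style XML config text.

    Returns None if the property is not set in the given text.
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Hypothetical usage (the path is an assumption):
#   with open("/etc/hadoop/conf/hdfs-site.xml") as f:
#       group = read_hadoop_property(f.read(), "dfs.permissions.superusergroup")
```

If the function returns None, the property is not overridden in that file and the default group name applies.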

You could also try modifying the following HDFS properties (in the file named above): dfs.permissions.enabled or dfs.datanode.data.dir.perm.
