AWS EMR: Does the master node store HDFS data in an EMR cluster?

Time:03-29

Master node: does this node store HDFS data in an AWS EMR cluster? Task node: if this node does not store HDFS data, is it a purely computational node? In that case, does Hadoop transfer data to the task node? Doesn't this defeat the data locality advantage of the computation?

CodePudding user response:

(Other than the edge case of a master-only cluster with no core or task instances...)

The master instance does not store any HDFS data, nor does it act as a computational node. The master instance runs services like the YARN ResourceManager and HDFS NameNode.

The only nodes that store data are those that run HDFS DataNode, which are only the core instances.

The core and task instances both run YARN NodeManager and thus are the "computational nodes".
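The role split described above can be written down as a small lookup table. This is only an illustrative sketch: the daemon placement comes from this answer, while the names `ROLES`, `stores_hdfs_data`, and `runs_computation` are made up for the example and are not part of any EMR or Hadoop API.

```python
# Illustrative model only: which daemons run on each EMR instance type,
# as described in the answer (master-only clusters are ignored here).
ROLES = {
    "master": {"YARN ResourceManager", "HDFS NameNode"},
    "core": {"HDFS DataNode", "YARN NodeManager"},
    "task": {"YARN NodeManager"},
}


def stores_hdfs_data(instance_type: str) -> bool:
    """An instance stores HDFS blocks only if it runs a DataNode."""
    return "HDFS DataNode" in ROLES[instance_type]


def runs_computation(instance_type: str) -> bool:
    """An instance runs YARN containers only if it runs a NodeManager."""
    return "YARN NodeManager" in ROLES[instance_type]


if __name__ == "__main__":
    for instance_type in ROLES:
        print(
            instance_type,
            "stores HDFS data:", stores_hdfs_data(instance_type),
            "| runs computation:", runs_computation(instance_type),
        )
```

Reading the table this way makes the asymmetry explicit: core instances do both jobs, task instances only compute, and the master does neither.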

Regarding your question, "in this case does hadoop transfer to task node", I assume that you are asking whether or not Hadoop transfers (HDFS) data to the task instances so that they may perform computations on HDFS data. In a sense, yes, task instances may read HDFS blocks remotely from core instances where the blocks are stored.

It's true that this means that task instances can never take advantage of data locality for HDFS data, but there are many cases where this does not matter anyway, such as for tasks that read shuffle data from other nodes, or tasks that read data from remote storage (e.g., Amazon S3). Furthermore, depending upon the core instance type being used, keep in mind that even the HDFS blocks might be stored in remote storage (i.e., EBS). That said, even when your task instances are reading data from a remote DataNode or a remote service like S3 or EBS, the difference might not be noticeable enough for you to need to worry about data locality.
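To make the locality point concrete, here is a toy sketch of how a read could be classified. It is not how YARN or HDFS schedules work internally; the function `read_locality` and the host names are hypothetical, and the only assumption is the one stated above: a read is local when the reading host also holds a replica of the block.

```python
def read_locality(reader_host: str, replica_hosts: set[str]) -> str:
    """Classify an HDFS block read as 'node-local' or 'remote'.

    reader_host   -- host running the task (hypothetical name)
    replica_hosts -- hosts whose DataNode holds a replica of the block
    """
    return "node-local" if reader_host in replica_hosts else "remote"


if __name__ == "__main__":
    # Hypothetical cluster: only core instances run a DataNode,
    # so only core hosts can ever appear in replica_hosts.
    replicas = {"core-1", "core-2"}

    print(read_locality("core-1", replicas))  # a core instance holding the block
    print(read_locality("core-3", replicas))  # a core instance without a replica
    print(read_locality("task-1", replicas))  # a task instance: never a replica holder
```

Because a task instance never runs a DataNode, it can never be in the replica set, so every HDFS read from a task instance falls into the remote case. A core instance that holds no replica of the block reads remotely as well.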
