I'm trying to understand the benefits attained by HDFS splitting files into blocks.
I would have expected increased read performance resulting from parallelism across concurrent reads of multiple blocks. However, I keep reading papers saying that reads/writes from/to HDFS are actually slower across the board than a single file on a local file system.
What performance benefits are achieved by HDFS splitting files into blocks?
CodePudding user response:
I keep reading papers saying that reads/writes from/to HDFS are actually slower across the board than a single file on a local file system.
If you can read a file on one machine, then you don't need Hadoop. Hadoop is a distributed processing framework for huge datasets.
HDFS splits files into blocks because it's designed to handle files that are too big to be processed on one machine. It's not about increasing processing speed over small files; it's about giving you a way to process files that you couldn't process on one machine at all.
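To make the block splitting visible, here is a minimal sketch (not from the original answer) using the standard Hadoop FileSystem API to list the blocks a file was split into and the DataNodes that hold them. The path /data/big.csv is just a placeholder, and it assumes fs.defaultFS in your configuration points at the cluster's NameNode:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path; substitute a file that actually exists in HDFS.
        Path path = new Path("/data/big.csv");
        FileStatus status = fs.getFileStatus(path);

        // One BlockLocation per block: its offset, length, and the hosts storing replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

Each entry corresponds to one block (128 MB by default), which is exactly the unit a framework like MapReduce or Spark schedules a task against, ideally on one of the hosts listed.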
CodePudding user response:
In addition to the above, blocks fit well with replication, which provides fault tolerance and availability. HDFS replicates each block to a small number of physically separate machines, and in the event of a failure, only the affected blocks need to be re-replicated from their remaining replica locations to other live machines.
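As a hedged sketch of that point (again assuming the placeholder path /data/big.csv), the per-file replication factor can be read and changed through the same FileSystem API; the factor of 3 below is just the common default, not something stated in the answer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Placeholder path; use a real HDFS file.
        Path path = new Path("/data/big.csv");
        FileStatus status = fs.getFileStatus(path);
        System.out.println("current replication factor: " + status.getReplication());

        // Ask the NameNode to keep 3 replicas of each block of this file.
        // If a DataNode dies, the NameNode re-replicates only the blocks it held,
        // copying from the surviving replicas to other live machines.
        fs.setReplication(path, (short) 3);
        fs.close();
    }
}
```

Because replication happens per block rather than per file, recovery work after a failure is spread across many source and destination machines instead of funnelling one huge file copy through a single node.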