I'm using the HDFS sink connector to consume data from Kafka into HDFS.
The Kafka connector writes data every 10 minutes, and sometimes the written files are quite small; their sizes vary from 2MB to 100MB. So the written files seem to waste my HDFS storage, since the block size is 256MB.
A directory is created per date, so I thought it would be good to merge the many small files into one big file in a daily batch job. (I expect HDFS will automatically split one large file into block-sized chunks as a result.)
I know there are many answers that say we could use Spark's coalesce(1) or repartition(1), but I'm worried about an OOM error if I read the whole directory and use those functions; it might be 90GB~100GB if I read every file.
Will 90~100GB in HDFS be allowed? Do I even need to worry about it? Could anyone tell me the best practice for merging small HDFS files? Thanks!
CodePudding user response:
So, the written files actually waste my HDFS storage since each block size is 256MB.
HDFS doesn't "fill out" the unused part of a block. So a 2MB file only uses 2MB on disk (well, 6MB if you account for 3x replication). The main concern with small files on HDFS is NameNode memory: the NameNode keeps the metadata for every file and block in memory, so millions of small files can cause problems there.
I worried about OOM error if I read the whole directory and use those functions
Spark may be an in-memory processing framework, but it still works when the data doesn't fit into memory. In that situation, processing spills over to disk and runs a bit slower.
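Also, rather than coalesce(1) (which funnels all the data through a single task), you can repartition to just enough partitions that each output file is roughly one block. A minimal sketch, where the 256MB block size and the HDFS paths are assumptions, and target_partitions is a hypothetical helper, not part of the Spark API:

```python
import math

BLOCK_SIZE = 256 * 1024 * 1024  # assumed HDFS block size: 256MB

def target_partitions(total_bytes, block_size=BLOCK_SIZE):
    """Number of output files needed so each is roughly one block."""
    return max(1, math.ceil(total_bytes / block_size))

# Example: a 100GB daily directory would need 400 output files, not 1:
print(target_partitions(100 * 1024**3))  # -> 400

# Hedged PySpark usage (assumes an existing SparkSession `spark` and
# that you have measured the directory size as total_dir_size_bytes):
#   df = spark.read.parquet("hdfs:///topics/my-topic/2023-01-01")  # assumed path
#   df.repartition(target_partitions(total_dir_size_bytes)) \
#     .write.mode("overwrite").parquet("hdfs:///merged/2023-01-01")
```

With repartition(n) the work is spread across n tasks, so no single executor has to hold the whole 90~100GB.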
Will 90~100GB in HDFS be allowed?
That is absolutely fine - this is big data, after all. As you noted, the large file will be split into 256MB blocks in the background (but you won't see this unless you inspect it, e.g. with hadoop fsck /path -files -blocks).