What is the difference between FileInputStream/FileOutputStream and FSDataInputStream/FSDataOutputStream?


I am trying to understand the difference between FileInputStream and FSDataInputStream, and between FileOutputStream and FSDataOutputStream.

I am trying to read a file from an S3 bucket, apply some formatting changes, and then write it into another S3 bucket in a Spark Java application.

I am confused about whether I need to use FileInputStream or FSDataInputStream to read the files, and whether to use FileOutputStream or FSDataOutputStream to write them into the S3 bucket.

Could someone explain how and where to use them appropriately, with some examples?

Your help is greatly appreciated.

Thanks, Babu

CodePudding user response:

You can use either, it just depends on what you need.

They are both just stream implementations. Ultimately, what you will be doing is taking an input stream from one bucket and writing it to the output stream of another.

FileInputStream and FileOutputStream are concrete components that provide the functionality to read bytes from, and write bytes to, a file on the local filesystem.

FSDataInputStream and FSDataOutputStream are concrete decorators of an input stream or output stream, meaning they wrap the underlying stream and add functionality to it, such as reading and writing primitives and providing buffered access.

Which one to choose? Do you need the decorations provided by FSDataInputStream and FSDataOutputStream, or are FileInputStream and FileOutputStream sufficient?
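
To make the contrast concrete, here is a minimal sketch, assuming the Hadoop FileSystem API is on the classpath and using a hypothetical local path and s3a:// path: FileInputStream is a plain JDK stream over a local file, while FSDataInputStream is what a Hadoop FileSystem hands back and adds things like seeking.

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamKinds {
    public static void main(String[] args) throws Exception {
        // FileInputStream: reads a file on the local filesystem, sequentially.
        try (InputStream local = new FileInputStream("/tmp/local-file.txt")) { // hypothetical path
            int first = local.read();
        }

        // FSDataInputStream: returned by Hadoop's FileSystem abstraction, so the
        // same code works against HDFS, s3a://, file://, etc., and supports seek().
        Configuration conf = new Configuration();
        Path s3Path = new Path("s3a://my-bucket/input/data.txt"); // hypothetical bucket/key
        FileSystem fs = FileSystem.get(s3Path.toUri(), conf);
        try (FSDataInputStream in = fs.open(s3Path)) {
            in.seek(128);              // random access to an offset
            int byteAt128 = in.read();
        }
    }
}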

Personally, I would look to use Readers and Writers as demonstrated here:

How can I read an AWS S3 File with Java?

private final AmazonS3 amazonS3Client = AmazonS3ClientBuilder.standard().build();

private Collection<String> loadFileFromS3() {
    // Stream the object content directly from S3 and collect its lines;
    // try-with-resources closes the S3 object and readers when done.
    try (final S3Object s3Object = amazonS3Client.getObject(BUCKET_NAME, FILE_NAME);
         final InputStreamReader streamReader = new InputStreamReader(s3Object.getObjectContent(), StandardCharsets.UTF_8);
         final BufferedReader reader = new BufferedReader(streamReader)) {
        return reader.lines().collect(Collectors.toSet());
    } catch (final IOException e) {
        log.error(e.getMessage(), e);
        return Collections.emptySet();
    }
}

CodePudding user response:

FSDataInputStream and FSDataOutputStream are classes in hadoop-common.

FSDataInputStream

FSDataInputStream adds high-performance APIs for reading data at specific offsets into byte arrays (PositionedReadable) or into ByteBuffers. These are used extensively in libraries that read files non-sequentially (more random IO than sequential), such as Parquet and ORC, and FileSystem implementations often provide highly efficient implementations of them. These APIs are not relevant for a simple file copy unless you really are trying for maximum performance, have opened the same source file on multiple streams, and are fetching blocks in parallel across them. Distcp does things like this, which is why it can overload networks if you try hard.

Full specification, including PositionedReadable: fsdatainputstream
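
As an illustration, here is a minimal sketch of a positioned read, assuming the s3a connector is configured and using a hypothetical object path; readFully(position, buffer, offset, length) comes from PositionedReadable and does not move the stream's main position.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionedReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("s3a://my-bucket/data/part-00000.parquet"); // hypothetical object
        FileSystem fs = FileSystem.get(path.toUri(), conf);

        byte[] chunk = new byte[1024];
        try (FSDataInputStream in = fs.open(path)) {
            // Read 1 KB starting at byte offset 4096 without changing the
            // stream's current position; multiple threads can issue such
            // reads against the same stream.
            in.readFully(4096L, chunk, 0, chunk.length);
        }
    }
}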

FSDataOutputStream

FSDataOutputStream doesn't add that much to the normal DataOutputStream; the most important interface is Syncable, whose hflush and hsync calls have specific guarantees about durability, as in "when they return, the data has been persisted to HDFS or the other filesystem, and for hsync, all the way to disk". If you are implementing a database like HBase, you need these calls and their guarantees. If you aren't, you don't.

Full specification, including Syncable: outputstream
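
For illustration, a minimal sketch of those calls, assuming a hypothetical HDFS path (object stores such as S3 generally do not give hsync its full guarantee):

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("hdfs:///tmp/wal/edit-log"); // hypothetical path
        FileSystem fs = FileSystem.get(path.toUri(), conf);

        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("record-1\n".getBytes(StandardCharsets.UTF_8));
            out.hflush(); // flushed out of the client: visible to new readers
            out.hsync();  // stronger: persisted to disk, where the filesystem supports it
        }
    }
}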

Copying files in Spark at scale

If you want to copy files efficiently in Spark, open the source and destination files, read into a buffer of a few MB, then write it back out. You can distribute this work across the cluster for better parallelism, as this example does: https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/applications/CloudCp.scala. A minimal single-process sketch is shown after the list below.

  1. If you just want to copy one or two files, just do it in a single process, maybe multithreaded.
  2. If you are really seeking performance against S3, take the list of files to copy, schedule the largest few files first so they don't hold you up at the end, then randomize the rest of the list to avoid creating hotspots in the S3 bucket.
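
Here is the single-process version as a minimal sketch, assuming the s3a connector is configured and using hypothetical bucket names; it just streams between an FSDataInputStream and an FSDataOutputStream with a few-MB buffer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleProcessCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path src = new Path("s3a://source-bucket/input/file.csv"); // hypothetical
        Path dst = new Path("s3a://dest-bucket/output/file.csv");  // hypothetical

        FileSystem srcFs = FileSystem.get(src.toUri(), conf);
        FileSystem dstFs = FileSystem.get(dst.toUri(), conf);

        byte[] buffer = new byte[4 * 1024 * 1024]; // a few MB, as suggested above
        try (FSDataInputStream in = srcFs.open(src);
             FSDataOutputStream out = dstFs.create(dst, true)) {
            int bytesRead;
            while ((bytesRead = in.read(buffer)) > 0) {
                out.write(buffer, 0, bytesRead);
            }
        }
    }
}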