Scala Amazon S3 streaming object

I am looking to use Scala to get faster performance when accessing and downloading an Amazon S3 file. The file comes in as an InputStream and it is large (over 30 million rows).

I have tried this in Python (pandas), but it is too slow. I am hoping to increase the speed with Scala.

So far I am doing this, but it is too slow. Have I hit a bottleneck in the stream itself, such that I cannot read data from it any faster than the code below does?

val obj = amazonS3Client.getObject(bucket_name, file_name)
val reader = new BufferedReader(new InputStreamReader(obj.getObjectContent()))

var list_of_lines = List.empty[String]
var line = reader.readLine
while(line != null) {
  list_of_lines = list_of_lines ::: List(line)
  line = reader.readLine
}

I'm looking for serious speed improvement compared to the above approach. Thanks.

CodePudding user response：

I suspect that your performance bottleneck is appending to a (linked) List in that loop:

list_of_lines = list_of_lines ::: List(line)

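Each ::: copies the entire left-hand list before attaching the new element, so the cost of an append grows with the number of lines already read. With 30 million lines, that adds up to roughly 30,000,000^2 / 2 ≈ 4.5 × 10^14 element copies, so it should take a few hundred trillion times longer to process all the lines than it takes to process one line. If the first iteration of that loop is 1ns, then this should take somewhere around five days to execute.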
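To make that estimate concrete, here is a rough back-of-the-envelope sketch. It assumes that appending to a list that already holds i elements copies those i elements, and that one copy costs about 1 ns (a made-up round figure, not a measurement). The names AppendCost and totalCopies are invented for this illustration.

```scala
// Back-of-the-envelope cost of building a List by repeated `:::` appends.
// Assumption: appending to a list of i elements copies those i elements,
// so n appends copy 0 + 1 + ... + (n - 1) = n * (n - 1) / 2 elements in total.
object AppendCost {
  def totalCopies(n: Long): Long = n * (n - 1) / 2

  def main(args: Array[String]): Unit = {
    val n = 30000000L
    val copies = totalCopies(n)      // 449999985000000
    val assumedNanosPerCopy = 1.0    // assumption, not a measurement
    val days = copies * assumedNanosPerCopy / 1e9 / 86400
    println(s"element copies: $copies")
    println(f"approx. days at 1 ns per copy: $days%.1f")
  }
}
```

Under those assumptions the total works out to a little over five days, which is why the choice of collection operation matters far more here than the speed of the S3 stream.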

Switching to prepending to the List and then reversing at the end should improve the speed of your loop by a factor of more than a million:

while(line != null) {
  list_of_lines = line :: list_of_lines
  line = reader.readLine
}
list_of_lines = list_of_lines.reverse
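For reference, a self-contained sketch of the prepend-then-reverse loop that can be run without S3. The ByteArrayInputStream is only a stand-in for obj.getObjectContent(); the UTF-8 charset and the names PrependThenReverse and readAllLines are assumptions made for this example.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

object PrependThenReverse {
  // Reads every line from `in`: each prepend is O(1), and the single
  // reverse at the end is O(n), so the whole read is linear in the line count.
  def readAllLines(in: InputStream): List[String] = {
    // UTF-8 is an assumption; use whatever encoding the S3 object really has.
    val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
    try {
      var acc = List.empty[String]
      var line = reader.readLine()
      while (line != null) {
        acc = line :: acc
        line = reader.readLine()
      }
      acc.reverse
    } finally reader.close()
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for obj.getObjectContent(): any InputStream works here.
    val fake = new ByteArrayInputStream("a,1\nb,2\nc,3\n".getBytes(StandardCharsets.UTF_8))
    println(readAllLines(fake)) // List(a,1, b,2, c,3)
  }
}
```

Against the real object you would pass obj.getObjectContent() instead of the in-memory stream.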

List is also notoriously memory-inefficient, so for this many elements, it may also be worth doing something like this (which is also more idiomatic Scala):

import scala.io.{ Codec, Source }

val obj = amazonS3Client.getObject(bucket_name, file_name)

val source = Source.fromInputStream(obj.getObjectContent())(Codec.defaultCharsetCodec)

val lines = source.getLines().toVector

Vector being more memory-efficient than List should dramatically reduce GC thrashing.
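If the lines do not all need to be in memory at once, it may be possible to avoid the collection entirely: getLines() returns a lazy Iterator, so the stream can be consumed line by line. The sketch below is only an illustration under assumptions: it needs Scala 2.13+ (for scala.util.Using), it assumes UTF-8 and a hypothetical two-column "name,number" row layout, and the names StreamLines and summarize are invented. A ByteArrayInputStream stands in for obj.getObjectContent().

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
import scala.io.{Codec, Source}
import scala.util.Using

object StreamLines {
  // Returns (row count, sum of the second column) while holding only one
  // line in memory at a time. The "name,number" layout is hypothetical;
  // a malformed row would throw here rather than be skipped.
  def summarize(in: InputStream): (Long, Long) =
    Using.resource(Source.fromInputStream(in)(Codec.UTF8)) { source =>
      source.getLines().foldLeft((0L, 0L)) { case ((rows, total), line) =>
        val value = line.split(',')(1).trim.toLong
        (rows + 1, total + value)
      }
    }

  def main(args: Array[String]): Unit = {
    val fake = new ByteArrayInputStream("a,1\nb,2\nc,3\n".getBytes(StandardCharsets.UTF_8))
    println(summarize(fake)) // (3,6)
  }
}
```

Using.resource closes the Source (and the underlying stream) when the fold finishes, including when a row fails to parse.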

CodePudding user response：

The best way to achieve better performance is to use the TransferManager provided by the AWS Java SDKs. It's a high-level file transfer manager that will automatically parallelise downloads. I'd recommend using SDK v2, but the same can be done with SDK v1. Be aware, though, that SDK v1 comes with limitations: only multipart files can be downloaded in parallel.

You need the following dependency (assuming you are using sbt with Scala). Note, though, that there's no benefit to using Scala over Java with the TransferManager.

libraryDependencies += "software.amazon.awssdk" % "s3-transfer-manager" % "2.17.243-PREVIEW"

Example (Java):

S3TransferManager transferManager = S3TransferManager.create();
FileDownload download = 
    transferManager.downloadFile(b -> b.destination(Paths.get("myFile.txt"))
                                       .getObjectRequest(req -> req.bucket("bucket").key("key")));
download.completionFuture().join();

I recommend reading more on the topic here: Introducing Amazon S3 Transfer Manager in the AWS SDK for Java 2.x