What is the fastest way to get an exact count of rows for a 100GB CSV file stored on Amazon S3, without using Athena or any Fargate or EC2 VM? I can't use Athena, because the CSV file isn't clean enough for it. I can't use Fargate or EC2 VMs, because I need a purely serverless solution. I can't use third-party services like Snowflake (native AWS services only).
Also, 100GB is too large to fit within a Lambda function's /tmp storage (capped at 10GB). I could try running something like DuckDB (or any other streaming database engine) on a Lambda and scanning the entire file with a SELECT COUNT(*) FROM "s3://myBucket/myFile.csv" query, but the Lambda is quite likely to time out: its read bandwidth from S3 is roughly 100MB/s at best, and it cannot run for more than 15 minutes (900s).
I know the approximate size of the file.
Note: I have a rough estimate of the number of rows, provided by AWS Glue Data Catalog's crawler, with an error margin of -50%/+100%. This could feed some kind of iterative or dichotomous (binary-search-like) process, but I could not figure one out. For example, I tried adding an OFFSET clause, with a value slightly below the estimated row count, to the aforementioned COUNT(*) query, but the Lambda running DuckDB timed out. That was disappointing and somewhat surprising, because a query like SELECT * FROM "s3://myBucket/myFile.csv" LIMIT 10 OFFSET 10000000 worked well.
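For reference, here is roughly what the DuckDB-on-Lambda attempt looked like (a sketch from memory; the region/credential wiring is an assumption, relying on the execution role's temporary credentials that Lambda exposes as environment variables):

    import os
    import duckdb

    # Sketch of the Lambda handler body; credential setup is an assumption.
    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    con.execute(f"SET s3_region='{os.environ['AWS_REGION']}';")
    con.execute(f"SET s3_access_key_id='{os.environ['AWS_ACCESS_KEY_ID']}';")
    con.execute(f"SET s3_secret_access_key='{os.environ['AWS_SECRET_ACCESS_KEY']}';")
    con.execute(f"SET s3_session_token='{os.environ['AWS_SESSION_TOKEN']}';")

    # Times out on the 100GB file: DuckDB has to stream every byte from S3.
    row_count = con.execute(
        'SELECT COUNT(*) FROM "s3://myBucket/myFile.csv"'
    ).fetchone()[0]
    print(row_count)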
CodePudding user response:
The fastest solution is probably to use SelectObjectContent with ScanRange to parallelize the request over chunks of roughly 50MB.
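A minimal sketch of that idea with boto3, run from a single Lambda with a thread pool (bucket, key, chunk size and worker count are placeholders; it assumes the CSV has no quoted line breaks, since S3 Select attributes each record to the scan range in which it starts):

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client("s3")
    BUCKET, KEY = "myBucket", "myFile.csv"   # placeholders from the question
    CHUNK = 50 * 1024 * 1024                 # ~50MB per scan range

    def count_range(start, end):
        # COUNT(*) over the records whose first byte lies in [start, end].
        # ScanRange bounds are inclusive, and a record straddling a boundary
        # is counted only in the range where it starts, so the ranges below
        # must not overlap.
        resp = s3.select_object_content(
            Bucket=BUCKET,
            Key=KEY,
            ExpressionType="SQL",
            Expression="SELECT COUNT(*) FROM S3Object",
            InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
            OutputSerialization={"CSV": {}},
            ScanRange={"Start": start, "End": end},
        )
        payload = b"".join(
            event["Records"]["Payload"]
            for event in resp["Payload"]
            if "Records" in event
        )
        return int(payload.decode().strip() or 0)

    size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
    ranges = [(start, min(start + CHUNK, size) - 1)
              for start in range(0, size, CHUNK)]

    with ThreadPoolExecutor(max_workers=32) as pool:
        total = sum(pool.map(lambda r: count_range(*r), ranges))

    # Note: the header line, if any, is included in the first range's count.
    print(total)

For 100GB that is roughly 2,000 requests, so you may prefer to fan the ranges out over several Lambda invocations (e.g. via Step Functions or SQS) instead of a single thread pool, and sum the partial counts at the end.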
CodePudding user response:
Have you tried AWS S3 Select (https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html)? It lets you run SQL queries directly against files stored on S3. I use the service to get basic insight into any file on S3 (provided it can be queried).
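For example, a minimal boto3 call of that kind might look like this (bucket, key and the query itself are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Peek at a few rows of the CSV without downloading the 100GB object.
    resp = s3.select_object_content(
        Bucket="myBucket",
        Key="myFile.csv",
        ExpressionType="SQL",
        Expression="SELECT * FROM S3Object s LIMIT 5",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"JSON": {}},
    )
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode("utf-8"), end="")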