I have a CSV file in AWS S3. The file is very large, about 2.5 GB.
It has a single column of strings, over 120 million rows:
apc.com
xyz.com
ggg.com
dddd.com
...
How can I query the file to determine whether the string xyz.com is present? I only need to know whether the string is there or not; I don't need to retrieve the file itself.
It would also be great if I could pass multiple strings to search for, and get back only the ones that were found in the file.
For example:
Query => ['xyz.com','fds.com','ggg.com'] will return => ['xyz.com','ggg.com']
CodePudding user response:
If you use the aws s3 cp command, you can stream the object to stdout and pipe it into grep:
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com'
The dash (-) in place of a local destination tells aws s3 cp to write the file to stdout.
Here are two examples of grep checking multiple patterns:
aws s3 cp s3://yourbucket/foo.csv - | grep -e 'apc.com' -e 'dddd.com'
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com\|dddd.com'
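To pass a list of strings and get back only the ones that exist in the file, one possible sketch is to put the patterns in a file and deduplicate the matches; patterns.txt is a placeholder name, one string per line:
printf '%s\n' xyz.com fds.com ggg.com > patterns.txt
aws s3 cp s3://yourbucket/foo.csv - | grep -F -x -f patterns.txt | sort -u
grep -F treats the patterns as fixed strings rather than regexes, -x matches whole lines only, and sort -u removes duplicates so each found string is reported once. Note that this still downloads and scans the full 2.5 GB stream on every query.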
To learn more about grep, please look at the manual: GNU Grep 3.7
CodePudding user response:
The "S3 Select" SelectObjectContent API enables applications to retrieve only a subset of data from an object by using simple SQL expressions. Here's a Python example:
import boto3

client = boto3.client("s3")

res = client.select_object_content(
    Bucket="my-bucket",
    Key="my.csv",
    ExpressionType="SQL",
    InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},  # or IGNORE, USE
    OutputSerialization={"JSON": {}},
    # s._1 refers to the first (and only) column; IN takes a parenthesized list
    Expression="SELECT * FROM S3Object s WHERE s._1 IN ('xyz.com', 'ggg.com')",
)
See this AWS blog post for an example with output parsing.
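The matched rows are not returned directly; they arrive as an event stream under res["Payload"]. A minimal sketch of collecting the found strings, assuming the call above with the JSON output serialization:
import json

# Concatenate the Records events, which carry the results as newline-delimited JSON
payload = b"".join(
    event["Records"]["Payload"] for event in res["Payload"] if "Records" in event
)
found = {json.loads(line)["_1"] for line in payload.decode("utf-8").splitlines() if line}
print(sorted(found))  # e.g. ['ggg.com', 'xyz.com']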