I have a CSV file in AWS S3. The file is very large, about 2.5 GB.
It has a single column of strings, over 120 million rows:
apc.com
xyz.com
ggg.com
dddd.com
...
How can I query the file to determine whether the string xyz.com is present? I only need to know whether the string is there or not; I don't need to retrieve the file itself.
It would also be great if I could pass multiple strings to search for, and get back only the ones that were found in the file.
For example:
Query => ['xyz.com','fds.com','ggg.com'] will return => ['xyz.com','ggg.com']
CodePudding user response:
If you use the aws s3 cp command, you can stream the object to stdout and pipe it into grep:
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com'
The dash (-) in place of a local destination tells aws s3 cp to write the file to stdout.
Here are two examples of grep checking multiple patterns:
aws s3 cp s3://yourbucket/foo.csv - | grep -e 'apc.com' -e 'dddd.com'
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com\|dddd.com'
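To pass a list of strings and get back only the ones that exist in the file, one possible sketch is to put the patterns in a file and deduplicate the matches; patterns.txt is a placeholder name, one string per line:
printf '%s\n' xyz.com fds.com ggg.com > patterns.txt
aws s3 cp s3://yourbucket/foo.csv - | grep -F -x -f patterns.txt | sort -u
grep -F treats the patterns as fixed strings rather than regexes, -x matches whole lines only, and sort -u removes duplicates so each found string is reported once. Note that this still downloads and scans the full 2.5 GB stream on every query.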
To learn more about grep, please look at the manual: GNU Grep 3.7
CodePudding user response:
The "S3 Select" SelectObjectContent API enables applications to retrieve only a subset of data from an object by using simple SQL expressions. Here's a Python example:
import boto3

client = boto3.client("s3")

res = client.select_object_content(
    Bucket="my-bucket",
    Key="my.csv",
    ExpressionType="SQL",
    InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},  # or IGNORE, USE
    OutputSerialization={"JSON": {}},
    # s._1 refers to the first (and only) column; IN takes a parenthesized list
    Expression="SELECT * FROM S3Object s WHERE s._1 IN ('xyz.com', 'ggg.com')",
)
See this AWS blog post for an example with output parsing.
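The matched rows are not returned directly; they arrive as an event stream under res["Payload"]. A minimal sketch of collecting the found strings, assuming the call above with the JSON output serialization:
import json

# Concatenate the Records events, which carry the results as newline-delimited JSON
payload = b"".join(
    event["Records"]["Payload"] for event in res["Payload"] if "Records" in event
)
found = {json.loads(line)["_1"] for line in payload.decode("utf-8").splitlines() if line}
print(sorted(found))  # e.g. ['ggg.com', 'xyz.com']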