I'm trying to build an application that uses Athena to present CloudTrail logs. A simple query takes around 4 minutes and scans around 200 GB. What I'm looking for is a query where at least one of these fields is used: [ Event name, Event time, User name, Event source, Resource type, Resource name ]
e.g.
SELECT eventid,
       eventname,
       eventsource,
       resources[1].arn,
       resources[1].type,
       useridentity.username
FROM My_Table
WHERE useridentity.username = 'username'
  AND eventtime BETWEEN '2022-07-10T13:14' AND '2022-07-27T13:14'
How can I optimize the query time? I read Top 10 Performance Tuning Tips for Amazon Athena. I'm trying to partition the data, but none of the articles and examples I found have been much help.
Can someone please provide me with a way to partition my data, or another way to accelerate performance? I already have a table, so I want to ALTER it. My S3 bucket URI looks like this:
XXXXXXXXXXXXXXX/us-east-1/2022/07/25/
CodePudding user response:
I would recommend first creating a new table in Snappy-compressed Parquet format. Parquet is a columnar format, so Athena only reads the columns referenced in the query, and the compression reduces the amount of data scanned (and therefore cost and query time).
You can create the new table from the existing table and convert formats in one step with a CTAS (CREATE TABLE AS SELECT) query.
From Examples of CTAS queries - Amazon Athena:
CREATE TABLE new_table
WITH (
      format = 'Parquet',
      write_compression = 'SNAPPY',
      external_location = 's3://my-bucket/tables/parquet_table/')
AS SELECT *
FROM old_table;
Note that this will create new data files in the location specified. Keep in mind that deleting the table in Amazon Athena will not delete the data in the Amazon S3 bucket.
You can then compare performance using new_table.
If you then want to add partitioning, run the same command with a partitioned_by clause.
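As a sketch of that partitioned variant (the event_date column and bucket path here are assumptions; adjust them to your schema, and note that in an Athena CTAS the partition columns must come last in the SELECT list, in the same order as in partitioned_by):

-- Hypothetical partitioned CTAS: derives a date string from the
-- CloudTrail eventtime (e.g. '2022-07-25T13:14:22Z' -> '2022-07-25')
-- and uses it as the partition key.
CREATE TABLE new_table_partitioned
WITH (
      format = 'Parquet',
      write_compression = 'SNAPPY',
      external_location = 's3://my-bucket/tables/parquet_partitioned/',
      partitioned_by = ARRAY['event_date'])
AS SELECT *,
       substr(eventtime, 1, 10) AS event_date
FROM old_table;

Queries that filter on event_date will then only scan the matching partitions. Keep in mind that a single CTAS query can write at most 100 partitions, so for a larger date range you may need to load the data in batches with INSERT INTO.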
For details, see: CREATE TABLE AS - Amazon Athena