I have taken a pg_dump of some large tables (>50 GB) from a PostgreSQL RDS instance and stored the output in AWS S3 in csv.gz format. I want to make this data available in AWS Athena as actual tables, just as they exist in RDS, so that when the need arises I can query the data directly without restoring those tables into RDS.
Any suggestions would be much appreciated.
CodePudding user response:
If each file contains the data of a single table in CSV format, you can create an Athena table that queries that data directly in S3.
I find the easiest way is to use AWS Glue:
- Put each file relating to a table in a separate subdirectory in S3
- Create a crawler in AWS Glue, pointing it to the directory
- The crawler will automatically create a table definition in the AWS Glue Data Catalog, which Amazon Athena can then query (a hand-written equivalent is sketched below)
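If you prefer not to run a crawler, or want explicit control over the column types, you can also declare the table yourself in Athena. A minimal sketch, assuming a hypothetical table with id/name/created_at columns exported as comma-separated values under s3://my-bucket/pg-dump/my_table/ (the bucket, path, and columns are placeholders, not taken from the question):

CREATE EXTERNAL TABLE my_table (
    id         bigint,
    name       string,
    created_at timestamp
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/pg-dump/my_table/'
TBLPROPERTIES ('skip.header.line.count' = '1');

Athena decompresses .gz files transparently based on the file extension, so the gzipped CSVs can stay as they are. Note that this simple delimited format does not handle quoted fields; if the PostgreSQL export quotes its strings, the OpenCSVSerDe is the usual alternative.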
If you wish to improve query performance (and reduce the amount of data Athena has to scan), I would recommend converting the data into Snappy-compressed Parquet format.
From Examples of CTAS queries - Amazon Athena:
CREATE TABLE new_table
WITH (
    format = 'Parquet',
    write_compression = 'SNAPPY')
AS SELECT *
FROM old_table;
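Once the CTAS statement completes, the Parquet copy can be queried exactly like the original table, and because Parquet is columnar and Snappy-compressed, each query typically scans (and is billed for) much less data. For example (table and column names are placeholders):

SELECT name, created_at
FROM new_table
WHERE id = 12345;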