AWS Athena tables for BI tools

Time: 10-28

I did ETL on our data and ran some simple aggregations on it in Athena. Our plan is to use our BI tool to access those tables in Athena. It works for now, but I'm worried that these tables are static, i.e. they only reflect the data as of when I last created the Athena table. When queried, are Athena tables automatically re-run? If not, how do I make them update automatically when our BI tool queries them?

My only solution so far for overwriting our tables is to run two separate queries: one to drop the table and another to re-create it. Since they are two separate queries, I'm not sure they can be run together (at least in the Athena console, you can't run them in one go).
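If you do want to automate the drop-and-recreate workaround, the two statements can be submitted one after the other from a script, waiting for each to finish before sending the next. A minimal sketch using boto3 is below; the results bucket and table/query names are placeholders, and it assumes AWS credentials and a region are already configured.

```python
import time

ATHENA_OUTPUT = "s3://my-athena-results/"  # hypothetical query-results bucket


def refresh_statements(table: str, select_sql: str) -> list[str]:
    """The two statements from the question, in the order they must run."""
    return [
        f"DROP TABLE IF EXISTS {table}",
        f"CREATE TABLE {table} AS {select_sql}",
    ]


def run_query(sql: str) -> None:
    """Submit one statement to Athena and block until it completes."""
    import boto3  # assumption: boto3 installed, credentials configured

    client = boto3.client("athena")
    qid = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {qid} ended in state {state}")
```

Running `run_query()` over each element of `refresh_statements(...)` performs the drop-then-create sequentially. (As the answer below explains, though, this is usually unnecessary with external tables.)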

CodePudding user response:

Amazon Athena is a query engine, not a database.

When a query is sent to Amazon Athena, it looks at the location stored in the table's DDL. Athena then goes to the Amazon S3 location specified and scans the files for the requested data.
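Concretely, the table definition is little more than a schema plus an S3 pointer. A minimal sketch of such a DDL (the table name, columns, and bucket path are hypothetical):

```python
def external_table_ddl(table: str, location: str) -> str:
    """Render a minimal Athena DDL: the table is only a schema plus an S3 pointer."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "  order_id string,\n"
        "  amount   double,\n"
        "  ts       timestamp\n"
        ")\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


print(external_table_ddl("sales", "s3://my-data-lake/sales/"))
```

Every query against `sales` scans whatever objects currently sit under that `LOCATION` prefix.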

Therefore, every Athena query always reflects the data currently stored in the underlying Amazon S3 objects:

  • Want to add data to a table? Then store an additional object in that location.
  • Want to delete data from a table? Then delete the underlying object that contains that data.
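The two bullets above can be sketched with boto3: "adding to" or "deleting from" the table is just writing or removing an object under its prefix. The bucket, key layout, and partition naming here are assumptions, not anything Athena mandates.

```python
from datetime import date


def daily_object_key(prefix: str, day: date) -> str:
    """Deterministic key under the table's LOCATION prefix (naming scheme is an assumption)."""
    return f"{prefix}/dt={day.isoformat()}/part-000.parquet"


def append_data(bucket: str, key: str, body: bytes) -> None:
    """'Add data to the table' == write one more object under its prefix."""
    import boto3  # assumption: boto3 installed, credentials configured

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)


def delete_data(bucket: str, key: str) -> None:
    """'Delete data from the table' == remove the object holding those rows."""
    import boto3

    boto3.client("s3").delete_object(Bucket=bucket, Key=key)
```

The next Athena query over that table will pick up the change automatically, with no DDL involved.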

There is no need to "drop a table, then re-create the table". The table will always reflect the current data stored in Amazon S3. In fact, the table doesn't actually exist -- rather, it is simply a definition of what the table should contain and where to find the data in S3.

The best use-case for Athena is querying large quantities of rarely-accessed data stored in Amazon S3. If the data is frequently accessed and updated, then a traditional database or data warehouse (e.g. Amazon Redshift) would be more appropriate.

Pointing a Business Intelligence tool to Athena is quite acceptable, but you need to have proper processes in place for updating the underlying data in Amazon S3.

I would also recommend storing the data in Snappy-compressed Parquet files, which will make Athena queries faster and cheaper (because Athena is charged based upon the amount of data scanned from Amazon S3).
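A back-of-envelope illustration of that cost saving, assuming the commonly cited per-terabyte rate (check current Athena pricing) and a made-up workload with a ~10x reduction from compression plus columnar pruning:

```python
def scan_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Athena bills by bytes scanned; the rate is an assumption -- check current pricing."""
    return bytes_scanned / 1_000_000_000_000 * usd_per_tb


csv_bytes = 500_000_000_000      # 500 GB of raw CSV (hypothetical workload)
parquet_bytes = csv_bytes // 10  # assume ~10x saving from Snappy + column pruning

print(scan_cost_usd(csv_bytes), scan_cost_usd(parquet_bytes))  # → 2.5 0.25
```

The exact factor depends heavily on the data and on how many columns each query touches, but the direction of the saving holds.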
