How to get latest data by AWS Glue

Time:12-03

I manage some data in AWS, and there are some Parquet files in an S3 bucket. Every day, new files are added to this bucket, and I would like to get the data in the latest file using Athena.

I want to know how to specify the latest file's path in an Athena query. Is it possible to identify the latest file from the path of each Parquet file?

CodePudding user response:

Athena is based on the Presto engine (a fork of which is now Trino). Support for querying a file's timestamp was added to the engine recently, but it is likely to take a while (probably years) before it is available in Athena.

In the meantime, if your Parquet files include a timestamp in the name, you could use the "$path" pseudo-column, which holds the S3 location of the file each row came from, and do something like:

select * from mydb
where "$path" in
(
   select "$path"
   from mydb
   order by "$path" desc
   limit 1
)
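
The query above picks the path that sorts last as a string, so it only returns the newest file if the timestamps in the file names sort chronologically (for example, zero-padded YYYY-MM-DD). A minimal Python sketch of that assumption follows; the bucket name, the file names, and the helper `latest_path` are all hypothetical, and plain string comparison in Python is used only as a stand-in for how the paths would be ordered.

```python
def latest_path(paths):
    """Return the path that sorts last as a plain string,
    mirroring ORDER BY "$path" DESC LIMIT 1."""
    if not paths:
        raise ValueError("no paths given")
    return max(paths)


# Hypothetical keys with zero-padded dates: string order matches date order.
padded = [
    "s3://example-bucket/data/2021-12-01.parquet",
    "s3://example-bucket/data/2021-12-03.parquet",
    "s3://example-bucket/data/2021-12-02.parquet",
]

# Hypothetical keys without zero padding: "9" sorts after "1", so the
# September file is picked even though the October file is newer.
unpadded = [
    "s3://example-bucket/data/2021-9-30.parquet",
    "s3://example-bucket/data/2021-10-01.parquet",
]

if __name__ == "__main__":
    print(latest_path(padded))
    print(latest_path(unpadded))
```

If the file names are not zero-padded, the same mismatch would be expected in the query, so it is worth checking the naming scheme before relying on this approach.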