Home > other >  How to deal with failing Athena queries as AWS Glue datacatalog metada size grows large?
How to deal with failing Athena queries as AWS Glue datacatalog metada size grows large?

Time:01-26

Based on my research, the easiest and the most straight forward way to get metadata out of Glue's Data Catalog, is using Athena and querying the information_schema database. The article below has come up frequently in my research and is written by Amazon's team:

Querying AWS Glue Data Catalog

However, under the section titled Considerations and limitations the following is written:

Querying information_schema is most performant if you have a small to moderate amount of AWS Glue metadata. If you have a large amount of metadata, errors can occur.

Unfortunately, in this article, there do not seem to be any indications or suggestion regarding what constitutes as "large amount of metadata" and exactly what errors could occur when the metadata is large and one needs to query the metadata. My question is, how to deal with the issue related to the ever growing size of data catalog's metadata so that one would never encounter errors when using Athena to query the metadata? Is there a best practice for this? Or perhaps a better solution for getting the same metadata that querying the catalog using Athena provides without multiple or great many API calls (using boto3, Hive DDL etc)?

CodePudding user response:

I talked to AWS Support and did some research on this. Here's what I gathered:

  • The information_schema is built at query execution time, there doesn't seem to be any caching.
  • If you access information_schema.tables, it will make separate calls for each schema you have to the Hive Metastore (Glue Data Catalog).
  • If you access information_schema.columns, it will make separate calls for each schema and each table in that schema you have to the Hive Metastore.
  • These queries are affected by the general service quotas. In this case, DML queries like your select must finish within 30 minutes.

If your Glue Data Catalog has many thousands of schemas, tables, and columns all of this may result in slow performance. As a rough guesstimate support told me that you should be fine as long as you have less than ~ 10000 tables, which should be the case for most people.

  • Related