I'm trying to understand how to properly connect Redshift Spectrum with Hudi data.
It looks like I can create a Redshift external table directly over data managed in Apache Hudi, as described in this documentation: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html
The other way is to integrate Hudi with the AWS Glue Data Catalog, as mentioned here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html and then access the Hudi tables from Redshift Spectrum through the AWS Glue Data Catalog.
I have the same need for Apache Spark on AWS EMR. It looks like I can use Hudi either directly from EMR or via the AWS Glue Data Catalog.
Right now, I don't understand which way to choose. Could you please advise what the benefit is of using Hudi via the AWS Glue Data Catalog, or whether I should use it directly from Redshift Spectrum and AWS EMR?
Answer:
Spark on EMR needs a catalog (a Hive metastore, if you will), so using the AWS Glue Data Catalog as that metastore is an option.
If you elect to use Glue as the metastore, then use it as the source for all data, unless errors are evident, in which case use the Hudi API for Spark.
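To make the "Glue as the single catalog" option a bit more concrete, here is a minimal sketch of the Hudi write options a Spark job might pass so that each write also syncs the table to the metastore. It assumes the EMR cluster is already configured to use the Glue Data Catalog as its Hive metastore. The helper name `hudi_write_options` and the table, database, and column names are made up for illustration, and depending on your Hudi version you may need additional options.

```python
def hudi_write_options(table, database, record_key, precombine, partition_field):
    """Build Hudi datasource options that write a table and sync it to the metastore.

    Hypothetical helper for illustration; only the option keys are Hudi's own.
    """
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": "upsert",
        # Register or refresh the table in the metastore after each write.
        # Assumption: the cluster's Hive metastore is the Glue Data Catalog.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.partition_fields": partition_field,
    }


# Example values are placeholders, not taken from the question.
opts = hudi_write_options("trips", "lake", "trip_id", "updated_at", "trip_date")

# In a Spark job on EMR (not run here) the options would be used roughly like:
#   df.write.format("hudi").options(**opts).mode("append").save("s3://<bucket>/<prefix>/")
for key in sorted(opts):
    print(f"{key}={opts[key]}")
```

Once the table is registered in the catalog this way, Spark can read it by name, and Redshift Spectrum can reach the same definition through an external schema that points at the Glue database, so there is only one table definition to maintain.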