I have a Delta Lake on an S3 bucket. Since I would like to use Spark's SQL API, I need to make the Delta Lake tables available in the local Spark session. Is there a quick way to have all the tables available, without having to create a temporary view for each one?
At the moment this is what I do (let's suppose I have 3 tables in the s3_bucket_path "folder").
s3_bucket_path = 's3a://bucket_name/delta_lake/'
spark.read.format('delta').load(s3_bucket_path + 'table_1').createOrReplaceTempView('table_1')
spark.read.format('delta').load(s3_bucket_path + 'table_2').createOrReplaceTempView('table_2')
spark.read.format('delta').load(s3_bucket_path + 'table_3').createOrReplaceTempView('table_3')
I was wondering if there is a quicker way to make all the tables available (without having to use boto3 and iterate through the folder to get the table names), or whether I'm not following best practices for working with the Spark SQL API: should I use a different approach? I've been studying Spark for a week and I'm not 100% familiar with its architecture yet.
Thank you very much for your help.
CodePudding user response:
Sounds like you'd like to use managed tables, so you have easy access to query the data with SQL, without manually registering views.
You can create a managed table as follows:
df.write.format("delta").saveAsTable("table_1")
The table path and schema information are stored in the Hive metastore (or in another metastore if you've configured one). With managed tables you no longer have to register the views manually.
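As a minimal sketch of how this could look for your case (assuming the same bucket path and table names from your question, a hard-coded list of names purely for illustration, and a Spark session already configured with the Delta Lake package and S3 credentials):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

s3_bucket_path = 's3a://bucket_name/delta_lake/'
table_names = ['table_1', 'table_2', 'table_3']  # illustrative, hard-coded list

for name in table_names:
    # Read each Delta table from S3 and register it as a managed table
    # in the metastore; saveAsTable writes the data into the warehouse location.
    df = spark.read.format('delta').load(s3_bucket_path + name)
    df.write.format('delta').mode('overwrite').saveAsTable(name)

# Once registered, the tables can be queried directly with SQL:
spark.sql('SELECT COUNT(*) FROM table_1').show()

After the tables are registered once, any later session that points at the same metastore can query them with spark.sql or spark.table without repeating the registration step.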