I'm trying to update the content of a Redshift cluster table from PySpark like this:
from pyspark.sql.functions import lit

content = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("aws_iam_role", role_arn) \
    .option("url", host) \
    .option("dbtable", "schema.table") \
    .option("user", user) \
    .option("password", password) \
    .option("tempdir", aws_bucket_name) \
    .load()

content = content.withColumn('column', lit("test"))
content.write \
    .format("com.databricks.spark.redshift") \
    .option("aws_iam_role", role_arn) \
    .option("url", host) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", "schema.table") \
    .option("tempdir", aws_bucket_name) \
    .mode("overwrite") \
    .save()
The table content is saved properly, but after the overwrite operation the other users of the Redshift cluster lose their privileges on the table (they can no longer SELECT, UPDATE, etc.).
I have read that this happens because Spark internally drops the table and creates a new one. Is there any way of updating the table's content from Spark that doesn't remove the permissions?
CodePudding user response:
This is caused by the overwrite operation, which drops and recreates the table, and therefore also deletes its metadata, including the grants.
Reading the spark-redshift docs, there are two options that help (see the sketch after this list):
preactions
You can use it to truncate the table first, and then just append the data instead of overwriting it.
postactions
To GRANT the privileges back after the overwrite operation.
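A minimal sketch of both approaches, reusing the connection variables from the question. The GRANT statement and the group name analytics_users are placeholders; replace them with whatever grants your users actually need:

# Option 1: truncate before loading, then append.
# The table is never dropped, so the existing grants survive.
content.write \
    .format("com.databricks.spark.redshift") \
    .option("aws_iam_role", role_arn) \
    .option("url", host) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", "schema.table") \
    .option("tempdir", aws_bucket_name) \
    .option("preactions", "TRUNCATE TABLE schema.table;") \
    .mode("append") \
    .save()

# Option 2: keep the overwrite, but re-grant afterwards.
# "analytics_users" is a placeholder group name.
content.write \
    .format("com.databricks.spark.redshift") \
    .option("aws_iam_role", role_arn) \
    .option("url", host) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", "schema.table") \
    .option("tempdir", aws_bucket_name) \
    .option("postactions", "GRANT SELECT ON schema.table TO GROUP analytics_users;") \
    .mode("overwrite") \
    .save()

Of the two, the preactions/append variant is the safer one here, because the table's identity (and everything attached to it) is preserved rather than restored after the fact.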
CodePudding user response:
Loren is correct in his suggestions, but there is another action you can take. You can set the user's default privileges in Redshift so that any object this user creates gets the permissions automatically. This only needs to be done once for this user and wouldn't impact your current code.
To set a user's default privileges you use the ALTER DEFAULT PRIVILEGES command on Redshift. A user can change their own default privileges, or it can be done by a superuser (admin). See:
https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_DEFAULT_PRIVILEGES.html
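For example, a sketch of what this could look like, run once on the Redshift cluster. Here spark_user (the user your Spark job connects as) and the group analytics_users are placeholder names; adjust the schema and the list of privileges to your case:

ALTER DEFAULT PRIVILEGES FOR USER spark_user IN SCHEMA schema
    GRANT SELECT, UPDATE ON TABLES TO GROUP analytics_users;

With this in place, any table that spark_user creates in that schema, including the one Spark recreates on overwrite, is granted to the group automatically.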