Given a Delta table:
CREATE TABLE IF NOT EXISTS mytable (
...
)
USING DELTA
PARTITIONED BY part_a, part_b, part_c
LOCATION '/some/path/'
This table already has tons of data. However, the desired schema is:
CREATE TABLE IF NOT EXISTS mytable (
...
)
USING DELTA
PARTITIONED BY part_a, part_b -- <<-- ONLY part_a and part_b for partitions, ie, removing part_c
LOCATION '/some/path/'
How this schema change can be achieved?
CodePudding user response:
I eventually took the following approach:
- Backup the original table to be on the safe-side
- spark.read.table into memory
- df.write.option(“overwriteSchema”, “true”) to the original location
I chose this approach so I don’t need to change the original data location.
In more details:
1. Backup the original table to be on the safe-side
Since this was in on Databricks I could use their proprietary deep clone
feature:
create table mydb.mytable_backup_before_schema_migration_v1
deep clone mydb.mytable
location 'dbfs:/mnt/defaultDatalake/backups/zones/mydb/mytable_backup_before_schema_migration_v1'
If you are not in Databricks and don't have access to its deep clone
, you still can backup the table by reading and writing a copy to another place.
2. read and 3. overwrite with new schema
val df = spark.read.format("delta").table("mydb.mytable")
df
.write
.format("delta")
.mode("overwrite")
.partitionBy("part_a", "part_b")
.option("overwriteSchema", "true")
.saveAsTable("mydb.mytable") // same table, same location, but different data physical organization because partition changes