Schema change in Delta table - How to remove a partition from the table schema without overwriting?


Given a Delta table:

CREATE TABLE IF NOT EXISTS mytable (
 ...
)
USING DELTA
PARTITIONED BY (part_a, part_b, part_c)
LOCATION '/some/path/'

This table already has tons of data. However, the desired schema is:

CREATE TABLE IF NOT EXISTS mytable (
 ...
)
USING DELTA
PARTITIONED BY (part_a, part_b) -- <<-- ONLY part_a and part_b as partitions, i.e., removing part_c
LOCATION '/some/path/'

How can this schema change be achieved?

CodePudding user response:

I eventually took the following approach:

  1. Back up the original table, to be on the safe side
  2. Read the table into a DataFrame with spark.read.table
  3. Write it back to the original location with df.write.option("overwriteSchema", "true")

I chose this approach so I don’t need to change the original data location.

In more detail:

1. Back up the original table, to be on the safe side

Since this was on Databricks, I could use its proprietary deep clone feature:

CREATE TABLE mydb.mytable_backup_before_schema_migration_v1
DEEP CLONE mydb.mytable
LOCATION 'dbfs:/mnt/defaultDatalake/backups/zones/mydb/mytable_backup_before_schema_migration_v1'

If you are not on Databricks and don't have access to its deep clone, you can still back up the table by reading it and writing a copy to another location.
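
For example, a minimal sketch of such a manual backup (the backup path below is hypothetical; adjust it to your environment):

// Hypothetical backup location -- not part of the original answer
val backupPath = "/some/backup/path/mytable_backup_before_schema_migration_v1"

// Read the current table and write a plain Delta copy to the backup location
spark.read
  .format("delta")
  .table("mydb.mytable")
  .write
  .format("delta")
  .mode("overwrite")
  .save(backupPath)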

2. Read and 3. Overwrite with the new schema

val df = spark.read.format("delta").table("mydb.mytable")

df
.write
.format("delta")
.mode("overwrite")
.partitionBy("part_a", "part_b")
.option("overwriteSchema", "true")
.saveAsTable("mydb.mytable") // same table, same location, but different data physical organization because partition changes
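
After the overwrite, you can sanity-check that the partitioning actually changed. One way (assuming Delta Lake's DESCRIBE DETAIL command is available in your environment) is to inspect the table metadata; partitionColumns should now list only part_a and part_b:

// Inspect the Delta table metadata; partitionColumns should show only part_a and part_b
spark.sql("DESCRIBE DETAIL mydb.mytable")
  .select("partitionColumns")
  .show(false)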