I have a DF like:
Cod  Date
1    2022-01-01
1    2022-01-10
1    2022-02-01
2    2022-03-01
2    2022-04-01
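For reference, a minimal way to build this DF and register the table_df view used below (the column types are my assumption):

import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from above; Date is assumed to be a DATE column
df = spark.createDataFrame(
    [
        (1, datetime.date(2022, 1, 1)),
        (1, datetime.date(2022, 1, 10)),
        (1, datetime.date(2022, 2, 1)),
        (2, datetime.date(2022, 3, 1)),
        (2, datetime.date(2022, 4, 1)),
    ],
    "Cod INT, Date DATE",
)
df.createOrReplaceTempView("table_df")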
I'm trying to use Apache Iceberg to partition my DF by Cod/Year/Month using hidden partitioning:
spark.sql("CREATE TABLE local.table USING iceberg PARTITIONED BY (Cod, years(Date), months(Date)) as (SELECT * FROM table_df);")
I got the following error:
pyspark.sql.utils.IllegalArgumentException: Cannot add redundant partition: 1001: Date_year: year(2) conflicts with 1002: Date_month: month(2)
According to the Iceberg docs, this error is intentional: a month transform already implies the year, so partitioning the same column by both years() and months() is redundant. The suggested workaround is:
One workaround is to update the table with additional time based partitions after creating it with a single time based partition.
So I tried to update the partition spec after creating the table (I removed months(Date) from the first query when trying this approach):
spark.sql("ALTER TABLE local.table SET PARTITION SPEC (months(Date));")
But I got the following error:
no viable alternative at input 'ALTER TABLE local.table SET PARTITION'(line 1, pos 28)
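As far as I can tell, two things bite here: Iceberg's DDL has no SET PARTITION SPEC statement (partition evolution is done with ADD/DROP/REPLACE PARTITION FIELD), and those Iceberg-specific ALTER TABLE statements only parse when the Iceberg SQL extensions are enabled on the session. A minimal session setup, with placeholder catalog and warehouse values, looks like:

from pyspark.sql import SparkSession

# Enable the Iceberg SQL extensions so the parser understands
# ALTER TABLE ... ADD PARTITION FIELD; the catalog name, type, and
# warehouse path below are placeholders for a local Hadoop catalog.
spark = (
    SparkSession.builder
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)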
I also tried the PySpark approach:
table.writeTo("local.table").partitionedBy(f.col("Code"), f.year(f.col("Date")), f.month(f.col("Date"))).create()
But I got the following error:
pyspark.sql.utils.AnalysisException: Invalid partition transformation: year(DataReferencia)
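This error seems to come from using the plain projection functions: DataFrameWriterV2.partitionedBy expects the partition transform functions years()/months()/days() from pyspark.sql.functions, not the ordinary year()/month(). A sketch with the transform functions, using the DF from above (note that passing both years(Date) and months(Date) here would still hit the redundant-partition check):

from pyspark.sql import functions as f

# months() is a partition transform (unlike f.month, a plain projection);
# a month transform already implies the year, so it is used alone here.
(
    df.writeTo("local.table")
    .partitionedBy(f.col("Cod"), f.months(f.col("Date")))
    .create()
)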
What can I do to solve this problem?
CodePudding user response:
After a lot of research, I found that Iceberg does not accept redundant partitions in a single spec. The solution I found to partition by year and month was:

First I created a table partitioned only by year:
CREATE TABLE local.table (
    ...
) USING iceberg PARTITIONED BY (years(Date));
Then I added the month partition field with ALTER TABLE:

ALTER TABLE local.table ADD PARTITION FIELD months(Date);
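Putting the two steps together from PySpark with the sample DF from the question (the CTAS form below is my assumption, since the column list is elided above):

# Step 1: create the table partitioned by Cod and year only
spark.sql("""
    CREATE TABLE local.table
    USING iceberg
    PARTITIONED BY (Cod, years(Date))
    AS SELECT * FROM table_df
""")

# Step 2: evolve the spec by adding the month partition field
spark.sql("ALTER TABLE local.table ADD PARTITION FIELD months(Date)")

Partition evolution is a metadata-only change: files already written keep the old layout, and only new writes are partitioned by Cod/year/month.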