Partition bucket by year and month in PySpark


I have a DF like:

Cod   Date
1     2022-01-01
1     2022-01-10
1     2022-02-01
2     2022-03-01
2     2022-04-01
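
For reproducibility, here is a minimal sketch that builds this sample DF and registers it as table_df, the view name used in the query below; the Date column is cast to date because Iceberg's time transforms only apply to date/timestamp columns:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Sample data from the table above
df = spark.createDataFrame(
    [(1, "2022-01-01"), (1, "2022-01-10"), (1, "2022-02-01"),
     (2, "2022-03-01"), (2, "2022-04-01")],
    ["Cod", "Date"],
).withColumn("Date", col("Date").cast("date"))  # string -> date for the time transforms
df.createOrReplaceTempView("table_df")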

I'm trying to use Apache Iceberg to partition my DF by Cod/Year/Month using hidden partitioning.

spark.sql("CREATE TABLE local.table USING iceberg PARTITIONED BY (Cod, years(Date), months(Date)) as (SELECT * FROM table_df);")

I got the following error:

pyspark.sql.utils.IllegalArgumentException: Cannot add redundant partition: 1001: Date_year: year(2) conflicts with 1002: Date_month: month(2)

According to the Iceberg docs, this error is intentional; the suggested workaround is:

One workaround is to update the table with additional time based partitions after creating it with a single time based partition.

So I tried to update the partition spec with the following statement (I removed months(Date) from the CREATE TABLE above when trying this approach):

spark.sql("ALTER TABLE local.table SET PARTITION SPEC (months(Date));")

But this failed with the following error:

no viable alternative at input 'ALTER TABLE local.table SET PARTITION'(line 1, pos 28)

I also tried the PySpark DataFrame API:

table.writeTo("local.table").partitionedBy(f.col("Cod"), f.year(f.col("Date")), f.month(f.col("Date"))).create()

But I got the following error:

pyspark.sql.utils.AnalysisException: Invalid partition transformation: year(DataReferencia)
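
As far as I can tell, the problem with this attempt is that f.year() and f.month() are ordinary value expressions, while partitionedBy() on DataFrameWriterV2 expects partition transforms; Spark 3 ships those as the plural years()/months() in pyspark.sql.functions. A transform-based sketch like the one below gets past this particular error, but it can still only carry one time-based transform per column, so the redundancy issue remains:

from pyspark.sql import functions as f

# years() (plural) is a partition transform; year() (singular) is a
# value expression and is rejected by partitionedBy()
table = spark.table("table_df")  # handle to the sample DF registered above
(
    table.writeTo("local.table")
    .partitionedBy(f.col("Cod"), f.years(f.col("Date")))  # at most one time transform here
    .create()
)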

What can I do to solve this problem?

CodePudding user response:

After a lot of research, I confirmed that Iceberg does not accept redundant time-based partitions in a single partition spec. The solution I found to partition by both year and month was:

First, I created the table partitioned only by year:

CREATE TABLE local.table(
    ...
) USING iceberg PARTITIONED BY (years(ts));

Then I added the month partition field using ALTER TABLE:

ALTER TABLE local.table ADD PARTITION FIELD months(ts)
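
Putting it together for the original Cod/Date example, the whole flow looks like the sketch below. The catalog settings are assumptions chosen to match the local catalog name used above, and the ADD PARTITION FIELD statement requires Iceberg's Spark SQL extensions to be enabled on the session:

from pyspark.sql import SparkSession

# Iceberg's DDL extensions (ADD PARTITION FIELD, etc.) must be enabled;
# the catalog name/type/warehouse below are placeholder assumptions
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "warehouse")
    .getOrCreate()
)

# 1. Create the table with a single time-based partition (plus Cod)
spark.sql("""
    CREATE TABLE local.table
    USING iceberg
    PARTITIONED BY (Cod, years(Date))
    AS SELECT * FROM table_df
""")

# 2. Evolve the spec to add month granularity; Iceberg allows adding
#    further time-based partition fields after the table exists
spark.sql("ALTER TABLE local.table ADD PARTITION FIELD months(Date)")

Note that partition spec evolution only affects data written after the change; rows already in the table keep their original year-level layout.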