Spark writes partition column as string but reads it back as numeric


I found some invalid data produced by my business code. After debugging, I found the bug is caused by how Spark resolves partition columns. What can I do to avoid this problem without changing the partition columns at write time?

import org.apache.spark.sql.functions.lit
import spark.implicits._

val df = Seq(("122D", 2), ("122F", 2), ("122", 2))
      .toDF("no", "value")
      .withColumn("other", lit(1))

val path = "/user/my/output"

df
  .write
  .partitionBy("no", "value")
  .parquet(path)

My expected result is to read back the same data that was written:

df.show()
+----+-----+-----+
|  no|value|other|
+----+-----+-----+
|122D|    2|    1|
|122F|    2|    1|
| 122|    2|    1|
+----+-----+-----+

// df.distinct.count==3

The actual read result looks like this:

val read = spark.read.parquet(path)

read.show()
+-----+-----+-----+
|other|   no|value|
+-----+-----+-----+
|    1|122.0|    2|
|    1|122.0|    2|
|    1|122.0|    2|
+-----+-----+-----+

// read.distinct.count==1

Checking the output directory structure shows:

└─output
    ├─no=122
    │  └─value=2
    ├─no=122D
    │  └─value=2
    └─no=122F
        └─value=2

Thanks a lot. My Spark version is 2.4.5 and my Scala version is 2.11.12.

CodePudding user response:

Just set spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", false) before reading.
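
A minimal round-trip sketch with the fix applied, reusing the spark session and path from the question (the config must be set before the read; the output shown in the comments is what reading without type inference should produce, with no kept as a string):

// Disable partition column type inference; partition values are then read as strings
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", false)

val fixed = spark.read.parquet(path)
fixed.show()
// +-----+----+-----+
// |other|  no|value|
// +-----+----+-----+
// |    1|122D|    2|
// |    1|122F|    2|
// |    1| 122|    2|
// +-----+----+-----+

// fixed.distinct.count == 3, matching the written data
// Note: value is now also read back as a string rather than an int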

CodePudding user response:

For theoretical knowledge: all built-in file sources (including Text/CSV/JSON/ORC/Parquet) are able to discover and infer partitioning information automatically, and the data types of the partitioning columns are inferred automatically. In this case, the partition values 122D and 122F happen to parse as valid Java double literals (a trailing D or F suffix is allowed), so all three partitions are inferred as the numeric value 122.0.

You can set spark.sql.sources.partitionColumnTypeInference.enabled to false.

Note that when type inference is disabled, the string type will be used for the partitioning columns.
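
If changing the session configuration is not an option, a hedged alternative sketch is to pass an explicit schema to the reader, since a user-supplied schema also bypasses partition type inference (the field names mirror the question's data; readTyped is just an illustrative name):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Declare the partition column "no" explicitly as a string so nothing is inferred
val schema = StructType(Seq(
  StructField("other", IntegerType),
  StructField("no", StringType),
  StructField("value", IntegerType)
))

val readTyped = spark.read.schema(schema).parquet(path)
// readTyped.distinct.count == 3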
