I found some illegal data produced by my business code. After debugging, I found the bug is caused by how Spark resolves partition values when reading. What can I do to avoid this problem without changing the partition columns used for writing?
import org.apache.spark.sql.functions.lit
import spark.implicits._

val df = Seq(("122D", 2), ("122F", 2), ("122", 2))
  .toDF("no", "value")
  .withColumn("other", lit(1))

val path = "/user/my/output"
df.write
  .partitionBy("no", "value")
  .parquet(path)
My expected result is that what I read back is the same as what I wrote:
df.show()
+----+-----+-----+
|  no|value|other|
+----+-----+-----+
|122D|    2|    1|
|122F|    2|    1|
| 122|    2|    1|
+----+-----+-----+
// df.distinct.count == 3
The actual read result looks like this:
val read = spark.read.parquet(path)
read.show()
+-----+-----+-----+
|other|   no|value|
+-----+-----+-----+
|    1|122.0|    2|
|    1|122.0|    2|
|    1|122.0|    2|
+-----+-----+-----+
// read.distinct.count == 1
Checking the output, the directory structure is:
└─output
├─no=122
│ └─value=2
├─no=122D
│ └─value=2
└─no=122F
└─value=2
Thanks a lot.
My Spark version is 2.4.5 and my Scala version is 2.11.12.
CodePudding user response:
Just set spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", false) before reading.
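The reason the three values collapse is that Spark's partition type inference tries to parse the directory values as numbers, and Java's Double.parseDouble accepts a trailing D or F type suffix, so "122D", "122F", and "122" all parse to the double 122.0. A minimal sketch of the fix, reusing the path from the question (the commented output is what I'd expect with inference off, not verified against your data):

spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", false)

val read = spark.read.parquet(path)
read.show()
// +-----+----+-----+
// |other|  no|value|
// +-----+----+-----+
// |    1|122D|    2|
// |    1|122F|    2|
// |    1| 122|    2|
// +-----+----+-----+
// read.distinct.count == 3; "no" and "value" are now StringType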
CodePudding user response:
For background: all built-in file sources (including Text/CSV/JSON/ORC/Parquet) are able to discover and infer partitioning information automatically, and the data types of the partitioning columns are inferred as well.
You can set spark.sql.sources.partitionColumnTypeInference.enabled to false.
Keep in mind that when type inference is disabled, string type will be used for the partitioning columns.
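Since the partition columns come back as strings once inference is disabled, you may want to cast the numeric ones back explicitly. A short sketch, reusing the path from the question and assuming "value" is meant to be an Int:

import org.apache.spark.sql.functions.col

// With inference disabled, "no" and "value" are both read as StringType.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", false)

val read = spark.read
  .parquet(path)
  .withColumn("value", col("value").cast("int")) // restore the intended type

// "no" stays a string, so "122", "122D" and "122F" remain distinct values.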