I would like to provide numbers when creating a Spark dataframe, but I run into issues with decimal type numbers.
This way the number gets truncated:
df = spark.createDataFrame([(10234567891023456789.5, )], ["numb"])
df = df.withColumn("numb_dec", F.col("numb").cast("decimal(30,1)"))
df.show(truncate=False)
#+---------------------+----------------------+
#|numb                 |numb_dec              |
#+---------------------+----------------------+
#|1.0234567891023456E19|10234567891023456000.0|
#+---------------------+----------------------+
This fails:
df = spark.createDataFrame([(10234567891023456789.5, )], "numb decimal(30,1)")
df.show(truncate=False)
TypeError: field numb: DecimalType(30,1) can not accept object 1.0234567891023456e+19 in type <class 'float'>
How do I correctly provide big decimal numbers so that they don't get truncated?
CodePudding user response:
This is most likely a floating-point precision issue: a Python float is a 64-bit double and cannot hold that many significant digits, so the value is already truncated before Spark ever sees it. You can try passing string values when creating the dataframe instead:
df = spark.createDataFrame([("10234567891023456789.5", )], ["numb"])
df = df.withColumn("numb_dec", F.col("numb").cast("decimal(30,1)"))
df.show(truncate=False)
#+----------------------+----------------------+
#|numb                  |numb_dec              |
#+----------------------+----------------------+
#|10234567891023456789.5|10234567891023456789.5|
#+----------------------+----------------------+
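As a quick sanity check (a small sketch on top of the snippet above), collecting the cast column shows that PySpark hands the value back as a Python decimal.Decimal, so the full precision survives the round trip:

row = df.select("numb_dec").collect()[0]
print(type(row["numb_dec"]), row["numb_dec"])
# <class 'decimal.Decimal'> 10234567891023456789.5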
CodePudding user response:
Try something like below -
from pyspark.sql.types import StructType, StructField, DecimalType
from decimal import Context

# Declare the column as DecimalType(30,1) up front
schema = StructType([StructField('numb', DecimalType(30,1))])

# Build the value as a decimal.Decimal (not a float) so no precision is lost;
# the context precision of 30 matches the DecimalType precision
data = [( Context(prec=30, Emax=999, clamp=1).create_decimal('10234567891023456789.5'), )]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
+----------------------+
|numb                  |
+----------------------+
|10234567891023456789.5|
+----------------------+
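The explicit Context is only needed if you want to control precision and rounding during arithmetic; the Decimal string constructor already stores the value exactly, so a plain Decimal(...) should work just as well (a simplified sketch of the same idea):

from decimal import Decimal

data = [(Decimal('10234567891023456789.5'), )]  # exact, no float involved
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
# same output as above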