how to stop Spark from changing varchar to string


I have a Hive table with the following schema:

hive> desc <DB>.<TN>;
id int,
name varchar(10),
reg varchar(8);

When I describe the same table in Spark (PySpark shell), it converts varchar to string:

spark.sql("""describe <DB>.<TN>""").show()
id int
name string
reg string

I would like to retain the Hive data types while querying in Spark, i.e. I expect varchar in place of string. Does anyone know how to stop Spark from inferring data types on its own? Thanks in advance.

CodePudding user response:

There is no varchar at runtime in Apache Spark; it's all strings. Yes, the documentation does list a VarcharType, but it is only used in schemas, not for data held in a DataFrame.

Once the data is in the DataFrame, this is transparent. When you save the data back, everything should be varchar again in Hive, because the table's metastore schema still declares the columns as varchar.
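
As a rough sketch of that round trip (assuming a Hive-enabled SparkSession; mydb.src_table and mydb.dst_table are hypothetical stand-ins for your <DB>.<TN>):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("mydb.src_table")
df.printSchema()  # the varchar columns surface as string here

# A target table whose metastore schema declares varchar columns:
spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.dst_table (
        id INT, name VARCHAR(10), reg VARCHAR(8)
    ) STORED AS PARQUET
""")

# insertInto resolves against the metastore schema, so the data lands
# in varchar columns even though the DataFrame only ever saw string.
df.write.insertInto("mydb.dst_table")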

You can force a schema when reading into a DataFrame where the source supports it (CSV, for example), but I do not think that applies to Hive, which is already typed. A sketch of the CSV case follows below.
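
For a CSV source that looks roughly like this (file path and column names are hypothetical; StringType is the closest runtime match, since varchar is not a DataFrame type):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("reg", StringType(), True),
])

# The reader uses the supplied schema instead of inferring one.
df = spark.read.schema(schema).csv("/tmp/people.csv", header=True)
df.printSchema()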

CodePudding user response:

I was going to tell you to just add a schema

from pyspark.sql.types import StructType, StructField, IntegerType, VarcharType

# note: VarcharType is only importable in newer PySpark (3.1+)
schema = StructType([
    StructField('ID', IntegerType(), True),
    StructField('name', VarcharType(10), True),
    StructField('reg', VarcharType(8), True),
])
df3 = sqlContext.createDataFrame(rdd, schema)  # rdd: your existing RDD of rows

to a DataFrame, but DataFrames do not have a varchar type in Spark <= 2.4, which is likely why your varchars are being converted to StringType. That isn't to say they aren't available in newer Spark: VarcharType was added to pyspark.sql.types in Spark 3.1, though even there it is a schema-only type.
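
For what it's worth, from Spark 3.1 the catalog keeps char/varchar in table metadata, so DESCRIBE reports the declared types; a small sketch under that assumption (mydb.people is a hypothetical table):

spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.people (
        id INT, name VARCHAR(10), reg VARCHAR(8)
    ) USING parquet
""")

spark.sql("DESCRIBE mydb.people").show()
# On Spark 3.1+ the data_type column shows varchar(10) / varchar(8),
# but the columns still come back as string when you query the table.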
