Specify column datatype in pyspark

I am reading an XML file using PySpark SQL and loading it as a DataFrame. The schema looks something like this:

root
 |-- AuditFileCountry: string (nullable = true)
 |-- AuditFileDateCreated: date (nullable = true)
 |-- AuditFileVersion: double (nullable = true)
 |-- Company: struct (nullable = true)
 |    |-- Address: struct (nullable = true)
 |    |    |-- City: string (nullable = true)
 |    |    |-- Country: string (nullable = true)
 |    |    |-- Number: string (nullable = true)
 |    |    |-- PostalCode: long (nullable = true)
 |    |    |-- StreetName: string (nullable = true)
 |    |-- BankAccount: struct (nullable = true)
 |    |    |-- BankAccountNumber: string (nullable = true)
 |    |    |-- CurrencyCode: string (nullable = true)

The problem is that the PostalCode column under the Address struct (inside Company) has values like 01234 in the source file. As you can see from the schema, the column is inferred as a long, so the value appears in the DataFrame as 1234, with the leading 0 lost. And even if I convert the datatype to StringType later on, the 0 is already gone.
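
For illustration, a minimal sketch of the symptom (assuming df is the DataFrame loaded with the schema inferred above):

from pyspark.sql.functions import col

# Once 01234 has been parsed as the long 1234, casting back to string
# just produces '1234' -- the leading zero was lost at read time.
df.select(col('Company.Address.PostalCode').cast('string')).show()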

Is there a way to specify the datatype of this column as StringType when loading the data into the DataFrame?

I know that I could do it using something like

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField('PostalCode', StringType(), True)
])

and then pass this schema when loading the data, but the schema of the DataFrame is nested, and it seems you can't specify a nested column's datatype that simply.

Any idea how I could fix this issue? Any help would be much appreciated!

CodePudding user response:

You cannot simply provide the data type for a single input column. There are two options.

The first one is to disable schema inference completely by setting the reader's inferSchema option to False:

spark.read \
  .format('xml') \
  .option('inferSchema', False) \
  ...

This will result in all XML fields being represented as strings in the dataset, and you'll need to convert them back manually wherever necessary, as sketched below.
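
A minimal sketch of that manual conversion, assuming the spark-xml reader; the rowTag value and file path are placeholders:

from pyspark.sql.functions import col

df = spark.read \
    .format('xml') \
    .option('rowTag', 'root') \
    .option('inferSchema', False) \
    .load('/path/to/file.xml')

# Cast back only the columns that genuinely need a non-string type
# (assuming ISO-formatted dates); PostalCode stays a string, so its
# leading zero survives.
df = df \
    .withColumn('AuditFileDateCreated', col('AuditFileDateCreated').cast('date')) \
    .withColumn('AuditFileVersion', col('AuditFileVersion').cast('double'))

If the schema is fixed, the better option would be to provide the full schema. In your case, that would be something like: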

from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType

schema = StructType([
  StructField('AuditFileCountry', StringType(), True),
  StructField('AuditFileDateCreated', DateType(), True),
  StructField('AuditFileVersion', DoubleType(), True),
  StructField('Company', StructType([
    StructField('Address', StructType([
      StructField('City', StringType(), True),
      StructField('Country', StringType(), True),
      StructField('Number', StringType(), True),
      StructField('PostalCode', StringType(), True),
      StructField('StreetName', StringType(), True)
    ]), True),
    StructField('BankAccount', StructType([
      StructField('BankAccountNumber', StringType(), True),
      StructField('CurrencyCode', StringType(), True)
    ]), True)
  ]), True)
])
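
You can then pass this schema to the reader instead of relying on inference (again, the rowTag value and path are placeholders):

df = spark.read \
    .format('xml') \
    .option('rowTag', 'root') \
    .schema(schema) \
    .load('/path/to/file.xml')

df.select('Company.Address.PostalCode').show()
# PostalCode is now read as a plain string, so a source value such as
# 01234 should keep its leading zero.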