Spark 3.2.1: Apache Spark table incompatible data type with parquet


Situation:

  1. A parquet file is generated for me every X amount of time. I can't change the file's column types or the parquet schema, and I can't modify and rewrite the parquet to a new location, because it has to be picked up from where it lands. The process that generates the parquet file can't/won't be changed.

  2. I'm using Databricks with Spark 3.2.1 and trying to create a table that points to the parquet file from (1) with the following code:

    create database if not exists sampledb;
    drop table if exists sampledb.table;
    create table sampledb.table (ID BIGINT, Column1 string) 
    using parquet
    OPTIONS(path='/path/to/parquet/');
    
    
  3. I get the following error:

    com.databricks.sql.io.FileReadException: Error while reading file ........
    Parquet column cannot be converted. Column: [ID], Expected: LongType, Found: INT32
    

What data type should I use when specifying the Spark table schema so it can read the parquet file? I'm open to using Scala, PySpark and/or Python if needed.

CodePudding user response:

BIGINT is an alias for Spark's LongType, but the error shows that the ID column is physically stored in the parquet file as int32. Spark's parquet reader does not widen int32 to a 64-bit type at scan time, so the declared column type has to match the file's physical type. Use the INT type for the ID column.
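If you want to confirm what the file actually contains before changing the DDL, a minimal PySpark check works (the path is the placeholder from the question; on Databricks a `spark` session is already defined in notebooks):

    from pyspark.sql import SparkSession

    # Outside a Databricks notebook, create the session yourself;
    # inside a notebook, `spark` already exists.
    spark = SparkSession.builder.getOrCreate()

    # Let Spark infer the schema from the parquet file's own metadata.
    spark.read.parquet("/path/to/parquet/").printSchema()
    # Given the error above, ID should print as `integer`, not `long`.

With the physical type confirmed, the corrected DDL is: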

    drop table if exists sampledb.table;
    create table sampledb.table (ID INT, Column1 string)
    using parquet
    OPTIONS(path='/path/to/parquet/');
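If consumers of the table still need a 64-bit ID, one workaround (a sketch beyond the original answer; the view name `table_bigint` is made up) is to cast after reading rather than in the table schema:

    from pyspark.sql import functions as F

    # Read through the table (declared with INT, matching the file) and
    # widen ID to bigint for downstream use.
    df = spark.table("sampledb.table").withColumn("ID", F.col("ID").cast("bigint"))
    df.createOrReplaceTempView("table_bigint")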