When trying to read a parquet file with the openrowset
and bulk
options I encounter the following error:
Error handling external file: 'Inserting value to batch for column type DATE failed. Invalid argument provided.'
This problem indicates that the file couldn't be read because the data contains values that are not DATE type - indeed, there are undefined
and 0001-01-01
records, so when the BULK
operation is performed these values prevent from loading the entire parquet file (see the code from below).
SELECT
TOP 100
POST_SEARCH_DATE,
SEARCH_DATE
FROM
OPENROWSET(
BULK 'https://test/refined-parquet/data/v1.0/loaddt=2022-01-01/**',
FORMAT = 'PARQUET'
) AS [result]
One alternative I thought was to convert the column to a varchar to bypass the error but unfortunately, that didn't make any difference (see below).
SELECT
TOP 100
POST_SEARCH_DATE,
SEARCH_DATE
FROM
OPENROWSET(
BULK 'https://test/refined-parquet/data/v1.0/loaddt=2022-01-01/**',
FORMAT = 'PARQUET'
) AS [result]
WHERE cast(POST_SEARCH_DEPARTURE_DATE as varchar(100))!= 'undefined'
and cast (SEARCH_DEPARTURE_DATE as varchar(100)) != 'undefined'
and cast (SEARCH_DEPARTURE_DATE as varchar(100)) != '0001-01-01'
and cast(POST_SEARCH_DEPARTURE_DATE as varchar(100))!= '0001-01-01'
I found some suggestions about inserting every record into a temporary table and from there you can disregard the records that fail. Can someone help me to give me more details about this solution or another alternative?
CodePudding user response:
Have you tried using WITH
to define the data types, while using OPENROWSET?
SELECT
TOP 100
POST_SEARCH_DATE,
SEARCH_DATE,
EXAMPLE_COL
FROM
OPENROWSET(
BULK 'https://test/refined-parquet/data/v1.0/loaddt=2022-01-01/**',
FORMAT = 'PARQUET'
WITH (
[SEARCH_DEPARTURE_DATE] VARCHAR(100) COLLATE Latin1_General_BIN2 3,
[POST_SEARCH_DEPARTURE_DATE] VARCHAR(100) COLLATE Latin1_General_BIN2 2,
[EXAMPLE_COL] VARCHAR(100)
) AS [r]