I am new to Snowflake and ELT in general. Is there a significant performance difference when loading data into Snowflake using "INSERT INTO" vs. "COPY INTO" from an external stage?
Would using "INSERT INTO" instead of "COPY INTO" turn this ingestion into row-by-row processing, which, from my understanding, hurts performance in column-oriented databases like Snowflake?
Using "INSERT INTO" to load data from an External Stage into Snowflake raw tables:
-- BEGIN INSERT INTO PROCESS
INSERT INTO ORDERS.RAW.FACT_ORDERS (
ID,
ORDER_ID,
PRODUCT_ID,
PRODUCT_PRICE,
QUANTITY,
SALE_FACTOR,
FINAL_PRODUCT_PRICE,
PURCHASE_DATE,
ORDER_RETURN_FLAG,
RETURN_ID,
CUSTOMER_ID,
STORE_ID,
EMPLOYEE_ID,
_METADATA_PARTITION_DATE,
_METADATA_FILE_NAME,
_METADATA_CREATED_BATCH_ID,
_METADATA_UPDATED_BATCH_ID,
_METADATA_CREATED_DATE_TIME,
_METADATA_UPDATED_DATE_TIME
)
SELECT DISTINCT
stg.$1,
stg.$2,
stg.$3,
stg.$4,
stg.$5,
stg.$6,
stg.$7,
stg.$8,
stg.$9,
stg.$10,
stg.$11,
stg.$12,
stg.$13,
to_date($process_date, 'YYYYMMDD') as _METADATA_PARTITION_DATE,
METADATA$filename::varchar(512) as _METADATA_FILE_NAME,
$batch_id,
$batch_id,
$batch_timestamp,
$batch_timestamp
FROM
@ORDERS.RAW.STAGE (pattern => $file_pattern) stg
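For context, the $-prefixed names in both statements ($process_date, $batch_id, $batch_timestamp, $file_pattern) are Snowflake session variables set before the load runs. A minimal sketch of how they might be set (the values below are made up for illustration):
SET process_date    = '20240115';
SET batch_id        = 42;
SET batch_timestamp = CURRENT_TIMESTAMP();
SET file_pattern    = '.*orders.*[.]csv';  -- regex matched against staged file paths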
Using "COPY INTO" to load data from an External Stage into Snowflake raw tables:
-- BEGIN COPY INTO PROCESS
COPY INTO ORDERS.RAW.FACT_ORDERS (
ID,
ORDER_ID,
PRODUCT_ID,
PRODUCT_PRICE,
QUANTITY,
SALE_FACTOR,
FINAL_PRODUCT_PRICE,
PURCHASE_DATE,
ORDER_RETURN_FLAG,
RETURN_ID,
CUSTOMER_ID,
STORE_ID,
EMPLOYEE_ID,
_METADATA_PARTITION_DATE,
_METADATA_FILE_NAME,
_METADATA_CREATED_BATCH_ID,
_METADATA_UPDATED_BATCH_ID,
_METADATA_CREATED_DATE_TIME,
_METADATA_UPDATED_DATE_TIME
)
FROM (
SELECT DISTINCT  -- note: DISTINCT may not be fully supported inside a COPY transformation
stg.$1,
stg.$2,
stg.$3,
stg.$4,
stg.$5,
stg.$6,
stg.$7,
stg.$8,
stg.$9,
stg.$10,
stg.$11,
stg.$12,
stg.$13,
to_date($process_date, 'YYYYMMDD') as _METADATA_PARTITION_DATE,
METADATA$filename::varchar(512) as _METADATA_FILE_NAME,
$batch_id,
$batch_id,
$batch_timestamp,
$batch_timestamp
FROM
    @ORDERS.RAW.STAGE stg
)
PATTERN = $file_pattern
I am trying to understand the best practice for loading large amounts of data from AWS S3 into my Snowflake raw tables.
CodePudding user response:
INSERT INTO - this is optimized to load the data into micro-partitions in an ideal state.
COPY INTO - this is optimized to just load the data.
So, if you have very light or no transformations (like your example), use INSERT INTO; but if you have some transformations, use COPY INTO to load into a temporary table as-is and then use INSERT INTO to transform it.
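A rough sketch of that two-step pattern, assuming a hypothetical temporary table name, a file format defined on the stage, and with the metadata columns abbreviated for brevity:
-- 1) Land the files as-is into a temporary table (plain COPY, no transformation)
CREATE TEMPORARY TABLE ORDERS.RAW.FACT_ORDERS_TMP (
    ID VARCHAR, ORDER_ID VARCHAR, PRODUCT_ID VARCHAR, PRODUCT_PRICE VARCHAR,
    QUANTITY VARCHAR, SALE_FACTOR VARCHAR, FINAL_PRODUCT_PRICE VARCHAR,
    PURCHASE_DATE VARCHAR, ORDER_RETURN_FLAG VARCHAR, RETURN_ID VARCHAR,
    CUSTOMER_ID VARCHAR, STORE_ID VARCHAR, EMPLOYEE_ID VARCHAR
);

COPY INTO ORDERS.RAW.FACT_ORDERS_TMP
FROM @ORDERS.RAW.STAGE
PATTERN = $file_pattern;

-- 2) Transform and de-duplicate with a regular INSERT ... SELECT
INSERT INTO ORDERS.RAW.FACT_ORDERS (
    ID, ORDER_ID, PRODUCT_ID, PRODUCT_PRICE, QUANTITY, SALE_FACTOR,
    FINAL_PRODUCT_PRICE, PURCHASE_DATE, ORDER_RETURN_FLAG, RETURN_ID,
    CUSTOMER_ID, STORE_ID, EMPLOYEE_ID,
    _METADATA_PARTITION_DATE,
    _METADATA_CREATED_BATCH_ID, _METADATA_UPDATED_BATCH_ID,
    _METADATA_CREATED_DATE_TIME, _METADATA_UPDATED_DATE_TIME
)
SELECT DISTINCT
    ID, ORDER_ID, PRODUCT_ID, PRODUCT_PRICE, QUANTITY, SALE_FACTOR,
    FINAL_PRODUCT_PRICE, PURCHASE_DATE, ORDER_RETURN_FLAG, RETURN_ID,
    CUSTOMER_ID, STORE_ID, EMPLOYEE_ID,
    to_date($process_date, 'YYYYMMDD'),
    $batch_id, $batch_id, $batch_timestamp, $batch_timestamp
FROM ORDERS.RAW.FACT_ORDERS_TMP;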
In this kind of scenario, the SELECT takes most of the time, not the insert, so check the query profile after running both statements to determine the best approach.
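For the comparison itself, besides opening the Query Profile in the web UI, one option (an illustrative sketch, not part of the question) is to compare elapsed times from the recent query history:
-- Compare elapsed time of the two load statements from recent query history
SELECT query_text,
       total_elapsed_time / 1000 AS elapsed_seconds,
       rows_produced
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 100))
WHERE query_text ILIKE 'COPY INTO ORDERS.RAW.FACT_ORDERS%'
   OR query_text ILIKE 'INSERT INTO ORDERS.RAW.FACT_ORDERS%'
ORDER BY start_time DESC;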