Failing to load a CSV with JSON string from S3 to Redshift


I have a csv with rows that look the following way:

2021-08-20,2021-10-04,2021-10-04,148355456455712,Accountname,USD,"[{'action_type': 'add_to_cart', 'value': '266.63', '1d_click': '266.63', '7d_click': '266.63'}, {'action_type': 'initiate_checkout', 'value': '213.03', '1d_click': '213.03', '7d_click': '213.03'}, {'action_type': 'view_content', 'value': '762.75', '1d_click': '762.75', '7d_click': '762.75'}, {'action_type': 'omni_add_to_cart', 'value': '266.63', '1d_click': '266.63', '7d_click': '266.63'}, {'action_type': 'omni_initiated_checkout', 'value': '213.03', '1d_click': '213.03', '7d_click': '213.03'}, {'action_type': 'omni_view_content', 'value': '762.75', '1d_click': '762.75', '7d_click': '762.75'}, {'action_type': 'add_to_cart', 'value': '266.63', '1d_click': '266.63', '7d_click': '266.63'}, {'action_type': 'initiate_checkout', 'value': '213.03', '1d_click': '213.03', '7d_click': '213.03'}]"

I am trying to load this CSV to a redshift table with the following schema:

Columns             Type    Nullable    Length  Precision

date_start          varchar true        256     256
date_stop           varchar true        256     256
created_time        varchar true        256     256
account_id          int8    true        19      19
account_name        varchar true        256     256
account_currency    varchar true        256     256
action_values       varchar true        256     256

I'm using the following COPY statement:

copy table_name
from 's3://bucket_name/subdirectory/filename.csv'
delimiter ','
ignoreheader 1
csv quote as '"'
dateformat 'auto'
timeformat 'auto'
access_key_id '...'
secret_access_key '...'
   ;

and I get this error: Load into table 'table_name' failed. Check 'stl_load_errors' system table for details.

When I look at the stl_load_errors table, this is what I see:

query   substring   line    value           err_reason

93558   ...         2   2021-08-20          Invalid digit, Value '[', Pos 0, Type: Long
93558   ...         2   2021-10-04          Invalid digit, Value '[', Pos 0, Type: Long
93558   ...         2   2021-10-04          Invalid digit, Value '[', Pos 0, Type: Long
93558   ...         2   148355456455712     Invalid digit, Value '[', Pos 0, Type: Long
93558   ...         2   Accountname         Invalid digit, Value '[', Pos 0, Type: Long
93558   ...         2   USD                 Invalid digit, Value '[', Pos 0, Type: Long

I just can't figure out why it isn't working, but I guess it has something to do with the JSON string. I also can't understand where this "Type: Long" is coming from.

I am trying to avoid using JSON files as input...

Can anyone help?

CodePudding user response:

Your data looks reasonable (as far as I can tell), except that the last field (the JSON data) is longer than 256 characters. That, however, is not the error you are showing. The output you posted isn't the actual layout of stl_load_errors, so I assume you did some processing on that table before including it in your question. The "Type: Long" refers to the INT8 in your table DDL: INT8 is also known as a BIGINT or a long integer.
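
As a side note on the length issue: Redshift lets you widen a VARCHAR column in place, so something along these lines (just a sketch, reusing the table and column names from your question) would give the JSON field room to fit:

alter table table_name
    alter column action_values type varchar(65535);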

I think your issue is that you haven't told COPY it is reading a CSV file; the default format is DELIMITED. For Redshift COPY to follow CSV rules you need to specify that the file format is CSV. Specifically, I suspect that the double quotes aren't providing the data-value grouping you are expecting.
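
If the CSV keyword really was missing from the statement you actually ran, a COPY shaped roughly like this (a sketch reusing the table name, bucket path, and credential placeholders from your question) should let the double quotes around the JSON field group the value as intended:

copy table_name
from 's3://bucket_name/subdirectory/filename.csv'
access_key_id '...'
secret_access_key '...'
csv
quote as '"'
ignoreheader 1
dateformat 'auto'
timeformat 'auto';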

CodePudding user response:

I worked it out!

Redshift parses every ' as a string delimiter and every , as a field delimiter when it is not inside a string. The solution is to replace every ' with " inside the JSON string, so the row looks the following way:

2021-08-20,2021-10-04,2021-10-04,148355456455712,Accountname,USD,'[{"action_type": "add_to_cart", "value": "266.63", "1d_click": "266.63", "7d_click": "266.63"}, {"action_type": "initiate_checkout", "value": "213.03", "1d_click": "213.03", "7d_click": "213.03"}, {"action_type": "view_content", "value": "762.75", "1d_click": "762.75", "7d_click": "762.75"}, {"action_type": "omni_add_to_cart", "value": "266.63", "1d_click": "266.63", "7d_click": "266.63"}, {"action_type": "omni_initiated_checkout", "value": "213.03", "1d_click": "213.03", "7d_click": "213.03"}, {"action_type": "omni_view_content", "value": "762.75", "1d_click": "762.75", "7d_click": "762.75"}, {"action_type": "add_to_cart", "value": "266.63", "1d_click": "266.63", "7d_click": "266.63"}, {"action_type": "initiate_checkout", "value": "213.03", "1d_click": "213.03", "7d_click": "213.03"}]'

Since I used pandas, the preprocessing code was the following:

# swap single quotes for double quotes in every JSON-array column
for col in array_cols:
    df[col] = df[col].str.replace('\'', '\"')
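
For completeness, here is a minimal sketch of how the cleaned frame might be written back out before re-uploading to S3 (the file names and the read_csv step are hypothetical; in my case df and array_cols already existed). pandas' default CSV quoting wraps the JSON field in double quotes and doubles any embedded double quotes, which Redshift's CSV parser reads back as a single value:

import pandas as pd

df = pd.read_csv('raw_export.csv')       # hypothetical local copy of the source file
array_cols = ['action_values']           # the JSON-bearing column(s)

for col in array_cols:
    # swap the quote style so the JSON uses double quotes internally
    df[col] = df[col].str.replace('\'', '\"')

# the default QUOTE_MINIMAL setting re-quotes fields containing commas or quotes
df.to_csv('clean_export.csv', index=False)   # upload this file to S3, then run the COPY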