Redshift copy statement losing spanish characters


I have a CSV file which contains the following value in one column:

Some words: camión, coche, moto

The copy statement is:

COPY schema.table
FROM 's3://path/file.csv'
IAM_ROLE 'arn'
delimiter ',' IGNOREHEADER 1 FORMAT AS CSV QUOTE AS '"' ;

It fails because the word camión is replaced with cami..n:

Some words: cami..n, coche, moto

As a result, the value exceeds the length defined in the DDL.

Then I tried changing the encoding in the COPY statement:

COPY schema.table
FROM 's3://path/file.csv'
IAM_ROLE 'arn'
delimiter ',' IGNOREHEADER 1 ENCODING AS UTF16 FORMAT AS CSV QUOTE AS '"' ;

and then it raises: Bad UTF16 hex sequence: 2255736572222C2252616E6B696E6720 6E616D65222C2254797065222C224D6F

Is there anything that I am missing? For the UTF-16 attempt I also re-encoded the CSV as utf-16 when writing it:

import csv

out = df.to_csv(index=False, header=True, quoting=csv.QUOTE_ALL, sep=",",
                encoding="utf-16")

CodePudding user response:

The issue with the DDL length is that Redshift stores varchars as multibyte UTF-8, and for non-ASCII characters each character takes up more than one byte of the varchar length. The function octet_length() can be used to find the byte length of a string, which is longer than its character length in these situations. The DDL needs to be adjusted to account for these extra bytes when dealing with non-ASCII strings.
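You can see the character-count vs. byte-count difference (what octet_length() measures in Redshift) directly in Python; this is a minimal illustration, not Redshift itself:

```python
# Redshift VARCHAR(n) limits are in bytes, not characters.
word = "camión"
char_len = len(word)                    # character count, like length() in SQL
byte_len = len(word.encode("utf-8"))    # byte count, like octet_length() in SQL

print(char_len)  # 6
print(byte_len)  # 7 -- "ó" takes two bytes in UTF-8
```

So a VARCHAR(6) column is one byte too small for this value, which is why the load fails against the DDL-defined length.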

The source file appears to be in UTF-8 format as well, so reading it with UTF-16 encoding will not work.
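You can confirm this from the error message itself: the hex sequence Redshift reported decodes cleanly as plain ASCII (and therefore valid UTF-8), which it never would if the bytes were genuine UTF-16:

```python
# Decode the byte sequence from the "Bad UTF16 hex sequence" error.
hex_seq = "2255736572222C2252616E6B696E67206E616D65222C2254797065222C224D6F"
text = bytes.fromhex(hex_seq).decode("utf-8")

print(text)  # "User","Ranking name","Type","Mo
```

That is the CSV header row read one byte per character, so the file on S3 is UTF-8: drop the ENCODING AS UTF16 option (and the utf-16 re-encoding in pandas) and widen the varchar instead.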
