I am writing an Airflow automation job that extracts data tables from a Snowflake warehouse and exports them as CSV flat files to SFTP. The CSV files are written to a local drive before being sent to S3 and then to SFTP. I noticed that some character sequences are getting replaced by special characters. Following is the Python code that saves the Snowflake tables as CSV on the local drive.
I am not passing an encoding argument to to_csv, as its default value is utf-8.
import csv
import os

file_name = os.path.join(temp_file_path, _f)
query_output = cur.execute(_sql)

# Write the result set without a header row; the header is prepended below
# using the column names from the cursor metadata.
query_output.fetch_pandas_all().to_csv(file_name, index=False, quoting=csv.QUOTE_ALL,
                                       header=False)

# Prepend the header line built from cur.description.
header = ','.join([col[0] for col in cur.description])
with open(file_name, 'r+') as f:
    content = f.read()
    f.seek(0, 0)
    f.write(f'{header}\n{content}')
For example, there is an attribute whose value in Snowflake is Research – Productivity Support Scheme, but in the CSV exported to the local drive its value is Research ‚Äì Productivity Support Scheme.
Is there a way, most probably as an argument to to_csv, to stop this behaviour? Any hint/suggestion will be highly valued.
Thanks
CodePudding user response:
I think the problem is in how the CSV file is being encoded and decoded. I've found a Q&A where the problem was partially the same as yours. Check the link below:
Encoding/Decoding Unicode and writing CSV
It writes the file correctly, but you are probably displaying the file using an editor or console that is using Windows-1252 encoding.
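In fact, the exact replacement in your example matches UTF-8 bytes decoded as Mac Roman (historically the default for Excel on macOS), which is easy to reproduce in a Python shell. The snippet below is my own illustration, not from the linked post:

raw = '–'.encode('utf-8')           # the en dash is three UTF-8 bytes: b'\xe2\x80\x93'
print(raw.decode('mac_roman'))      # '‚Äì'  - the exact garbling in your example
print(raw.decode('windows-1252'))   # 'â€“'  - the same bytes misread as Windows-1252

Either way, the bytes on disk are valid UTF-8; only the viewer's decoding is wrong.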
- An editor, console, or Excel often doesn't assume UTF-8 by default, though you can fool it by putting a UTF-8 fake BOM (U+FEFF) at the start of the file, or by using the utf-8-sig encoding.
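Applied to your snippet, a minimal sketch (reusing the cur, _sql, and file_name from the question): pass encoding='utf-8-sig' to to_csv, and let pandas write the header row so the BOM stays at the very start of the file instead of being pushed down by the prepend step:

import csv

# Sketch only: assumes the same cursor, SQL, and file path as in the question.
df = cur.execute(_sql).fetch_pandas_all()
# Take the column names from the cursor metadata, as the prepend step did.
df.columns = [col[0] for col in cur.description]
# utf-8-sig writes a BOM (U+FEFF) first, so Excel detects the file as UTF-8.
df.to_csv(file_name, index=False, quoting=csv.QUOTE_ALL,
          header=True, encoding='utf-8-sig')

Note that utf-8-sig only helps viewers that look for a BOM; a console hard-wired to Windows-1252 or Mac Roman will still garble the output.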