import csv

def readCSV(filename, begin_date="01/07/2020", end_date="30/09/2020"):
    file = open(filename)
    csvreader = csv.reader(file)
    header = []
    header = next(csvreader)
    print(header)

if __name__ == '__main__':
    raw_load_data = readCSV("Total_load_2020.csv")
    raw_forecast_data = readCSV("Total_load_forecast_2020.csv")
The data is in CSV format (downloaded online) and looks like this:
RowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45
...
But the output contains some weird characters that do not exist in the data:
['ï»¿RowDate', 'RowTime', 'TotalLoad']
['ï»¿RowDate', 'RowTime', 'TotalLoadForecast']
Of course, I can easily remove them. But why does that happen in the first place?
CodePudding user response:
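Yes, that's a BOM (byte order mark), as it appears when decoded with the CP1252 encoding^1. The file starts with the three UTF-8 BOM bytes EF BB BF, and it was most likely opened with your system's default encoding (CP1252 on many Windows setups), in which each of those bytes decodes to its own character.
I copied your sample CSV and ran it through GoCSV so that I knew it had a BOM: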
% gocsv clean -add-bom sample.csv > tmp
% mv tmp sample.csv
Read as CP1252, the BOM shows up as three characters, which you can check for and skip:

import csv

with open('sample.csv', 'r', newline='', encoding='cp1252') as f:
    # See if the first three "chars" are a UTF-8 BOM decoded as CP1252 ('ï»¿')
    bom_chars = f.read(3)
    if bom_chars != '\xef\xbb\xbf':
        f.seek(0)  # Not a BOM, reset stream to beginning of file
    else:
        pass  # skip BOM

    reader = csv.reader(f)
    for row in reader:
        print(row)
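As a quick check of where those three characters come from, the sketch below decodes the UTF-8 BOM bytes with different codecs. It uses only the standard library's `codecs.BOM_UTF8` constant; the header bytes in `raw` are made up to mirror your sample file, not read from it.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Hypothetical first line of the file, mirroring the sample data.
raw = bom + b"RowDate,RowTime,TotalLoadForecast\r\n"

as_cp1252 = raw.decode("cp1252")       # each BOM byte becomes its own character
as_utf8 = raw.decode("utf-8")          # the BOM becomes the single character U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")  # the BOM is stripped during decoding

print(repr(as_cp1252[:3]))
print(repr(as_utf8[:1]))
print(repr(as_utf8_sig[:7]))
```

Only the `utf-8-sig` result starts directly with `RowDate`; the other two keep the BOM in the first field, which is what ends up in your header list.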
If you were to read the file as UTF-8, the BOM check would look like this:

with open('sample.csv', 'r', newline='', encoding='utf-8') as f:  # be explicit: the default encoding depends on the platform
    bom_char = f.read(1)
    if bom_char != '\ufeff':
        f.seek(0)  # Not a BOM, reset stream to beginning of file
    reader = csv.reader(f)
Or let Python handle the guesswork for you: the utf_8_sig codec strips a leading BOM if one exists and otherwise reads the file as plain UTF-8:
with open('sample.csv', 'r', newline='', encoding='utf_8_sig') as f:
    reader = csv.reader(f)  # the first field is now 'RowDate', with no BOM
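To tie this back to the original question, here is a minimal sketch of a header reader built on `utf-8-sig`. The names `read_header` and `_write_sample` are hypothetical helpers, and the file contents are made-up sample bytes written to temporary files so the example runs on its own.

```python
import codecs
import csv
import os
import tempfile


def read_header(path):
    """Return the first CSV row, dropping a UTF-8 BOM if one is present."""
    with open(path, "r", newline="", encoding="utf-8-sig") as f:
        return next(csv.reader(f))


def _write_sample(data):
    """Write raw bytes to a temporary .csv file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


# Made-up sample content, once with a BOM and once without.
body = b"RowDate,RowTime,TotalLoad\r\n01/01/2020,00:00,8600.52\r\n"
with_bom = _write_sample(codecs.BOM_UTF8 + body)
without_bom = _write_sample(body)

header_with_bom = read_header(with_bom)
header_without_bom = read_header(without_bom)
print(header_with_bom)
print(header_without_bom)

# Clean up the temporary files.
os.remove(with_bom)
os.remove(without_bom)
```

Both calls return the same header, so the same code works whether or not the downloaded file carries a BOM.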