unrecognized character in header of csv

Time:11-06

import csv

def readCSV(filename, begin_date="01/07/2020", end_date="30/09/2020"):
    file = open(filename)
    csvreader = csv.reader(file)
    header = next(csvreader)
    print(header)

if __name__ == '__main__':
    raw_load_data = readCSV("Total_load_2020.csv")
    raw_forecast_data = readCSV("Total_load_forecast_2020.csv")

The data is a CSV file (downloaded online) and looks like this:

RowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45
...

But the output contains some weird characters that are not in the data:

['ï»¿RowDate', 'RowTime', 'TotalLoad']
['ï»¿RowDate', 'RowTime', 'TotalLoadForecast']

Of course, I can easily remove them. But why does that happen in the first place?

Answer:

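Yes, that's a UTF-8 byte order mark (BOM). Its three bytes (EF BB BF) were decoded as CP1252, which turns them into the characters ï»¿. That happens when the file is opened without an explicit encoding on a system whose default encoding is CP1252, as is typical on Windows.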
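To see where those characters come from, here is a minimal sketch that decodes the BOM bytes both ways. It uses only the standard library; the variable names are illustrative and not part of the original code.

```python
import codecs

# The UTF-8 BOM is the three bytes EF BB BF.
bom_bytes = codecs.BOM_UTF8

# Decoded as CP1252, each byte becomes its own visible character.
as_cp1252 = bom_bytes.decode("cp1252")

# Decoded as UTF-8, the same three bytes form the single code point U+FEFF.
as_utf8 = bom_bytes.decode("utf-8")

print(repr(as_cp1252))
print(repr(as_utf8))
```

So the same three bytes show up as three stray characters under CP1252 and as one invisible character under UTF-8.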

I copied your sample CSV and ran it through GoCSV to make sure it had a BOM:

% gocsv clean -add-bom sample.csv > tmp
% mv tmp sample.csv

Reading that file as CP1252 and checking for those three characters:

import csv

with open('sample.csv', 'r', newline='', encoding='cp1252') as f:
    # See if the first "char" is a BOM
    bom_chars = f.read(3)

    if (bom_chars != 'ï»¿'):
        f.seek(0)  #  Not a BOM, reset stream to beginning of file
    else:
        pass       # skip BOM

    reader = csv.reader(f)
    for row in reader:
        print(row)

If you read the file as UTF-8 instead, the BOM is a single character and the check looks like this:

with open('sample.csv', 'r', newline='', encoding='utf-8') as f:  # be explicit: the default encoding is platform-dependent
    bom_char = f.read(1)

    if (bom_char != '\ufeff'):
        f.seek(0) #  Not a BOM, reset stream to beginning of file

Or let Python handle the guesswork and strip the BOM if one exists, with the utf_8_sig codec:

with open('sample.csv', 'r', newline='', encoding='utf_8_sig') as f:
    reader = csv.reader(f)  # the first header field no longer carries the BOM
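
As a rough end-to-end check, the sketch below writes the sample header to two temporary files, one with a BOM and one without, and reads the header back with different encodings. The helper `read_header` and the file names are hypothetical, made up for this illustration.

```python
import csv
import os
import tempfile


def read_header(path, encoding):
    """Return the first CSV row of `path`, decoded with `encoding`."""
    with open(path, "r", newline="", encoding=encoding) as f:
        return next(csv.reader(f))


# Sample rows taken from the question.
sample = "RowDate,RowTime,TotalLoadForecast\n01/01/2020,00:00,8600.52\n"

tmp_dir = tempfile.mkdtemp()
with_bom = os.path.join(tmp_dir, "with_bom.csv")
without_bom = os.path.join(tmp_dir, "without_bom.csv")

# Writing with utf-8-sig puts a BOM at the start of the file.
with open(with_bom, "w", newline="", encoding="utf-8-sig") as f:
    f.write(sample)
with open(without_bom, "w", newline="", encoding="utf-8") as f:
    f.write(sample)

header_plain = read_header(with_bom, "utf-8")         # BOM kept as '\ufeff'
header_sig = read_header(with_bom, "utf-8-sig")       # BOM stripped
header_nobom = read_header(without_bom, "utf-8-sig")  # nothing to strip

print(header_plain)
print(header_sig)
print(header_nobom)
```

Reading with utf-8-sig gives the same clean header whether or not the file starts with a BOM, which is why it is a convenient choice for CSV files of unknown origin that are otherwise UTF-8.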