I am processing CSV files uploaded by users. Each CSV has only one column, with the header row "API".
When I process the CSV, for one of the files I see that
header[0].downcase.length returns 4, even though the header reads "API".
Could it be an encoding issue? When I do header[0].downcase.bytes
for the string, I see
[239, 187, 191, 97, 112, 105]
When I do "api".bytes, I see
[97, 112, 105]
Any help in understanding why the length of the downcased "API" header is 4 in the example above would be really great.
I parse the file like this:
CSV.foreach(@file_path, headers: true) do |row|
Thanks.
CodePudding user response:
It looks like in this case the extra character is coming from a BOM (Byte Order Mark): the leading bytes 239, 187, 191 are the UTF-8 encoding of U+FEFF. A BOM is a hidden character that some programs write at the start of a file to indicate its encoding.
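To see this concretely, here is a minimal sketch that rebuilds the header string from the bytes shown in the question. The variable names are made up for illustration, and `delete_prefix` is just one manual way to drop a BOM from a string that is already in memory (it needs Ruby 2.5 or newer).

```ruby
# Rebuild the downcased header from the bytes reported in the question.
bom_bytes = [239, 187, 191]
header = (bom_bytes + "api".bytes).pack("C*").force_encoding("UTF-8")

# The three leading bytes decode to a single code point, U+FEFF (the BOM).
bom_codepoint = header.codepoints.first

# Length counts characters, so the invisible BOM adds one to "api".
header_length = header.length

# Manually strip a leading BOM from a string already in memory.
cleaned = header.delete_prefix("\uFEFF")
```

Here `header_length` is 4 and `cleaned` is the plain three-character "api", which matches the behaviour described in the question.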
One way to handle BOM characters is to specify the bom|utf-* encoding when reading the file:
CSV.open(@file_path, "r:bom|utf-8", headers: true)
When bom|utf-* is used, Ruby checks for a Unicode BOM at the start of the input to help determine the encoding, and if a BOM is found it is stripped out. The same mode string can be passed to the CSV.foreach call from the question. Ruby's IO docs cover this in more detail.
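As a rough end-to-end check, the sketch below writes a small sample file that starts with a UTF-8 BOM and reads it twice: once as plain UTF-8 and once with the BOM-aware mode. The temp file and its contents are hypothetical stand-ins for the uploaded CSV, and the explicit `encoding: "UTF-8"` in the first read is only there to make sure no BOM handling takes place.

```ruby
require "csv"
require "tempfile"

# Hypothetical sample file: a UTF-8 BOM followed by a one-column CSV.
file = Tempfile.new(["bom_sample", ".csv"])
file.binmode
file.write("\xEF\xBB\xBFAPI\nfoo\n".b)
file.close

# Read as plain UTF-8: the first header keeps the invisible U+FEFF character.
raw_headers = CSV.read(file.path, headers: true, encoding: "UTF-8").headers

# Read with "r:bom|utf-8": the BOM is consumed before the CSV parser sees the data.
clean_headers = nil
CSV.foreach(file.path, "r:bom|utf-8", headers: true) do |row|
  clean_headers = row.headers
end

file.unlink
```

With the plain read, `raw_headers.first.length` is 4; with the BOM-aware read, `clean_headers` is `["API"]`.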