C# StreamWriter - Problem with the encoding


I have some product data that I want to write into a CSV file. First, I have a function that writes the header into the CSV file:

// note: no encoding specified; this overload defaults to UTF-8 without a BOM
using(StreamWriter streamWriter = new StreamWriter(path))
{
    string[] headerContent = {"banana","apple","orange"};
    string header = string.Join(",", headerContent);
    streamWriter.WriteLine(header);
}

Another function goes over the products and writes their data into the csv file:

// FileMode.Open positions the stream at the start, so this writes over the file from the beginning
using (StreamWriter streamWriter = new StreamWriter(new FileStream(path, FileMode.Open), Encoding.UTF8))
{
    foreach (var product in products)
    {
        await streamWriter.WriteLineAsync(product.ToString());
    }
}

When I write the products with FileMode.Open and Encoding.UTF8, the encoding is set correctly in the file, meaning that special characters in German or French are shown correctly. But the problem is that doing it this way overwrites my header.

The solution I tried was to use FileMode.Append instead of FileMode.Open, which works, but then for some reason the encoding just gets ignored.

What could I do to append the data while maintaining the encoding? And why is this happening in the first place?
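For reference, the append variant I tried is the same writer as above, just with FileMode.Append:

using (StreamWriter streamWriter = new StreamWriter(new FileStream(path, FileMode.Append), Encoding.UTF8))
{
    foreach (var product in products)
    {
        await streamWriter.WriteLineAsync(product.ToString());
    }
}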

EDIT:

Example with FileMode.Open:

Fußpflegecreme

Example with FileMode.Append:

FuÃŸpflegecreme

CodePudding user response:

The important question here is: what does the file actually contain? For example, if I use the following:

using System.Text;

string path = "my.txt";
using (StreamWriter streamWriter = new StreamWriter(new FileStream(path, FileMode.Create), Encoding.UTF8))
{
    streamWriter.WriteLine("Fußpflegecreme 1");
}
using (StreamWriter streamWriter = new StreamWriter(new FileStream(path, FileMode.Append), Encoding.UTF8))
{
    streamWriter.WriteLine("Fußpflegecreme 2");
}
// this next line is lazy and inefficient; only good for quick tests
Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(path)));

then the output is (re-formatted a little):

EF-BB-BF-
46-75-C3-9F-70-66-6C-65-67-65-63-72-65-6D-65-20-31-0D-0A-
46-75-C3-9F-70-66-6C-65-67-65-63-72-65-6D-65-20-32-0D-0A

The first line (note: there aren't any "lines" in the original hex) is the UTF-8 BOM; the second and third lines are the correctly UTF-8 encoded payloads. It would help if you could show the exact bytes that get written in your case. I wonder if the real problem here is that in your version there is no BOM, but the rest of the data is correct.

Some tools, in the absence of a BOM, will choose the wrong encoding. But also, some tools, in the presence of a BOM, will incorrectly show some garbage at the start of the file (and may also, because they're clearly not using the BOM, use the wrong encoding). The preferred option is to specify the encoding explicitly when reading the file, and to use a tool that can handle the presence or absence of a BOM.

Whether or not to include a BOM (especially in the case of UTF-8) is a complex question, and there are pros/cons of each - and there are tools that will work better, or worse, with each. A lot of UTF-8 text files do not include a BOM, but: there is no universal answer. The actual content is still correctly UTF-8 encoded whether or not there is a BOM - but how that is interpreted (in either case) is up to the specific tool that you're using to read the data (and how that tool is configured).
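To illustrate that last point, here is a minimal reading sketch (assuming the my.txt file from the snippet above): StreamReader consumes the BOM when one is present, and falls back to the encoding you pass in when one is absent.

using System.Text;

string path = "my.txt";
// Encoding.UTF8 is the fallback when no BOM is found; when a BOM is present,
// detectEncodingFromByteOrderMarks makes the reader honor it and skip past it.
using (StreamReader streamReader = new StreamReader(path, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
{
    string? line;
    while ((line = streamReader.ReadLine()) != null)
    {
        Console.WriteLine(line);
    }
}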

CodePudding user response:

I think this will be solved once you explicitly choose the UTF-8 encoding when writing the header. This will prefix the file with a BOM.
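Something like this, as a sketch based on the code in the question (path, headerContent and products as defined there):

// Writing the header with an explicit UTF-8 encoding emits the BOM once,
// at the very start of the file.
using (StreamWriter streamWriter = new StreamWriter(path, append: false, Encoding.UTF8))
{
    string[] headerContent = {"banana","apple","orange"};
    streamWriter.WriteLine(string.Join(",", headerContent));
}

// Appending the products keeps the same UTF-8 byte encoding; StreamWriter only
// writes the BOM when the stream is at position 0, so no second BOM ends up
// in the middle of the file.
using (StreamWriter streamWriter = new StreamWriter(new FileStream(path, FileMode.Append), Encoding.UTF8))
{
    foreach (var product in products)
    {
        await streamWriter.WriteLineAsync(product.ToString());
    }
}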
