Special (Hungarian and Serbian) characters lost in the string when reading from file

Time:10-26

I'm reading both Hungarian and Serbian words from a text document (tab-delimited, exported from Excel), then writing them to the console. When I write them to the screen, characters outside the English alphabet are not displayed correctly.

For example, instead of körte I get kĂśrte, and instead of kruška I get kruĹĄka.

I'm using StreamReader (and later StreamWriter), and I've set the encoding to iso-8859-2 for both of them, as well as for the console output. This encoding includes both sets of characters I need.

Console.OutputEncoding = Encoding.GetEncoding("iso-8859-2");
using(StreamReader sr = new StreamReader(fIN, Encoding.GetEncoding("iso-8859-2"))) {
using(StreamWriter sw = new StreamWriter(fDB, Encoding.GetEncoding("iso-8859-2"))) {

To check whether the console itself was the problem, I wrote all of these characters to the screen directly, and everything displays with no problem.

Console.WriteLine("á Á é É í Í ó Ó ö Ö ü Ü ű Ű");
Console.WriteLine("č Č ć Ć đ Đ š Š ž Ž");
//outputs properly

To check whether storing these characters was the problem, I put them in a string and displayed it, again with no problems.

string s13 = "á Á é É í Í ó Ó ö Ö ü Ü ű Ű";
Console.WriteLine(s13);
s13 = "č Č ć Ć đ Đ š Š ž Ž ";
Console.WriteLine(s13);
//outputs properly

Stepping through the program in the debugger, it looks like the data is already wrong as soon as it is read from the file.

try {
    using(FileStream fs = new FileStream("DB.txt", FileMode.OpenOrCreate)) {
        using(StreamReader sr = new StreamReader(fs, Encoding.GetEncoding("iso-8859-2"))) {
            while(!sr.EndOfStream) {

                string[] s = sr.ReadLine().Split('\t');   //immediately becomes faulty, even if not split

                HuSrb word = new HuSrb(s[0], s[1]);
                bool found = false;
                foreach(Categories c in categories) {
                    if(c.Name == s[2]) {
                        c.Amount++;
                        c.Words.Add(word);
                        found = true;
                        break;
                    }
                }
                if(!found) {
                    Categories category = new Categories(s[2], word);
                    categories.Add(category);
                }
            }
        }
    }
}
catch(Exception) {
    throw;
}

The funny thing is, later I read from file A into a string, then write the contents of that string into file B. Both file A and file B have the characters right, but the string in between does not.

So basically,

  1. The problem is not with storing the data
  2. The problem is not with printing the data
  3. The problem is not with writing the data into a file.

My assumption is that the problem is when reading from the file, but then I don't understand how it ends up being correct in the other file. Any help?
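
One way to see how the other file can still end up correct is sketched below. It assumes file A is actually stored as UTF-8, which the question does not confirm. iso-8859-2 assigns a character to every byte value, so decoding bytes with it and then encoding the resulting string with the same encoding gives back the original bytes, even though the string in between is garbled. The class and method names are hypothetical, and the `Encoding.RegisterProvider` call is only needed on .NET Core / .NET 5+ (on .NET Framework, where iso-8859-2 is built in, it can be dropped).

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTripDemo
{
    // Assumption: running on .NET Core / .NET 5+, where iso-8859-2
    // must be made available through the code pages provider.
    static RoundTripDemo()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes the bytes as iso-8859-2 (like the StreamReader in the question),
    // encodes the string back as iso-8859-2 (like the StreamWriter),
    // and reports whether the bytes came out unchanged.
    public static bool SurvivesRoundTrip(byte[] fileBytes)
    {
        Encoding latin2 = Encoding.GetEncoding("iso-8859-2");
        string inMemory = latin2.GetString(fileBytes);   // garbled if the bytes were UTF-8
        byte[] written = latin2.GetBytes(inMemory);      // same bytes again
        return written.SequenceEqual(fileBytes);
    }

    public static void Main()
    {
        // Stand-in for the contents of file A, assumed to be UTF-8 on disk.
        byte[] fileA = Encoding.UTF8.GetBytes("körte\tkruška");
        Console.WriteLine(SurvivesRoundTrip(fileA)); // True
    }
}
```

Under that assumption, file B contains the same UTF-8 bytes as file A, so an editor that opens it as UTF-8 shows the right characters, while the in-memory string never did.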

CodePudding user response:

The problem is most likely that the input text file was saved in a different encoding from the one you use to read it. I tried reading and writing your example content with another encoding and it works: I saved the input file as UTF-8 and read the content using Encoding.UTF8.
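
A minimal sketch of that approach follows; it is not the code from the original screenshots. It assumes the file on disk is UTF-8, which matches the garbled output in the question: the two UTF-8 bytes of ö, read as iso-8859-2, come out as Ăś. The class name, helper names, sample row and temporary file are hypothetical. The `Encoding.RegisterProvider` call is only needed on .NET Core / .NET 5+ (on .NET Framework it can be dropped).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class EncodingDemo
{
    // Assumption: running on .NET Core / .NET 5+, where iso-8859-2
    // must be made available through the code pages provider.
    static EncodingDemo()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Reproduces the symptom: UTF-8 bytes decoded as iso-8859-2.
    public static string MisreadAsLatin2(string text)
    {
        Encoding latin2 = Encoding.GetEncoding("iso-8859-2");
        return latin2.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Reads all lines of a file, decoding it as UTF-8.
    public static string[] ReadUtf8Lines(string path)
    {
        var lines = new List<string>();
        using (var sr = new StreamReader(path, Encoding.UTF8))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines.ToArray();
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // Hypothetical sample row, written as UTF-8 without a byte order mark.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "körte\tkruška\tfruit\n", new UTF8Encoding(false));

        foreach (string line in ReadUtf8Lines(path))
        {
            string[] s = line.Split('\t');
            Console.WriteLine(s[0] + " / " + s[1]);      // körte / kruška
        }

        Console.WriteLine(MisreadAsLatin2("körte"));     // kĂśrte
        Console.WriteLine(MisreadAsLatin2("kruška"));    // kruĹĄka

        File.Delete(path);
    }
}
```

If the real file turns out to be in some other encoding, the same pattern applies: pass the encoding the file was actually saved in to the StreamReader.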
