I'm confused about how C handles encodings. I have a C file, test1.c, saved with the ISO 8859-1 encoding. When I run the program, the character ÿ is not displayed correctly on the Linux console. I know the console uses UTF-8 by default, but if UTF-8 shares its first 256 characters with ISO 8859-1, why doesn't the program display the 'ÿ' character correctly? A second question: why does test2 display the 'ÿ' character correctly, given that test2.c and file.txt are both UTF-8? In other words, shouldn't the compiler complain that the character is multi-byte?
test1.c
// ISO 8859-1
#include <stdio.h>
int main(void)
{
    unsigned char c = 'ÿ';
    putchar(c);
    return 0;
}
$ gcc -o test1 test1.c
$ ./test1
$ ▒
test2.c
// ASCII
#include <stdio.h>
int main(void)
{
    FILE *fp = fopen("file.txt", "r");
    int c;

    while ((c = fgetc(fp)) != EOF)
        putchar(c);
    return 0;
}
file.txt (UTF-8): abcdefÿghi
$ gcc -o test2 test2.c
$ ./test2
$ abcdefÿghi
Well, that's it. If you can help me by giving more details about this, I would be very grateful. :)
CodePudding user response:
Character encodings can be confusing for many reasons. Here are some explanations:
In the ISO 8859-1 encoding, the character y with a diaeresis ÿ (originally a ligature of i and j) is encoded as the byte value 0xFF (255). The first 256 code points in Unicode do correspond to the same characters as the ones from ISO 8859-1, but the popular UTF-8 encoding for Unicode uses 2 bytes for code points larger than 127, so ÿ is encoded in UTF-8 as 0xC3 0xBF.
When you read the file file.txt, your program reads one byte at a time and outputs it to the console unchanged (except for line endings on legacy systems). The ÿ is read as 2 separate bytes which are output one after the other, and the terminal displays ÿ because the locale selected for the terminal also uses the UTF-8 encoding.
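To see the difference at the byte level, here is a small sketch (assuming a terminal configured for UTF-8, as on a default Linux console): the lone Latin-1 byte shows up garbled, while the two-byte sequence displays as ÿ.

#include <stdio.h>

int main(void)
{
    fputs("Latin-1 byte: ", stdout);
    putchar(0xFF);               /* invalid on its own in UTF-8 -> garbled */
    fputs("\nUTF-8 bytes:  ", stdout);
    putchar(0xC3);
    putchar(0xBF);               /* together they display as ÿ */
    putchar('\n');
    return 0;
}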
Adding to the confusion, if the source file uses the UTF-8 encoding, "ÿ" is a string of length 2 and 'ÿ' is parsed as a multibyte character constant. Multibyte character constants are very confusing and non-portable (the value can be 0xC3BF or 0xBFC3 depending on the system); using them is strongly discouraged, and the compiler should be configured to issue a warning when it sees one (gcc -Wall -Wextra).
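A small experiment, assuming the source file below is itself saved as UTF-8: gcc -Wall -Wextra warns about the multi-character constant, and the printed value is implementation-defined (often 0xC3BF).

#include <stdio.h>

int main(void)
{
    printf("'ÿ' = %#x\n", 'ÿ');                 /* implementation-defined value */
    printf("sizeof \"ÿ\" = %zu\n", sizeof "ÿ"); /* 3: two bytes plus the NUL */
    return 0;
}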
Even more confusing is this: on many systems the type char is signed by default. In this case, the character constant 'ÿ' (a single byte in ISO 8859-1) has a value of -1 and type int, no matter how you write it in the source code: '\377' and '\xff' will also have the value -1. The reason for this is consistency with the value of "ÿ"[0], a char with the value -1. This is also the most common value of the macro EOF.
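This is the source of a classic pitfall, sketched below: storing the result of getchar() in a plain char. If char is signed and the input contains the byte 0xFF (ÿ in ISO 8859-1), the byte is converted to -1, compares equal to EOF, and the loop stops in the middle of the file.

#include <stdio.h>

int main(void)
{
    char c;   /* WRONG: should be int, to hold both 0..UCHAR_MAX and EOF */

    while ((c = getchar()) != EOF)   /* the byte 0xFF becomes -1, looks like EOF */
        putchar(c);
    return 0;
}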
On all systems, getchar() and similar functions like getc() and fgetc() return values between 0 and UCHAR_MAX, or the special negative value EOF. So the byte 0xFF, from a file where the character ÿ is encoded as ISO 8859-1, is returned as the value 0xFF, or 255, which compares different from 'ÿ' if char is signed, and also different from 'ÿ' if the source code is in UTF-8.
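If you need to compare such a byte against a character, a portable sketch is to force the constant through unsigned char, so the test works whether char is signed or not (this assumes ISO 8859-1 input):

#include <stdio.h>

int main(void)
{
    int c;

    while ((c = getchar()) != EOF) {
        if (c == (unsigned char)'\xff')   /* ÿ in ISO 8859-1 */
            puts("found y with diaeresis");
    }
    return 0;
}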
As a rule of thumb: do not use non-ASCII characters in character constants, do not make assumptions about the character encoding used for strings and file contents, and configure the compiler to make char unsigned by default (-funsigned-char).
If you deal with foreign languages, using UTF-8 is highly recommended for all textual content, including source code. Be aware that non-ASCII characters are encoded as multiple bytes with this encoding. Study the UTF-8 encoding; it is quite simple and elegant. Use libraries to handle textual transformations such as uppercasing.
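As an illustration of that simplicity, here is a minimal sketch that decodes a two-byte UTF-8 sequence by hand (it only handles the two-byte case, code points U+0080 to U+07FF):

#include <stdio.h>

int main(void)
{
    unsigned char bytes[] = { 0xC3, 0xBF };  /* ÿ in UTF-8: 110xxxxx 10xxxxxx */
    unsigned cp = ((bytes[0] & 0x1Fu) << 6) | (bytes[1] & 0x3Fu);

    printf("U+%04X\n", cp);                  /* prints U+00FF */
    return 0;
}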
CodePudding user response:
The issue here is that unsigned char represents an unsigned 8-bit integer (from 0 to 255). C uses ASCII values to represent characters: an ASCII character is simply an integer from 0 to 127. For example, A is 65.
When you use 'A', the compiler understands 65. But 'ÿ' is not an ASCII character; it only exists in extended encodings (its value is 255 in ISO 8859-1). Technically it can fit inside an unsigned char, but a character constant only has a simple, single-byte value when it contains a character from the basic character set; for anything else, the value depends on the encoding of the source file.
So that's why the first example didn't work.
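A quick check of that correspondence between characters and integers (assuming an ASCII-based system):

#include <stdio.h>

int main(void)
{
    printf("'A' = %d\n", 'A');   /* prints 65 on ASCII systems */
    return 0;
}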
Now for the second one. A non-ASCII character cannot fit into a single char in UTF-8; the way to handle characters outside the limited ASCII set is to use several chars. When you write ÿ into a file, you are actually writing a binary representation of this character. If you are using the UTF-8 representation, this means that your file contains the two 8-bit values 0xC3 and 0xBF.
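For example, here is a sketch that writes ÿ to a file byte by byte, assuming the UTF-8 representation:

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("file.txt", "w");

    if (fp == NULL)
        return 1;
    fputc(0xC3, fp);   /* first byte of ÿ in UTF-8 */
    fputc(0xBF, fp);   /* second byte */
    fclose(fp);
    return 0;
}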
When you read your file in the while loop of test2.c, at some point c will take the value 0xC3, and then 0xBF on the next iteration. These two values are passed to putchar one after the other, and when displayed, the two bytes together are interpreted as ÿ.
When putchar finally writes the characters, they are eventually read by your terminal application. If it supports the UTF-8 encoding, it understands the meaning of 0xC3 followed by 0xBF and displays a ÿ.
So the reason why, in the first example, you didn't see ÿ is that the value of c in your code is 0xFF (the ISO 8859-1 encoding of ÿ), and a lone 0xFF byte is not a valid UTF-8 sequence, so the terminal cannot display it as a character.
A more concrete example:
#include <stdio.h>

int main(void)
{
    /* the two bytes of ÿ in UTF-8, plus a terminating NUL */
    char y[3] = { '\xC3', '\xBF', '\0' };

    printf("%s\n", y);
    return 0;
}
This will display ÿ, but as you can see, it takes 2 chars to do that.
CodePudding user response:
"if utf-8 uses the same 256 characters as ISO 8859-1" — no, there is a confusion here. In ISO 8859-1 (aka Latin-1), the 256 characters do have the code point value of the corresponding Unicode character. But UTF-8 has a special encoding for all characters above 0x7F, and all characters with a code point between 0x80 and 0xFF are represented as 2 bytes. For example, the character é (U+00E9) is represented as the single byte 0xE9 in ISO 8859-1, but as the 2 bytes 0xC3 0xA9 in UTF-8.
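The encoding direction can be sketched the same way: for a code point between 0x80 and 0x7FF, the two UTF-8 bytes are built as 110xxxxx and 10xxxxxx. For é (U+00E9) this yields 0xC3 0xA9, matching the bytes above (a sketch, not a general-purpose encoder):

#include <stdio.h>

int main(void)
{
    unsigned cp = 0xE9;                     /* é, U+00E9 */
    unsigned char b1 = 0xC0 | (cp >> 6);    /* 110xxxxx -> 0xC3 */
    unsigned char b2 = 0x80 | (cp & 0x3F);  /* 10xxxxxx -> 0xA9 */

    printf("%02X %02X\n", b1, b2);          /* prints C3 A9 */
    return 0;
}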
See the Wikipedia page on UTF-8 for more references.