Why can't I print the decimal value of an extended ASCII char like 'Ç' in C?

First, this C project has some coding constraints: I can't declare a variable and assign a value to it on the same line of code, and we are only allowed to use while loops. For reference, I'm using Ubuntu.

I want to print the decimal ASCII value, character by character, of a string passed to the program. For example, if the input is "rose", the program correctly prints 114 111 115 101. But when I try to print the decimal value of a char like 'Ç', the first char of the extended ASCII table, the program prints -61 -121 instead. Here is the code:

#include <stdio.h>

int main (int argc, char **argv)
{
    int i;

    i = 0;
    if (argc == 2)
    {
        while (argv[1][i] != '\0')
        {
            printf ("%i ", argv[1][i]);
            i++;
        }
    }
}

I did some research and found that I should try declaring argv as unsigned char instead of char, like this:

#include <stdio.h>

int main (int argc, unsigned char **argv)
{
    int i;

    i = 0;
    if (argc == 2)
    {
        while (argv[1][i] != '\0')
        {
            printf("%i ", argv[1][i]);
            i++;
        }
    }
}

In this case, I run the program with a 'Ç' and the output is 195 135 (still wrong).

How can I make this program print the right decimal value of a char from the extended ASCII table? In this case, "Ç" should be 128.

Thank you!!

CodePudding user response：

Your platform is using UTF-8 Encoding.

Unicode Latin Capital Letter C with Cedilla (U+00C7), "Ç", encodes to the two bytes 0xC3 0x87 in UTF-8.

In decimal those bytes are 195 and 135, which is what you see in the output.

Remember that UTF-8 is a multi-byte encoding for characters outside basic ASCII (0 through 127). That character is code point 128 in extended ASCII, but UTF-8 diverges from extended ASCII in that range.

You may find there are tools on your platform to convert that to extended ASCII, but I suspect you don't want to do that and should work with the encoding supported by your platform (which I am sure is UTF-8).

It's Unicode Code Point 199 so unless you have a specific application for Extended ASCII you'll probably just make things worse by converting to it. That's not least because it's a much smaller set of characters than Unicode.

Here's some information for Unicode Latin Capital Letter C with Cedilla including the UTF-8 Encoding: https://www.fileformat.info/info/unicode/char/00C7/index.htm

CodePudding user response：

There are various ways of representing non-ASCII characters, such as Ç. Your question suggests you're familiar with 8-bit character sets such as ISO-8859, where in several of its variants Ç does indeed have code 199. (That is, if your computer were set up to use ISO-8859, your program probably would have worked, although it might have printed -57 instead of 199.)

But these days, more and more systems use Unicode, which they typically encode using a particular multibyte encoding, UTF-8.

In C, one way to extract wide characters from a multibyte character string is the function mbtowc. Here is a modification of your program, using this function:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <locale.h>

int main (int argc, char **argv)
{  
    setlocale(LC_CTYPE, "");

    if (argc == 2)
    {
        char *p = argv[1];
        int n;
        wchar_t wc;
        while((n = mbtowc(&wc, p, strlen(p))) > 0)
        {
            printf ("%lc: %d (%d)\n", (wint_t)wc, (int)wc, n);
            p += n;
        }
    }
}

You give mbtowc a pointer to the multibyte encoding of one or more multibyte characters, and it converts one of them, returning it via its first argument (here, into the variable wc). It returns the number of bytes it consumed, 0 if it encountered the end of the string, or -1 if the bytes do not form a valid character.

When I run this program on the string abÇd, it prints

a: 97 (1)
b: 98 (1)
Ç: 199 (2)
d: 100 (1)

This shows that in Unicode (just like 8859-1), Ç has the code 199, but it takes two bytes to encode it.

Under Linux, at least, the C library can support multiple multibyte encodings, not just UTF-8. It decides which encoding to use based on the current "locale", which is usually part of the environment, literally governed by an environment variable such as $LANG. That's what the call setlocale(LC_CTYPE, "") is for: it tells the C library to pay attention to the environment to select a locale for the program's functions, like mbtowc, to use.

Unicode is of course huge, encoding thousands and thousands of characters. Here's the output of the modified version of your program on the string "abΣ∫