How to communicate with C stdout?

How can I tell which way my terminal expects to receive data? Should I use fputc, fputwc, or something else?

Here is what I've tried with fputc:

// characters.c
#include <ctype.h>
#include <stdio.h>
int main ( void )
{
    for ( int c = 0 ; c <= 65535 ; c ++ )
            if ( isgraph ( c ) ) fputc ( c , stdout ) ;
}

And with fputwc:

// wcharacters.c
#include <ctype.h>
#include <stdio.h>
#include <wcahr.h>
int main ( void )
{
    for ( wchar_t c = 0 ; c <= 65535 ; c ++ )
            if ( isgraph ( c ) ) fputwc ( c , stdout ) ;
}

With characters.c, the output is consistent with the ASCII characters for the first few items printed on the terminal (!"#, etc.), then, after ~, it's garbage.

I get similar results with wcharacters.c, except that instead of garbage it's ?'s or some other ASCII character (but with the great majority being ?'s).

I know the same terminal supports many character representations in the Unicode code point range 33 to 65535 (decimal), as I can print many of those characters using Python 3.

I am using Trisquel GNU/Linux, gcc (-std=c99), and the "MATE Terminal." The font is "Monospace," which sounds generic, but it seems to support many more characters beyond the ASCII range.

I am open to using a more recent C standard. At the same time, one of the points of the project I am working on is simplicity, and the C standard seems to become more and more complicated with each successive standard (would using C17's uchar.h be simpler?).

With the function I am working on, the user should be able to write any of the graphic characters (as in isgraph) to stdout and have the appearance of those characters be "correct."

CodePudding user response：

How can I tell which way my terminal expects to receive data?

isgraph() is fine for narrow characters, but not for wide ones.

It is best to call the is...() functions only over a restricted range, not [0 ... 65535].

In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined. C17dr § 7.4 1

is...() is not useful for values outside that range.

Aside: for ( wchar_t c = 0 ; c <= 65535 ; c ++ ) risks an infinite loop when wchar_t is 16-bit unsigned.
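
A minimal sketch of both points, assuming the default "C" locale: guard the call so isgraph() only ever sees values it is defined for, and count with a long (at least 32 bits) so that c <= 65535 always terminates. The helper names is_graph_safe and count_graph_up_to are hypothetical, not from the question.

```c
#include <ctype.h>
#include <limits.h>

/* Hypothetical helper: nonzero only when c lies in the range the
   <ctype.h> functions accept AND isgraph() reports it as graphic. */
int is_graph_safe(long c)
{
    if (c < 0 || c > UCHAR_MAX)
        return 0;               /* outside isgraph()'s domain: do not call it */
    return isgraph((int)c) != 0;
}

/* Counts graphic characters among the values 0..limit.
   The counter is a long, so the loop cannot wrap around at 65535. */
long count_graph_up_to(long limit)
{
    long n = 0;
    for (long c = 0; c <= limit; c++)
        if (is_graph_safe(c))
            n++;
    return n;
}
```

In the "C" locale the graphic characters are '!' through '~', so count_graph_up_to(127) gives 94.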


CodePudding user response：

You cannot print a Unicode code point directly; most terminals accept only UTF-8 encoded strings by default.

#include <stdio.h>
#include <stdint.h>

/**
 * Encode a code point using UTF-8
 *
 * @author Ondřej Hruška <[email protected]>
 * @license MIT
 *
 * @param out - output buffer (min 5 characters), will be 0-terminated
 * @param utf - code point 0-0x10FFFF
 * @return number of bytes on success, 0 on failure (also produces U+FFFD, which uses 3 bytes)
 */
int utf8_encode(char *out, uint32_t utf)
{
  if (utf <= 0x7F) {
    // Plain ASCII
    out[0] = (char) utf;
    out[1] = 0;
    return 1;
  }
  else if (utf <= 0x07FF) {
    // 2-byte unicode
    out[0] = (char) (((utf >> 6) & 0x1F) | 0xC0);
    out[1] = (char) (((utf >> 0) & 0x3F) | 0x80);
    out[2] = 0;
    return 2;
  }
  else if (utf <= 0xFFFF) {
    // 3-byte unicode
    out[0] = (char) (((utf >> 12) & 0x0F) | 0xE0);
    out[1] = (char) (((utf >>  6) & 0x3F) | 0x80);
    out[2] = (char) (((utf >>  0) & 0x3F) | 0x80);
    out[3] = 0;
    return 3;
  }
  else if (utf <= 0x10FFFF) {
    // 4-byte unicode
    out[0] = (char) (((utf >> 18) & 0x07) | 0xF0);
    out[1] = (char) (((utf >> 12) & 0x3F) | 0x80);
    out[2] = (char) (((utf >>  6) & 0x3F) | 0x80);
    out[3] = (char) (((utf >>  0) & 0x3F) | 0x80);
    out[4] = 0;
    return 4;
  }
  else {
    // error - use replacement character
    out[0] = (char) 0xEF;
    out[1] = (char) 0xBF;
    out[2] = (char) 0xBD;
    out[3] = 0;
    return 0;
  }
}

int main ( void )
{
    char out[6];
    for ( uint32_t c = 0 ; c <= 65535 ; c++ ) {
        utf8_encode(out, c);
        printf("%s", out);
    }
}
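
One caveat about looping over 0 to 65535: that range includes the surrogates U+D800 to U+DFFF, which are not valid in UTF-8, so a terminal will usually show a replacement glyph for them. A minimal sketch of a filter that could be applied before encoding follows; the names is_unicode_scalar and utf8_length are hypothetical helpers, not part of the code above.

```c
#include <stdint.h>

/* Hypothetical helper: 1 when cp is a Unicode scalar value, i.e. within
   0..0x10FFFF and not a surrogate (U+D800..U+DFFF); 0 otherwise. */
int is_unicode_scalar(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    return 1;
}

/* Number of bytes the UTF-8 encoding of cp needs, or 0 if cp cannot be
   encoded. The thresholds mirror the branches of a standard encoder. */
int utf8_length(uint32_t cp)
{
    if (!is_unicode_scalar(cp))
        return 0;
    if (cp <= 0x7F)
        return 1;
    if (cp <= 0x7FF)
        return 2;
    if (cp <= 0xFFFF)
        return 3;
    return 4;
}
```

For example, U+20AC (the euro sign) needs 3 bytes, which an encoder emits as E2 82 AC.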

CodePudding user response：

How to communicate with C stdout?

It is very implementation defined.

I am using ... GNU/Linux

To communicate with the Linux terminal driver, you use ioctls or the termios.h functions. To communicate with the actual terminal, you can put the terminal into raw mode and send a special escape sequence; the terminal then writes its answer to your stdin. For example, ESC[6n requests the cursor position, which is reported as ESC[#;#R.
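
A rough sketch of that cursor-position query on a POSIX system, assuming stdin and stdout are attached to the same terminal. The function names parse_cursor_report and query_cursor_position are made up for illustration, and the one-second timeout is an arbitrary choice.

```c
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

/* Parses a report of the form ESC [ row ; col R. Returns 0 on success. */
int parse_cursor_report(const char *buf, int *row, int *col)
{
    return sscanf(buf, "\033[%d;%dR", row, col) == 2 ? 0 : -1;
}

/* Asks the terminal for the cursor position. Returns 0 on success,
   -1 if stdin is not a terminal or no usable reply arrived. */
int query_cursor_position(int *row, int *col)
{
    struct termios saved, raw;
    char buf[32];
    size_t len = 0;

    if (tcgetattr(STDIN_FILENO, &saved) != 0)
        return -1;                              /* not a terminal */
    raw = saved;
    raw.c_lflag &= ~(tcflag_t)(ICANON | ECHO);  /* no line buffering, no echo */
    raw.c_cc[VMIN] = 0;                         /* read() may return 0 bytes */
    raw.c_cc[VTIME] = 10;                       /* after a 1 second timeout */
    if (tcsetattr(STDIN_FILENO, TCSANOW, &raw) != 0)
        return -1;

    if (write(STDOUT_FILENO, "\033[6n", 4) == 4) {
        /* Collect the reply one byte at a time until the final 'R'. */
        while (len < sizeof buf - 1) {
            if (read(STDIN_FILENO, buf + len, 1) != 1)
                break;
            if (buf[len++] == 'R')
                break;
        }
    }
    buf[len] = '\0';
    tcsetattr(STDIN_FILENO, TCSANOW, &saved);   /* always restore settings */
    return parse_cursor_report(buf, row, col);
}
```

Restoring the saved termios settings before returning matters: otherwise the shell is left without echo or line editing.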

How can I tell which way my terminal expects to receive data?

From the locale. For example, LC_ALL=en_US.UTF-8 means it wants characters in UTF-8, and LC_ALL=tr_TR.ISO-8859-9 means it wants characters in the ISO-8859-9 encoding.

Note that from cppreference setlocale:

During program startup, the equivalent of setlocale(LC_ALL, "C"); is executed before any user code is run.

It's typical to call setlocale(LC_ALL, "") at the start of main to use the locale configured in the environment, which should match what the terminal expects.
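
To see which encoding a locale implies, POSIX provides nl_langinfo(CODESET). A minimal sketch follows, assuming a POSIX system; codeset_for_locale is a hypothetical helper name.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: switches the whole program to the named locale
   ("" means "take it from the environment") and returns the name of its
   character encoding, or NULL if the locale is not available. */
const char *codeset_for_locale(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

With LC_ALL=en_US.UTF-8 in the environment, codeset_for_locale("") would typically return "UTF-8" on glibc; the exact strings are implementation-specific.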

Should I use fputc, fputwc, or something else?

It does not matter, but pick one and stick with it: once a stream has been given an orientation (byte or wide), it is hard to switch it back. With fputc you can output, as multibyte sequences, the same characters that fputwc outputs as wide characters, although there may be locales in which a given wide character has no multibyte representation. On Linux with a UTF-8 locale, fputc can emit as UTF-8 bytes the same code points that fputwc takes as UTF-32. The "Introduction to Extended Characters" section of the glibc manual is a good read.

When you use fputwc, glibc converts UTF-32 to the locale-specific encoding behind the scenes. It basically calls iconv for each character.
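
The orientation of a stream can be inspected with the standard fwide() function; calling it with mode 0 only queries and does not change anything. A small sketch (stream_orientation is a hypothetical helper name):

```c
#include <stdio.h>
#include <wchar.h>

/* Reports the current orientation of a stream without altering it. */
const char *stream_orientation(FILE *fp)
{
    int o = fwide(fp, 0);       /* mode 0: query only */
    if (o > 0)
        return "wide";
    if (o < 0)
        return "byte";
    return "unoriented";
}
```

A freshly opened stream is unoriented; the first fputc makes it byte-oriented, and the first fputwc makes it wide-oriented.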

after ~, it's garbage.

isgraph ( c ) has undefined behavior when c is negative or greater than UCHAR_MAX, unless c equals EOF. #include <wcahr.h> is invalid; the header is <wchar.h>.

I have locale LC_CTYPE=pl_PL.UTF-8. The following program:

#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
int main() {
    setlocale(LC_ALL, "");
    for (wchar_t c = 63000; c <= 65535; c++)
        if (iswgraph(c))
            fputwc(c, stdout);
}

will output some graphic characters:

...

As mentioned, you can do the same with fputc if you wanna:

#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include <stdlib.h>
#include <limits.h>
int main() {
    setlocale(LC_ALL, "");
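    for (wchar_t c = 63000; c <= 65535; c++) {
        if (iswgraph(c)) {
            char s[MB_LEN_MAX];
            // wctomb does not null-terminate s; its return value is the
            // number of bytes written, or -1 if c cannot be represented
            int n = wctomb(s, c);
            for (int i = 0; i < n; i++) {
                fputc(s[i], stdout);
            }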
        }
    }
}

except that instead of garbage it's ?'s or som

Your program is using the C locale. The C locale can't represent anything other than ASCII.
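
This can be checked with wctomb(), which returns -1 when a wide character has no representation in the current locale. A minimal sketch, with the hypothetical helper name encodable_in_locale; the behavior described for the "C" locale is what glibc does and is assumed here.

```c
#include <limits.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if wc can be converted to a multibyte
   sequence in the named locale, 0 if it cannot, and -1 if the locale
   itself is unavailable. Note that it changes the program's locale. */
int encodable_in_locale(const char *locale_name, wchar_t wc)
{
    char buf[MB_LEN_MAX];

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;
    wctomb(NULL, 0);                    /* reset any conversion state */
    return wctomb(buf, wc) > 0 ? 1 : 0;
}
```

Here encodable_in_locale("C", L'A') gives 1, while encodable_in_locale("C", 0x20AC) gives 0 because the euro sign has no representation in the C locale. When such a conversion fails during wide output, the character cannot be written as intended.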

would using C17's uchar.h be simpler?

Use wchar_t. It depends on your specific application, but in the usual cases, trust the C standard, the C implementation, and the operating system, and use wchar_t as the "ultimate" character type.