I am building a string library that supports both ASCII and UTF-8.
I created two typedefs, t_ascii and t_utf8. ASCII is safe to read as UTF-8, but UTF-8 is not safe to read as ASCII.
Is there any way to get a warning when implicitly converting from t_utf8 to t_ascii, but not when implicitly converting from t_ascii to t_utf8?
Ideally, I would want these warnings (and only these warnings) to be issued:
#include <stdint.h>

typedef char t_ascii;
typedef uint_least8_t t_utf8;

int main(void)
{
    t_ascii const* asciistr = "Hello world"; // Ok
    t_utf8 const* utf8str = "你好世界"; // Ok
    asciistr = utf8str; // Warning: utf8 to ascii is not safe
    utf8str = asciistr; // Ok: ascii to utf8 is safe

    t_ascii asciichar = 'A';
    t_utf8 utf8char = 'B';
    asciichar = utf8char; // Warning: utf8 to ascii is not safe
    utf8char = asciichar; // Ok: ascii to utf8 is safe
}
Currently, when building with -Wall (and even with -funsigned-char), I get these warnings:
gcc main.c -Wall -Wextra
main.c: In function ‘main’:
main.c:10:35: warning: pointer targets in initialization of ‘const t_utf8 *’ {aka ‘const unsigned char *’} from ‘char *’ differ in signedness [-Wpointer-sign]
10 | t_utf8 const* utf8str = "你好世界"; // Ok
| ^~~~~~~~~~
main.c:12:18: warning: pointer targets in assignment from ‘const t_utf8 *’ {aka ‘const unsigned char *’} to ‘const t_ascii *’ {aka ‘const char *’} differ in signedness [-Wpointer-sign]
12 | asciistr = utf8str; // Warning: utf8 to ascii is not safe
| ^
main.c:16:17: warning: pointer targets in assignment from ‘const t_ascii *’ {aka ‘const char *’} to ‘const t_utf8 *’ {aka ‘const unsigned char *’} differ in signedness [-Wpointer-sign]
16 | utf8str = asciistr; // Ok: ascii to utf8 is safe
| ^
CodePudding user response:
Compile with -Wall. Always compile with -Wall.
<user>@squall:~/src/p1$ gcc -Wall -c test2.c
test2.c: In function ‘main’:
test2.c:9:31: warning: pointer targets in initialization of ‘const t_utf8 *’ {aka ‘const signed char *’} from ‘char *’ differ in signedness [-Wpointer-sign]
9 | t_utf8 const* utf8str = "你好世界";
| ^~~~~~~~~~~~~~
test2.c:11:13: warning: pointer targets in assignment from ‘const t_ascii *’ {aka ‘const char *’} to ‘const t_utf8 *’ {aka ‘const signed char *’} differ in signedness [-Wpointer-sign]
11 | utf8str = asciistr; // Ok: ascii to utf8 is safe
| ^
test2.c:12:14: warning: pointer targets in assignment from ‘const t_utf8 *’ {aka ‘const signed char *’} to ‘const t_ascii *’ {aka ‘const char *’} differ in signedness [-Wpointer-sign]
12 | asciistr = utf8str; // Should issue warning: utf8 to ascii is not safe
| ^
You want it to be safe to convert from t_ascii to t_utf8, but it simply is not: the signedness differs.
The warning is not about the fact that valid utf8 is sometimes not valid ASCII - the compiler knows nothing about that. The warning is about the sign.
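If you want plain char to be unsigned, compile with -funsigned-char. Be aware, though, that char, signed char and unsigned char remain three distinct types in C, so the flag does not by itself make the pointer types compatible; the question reports that the warnings persist with it.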
(By the way, if you think that the type int_least8_t will be able to hold a multibyte character, that is, the complete UTF-8 encoding of a code point: it will not. Every int_least8_t, and consequently every t_utf8, in a single compilation unit has the exact same size.)
CodePudding user response:
Simply compile it with a standard-conforming C compiler; see "What compiler options are recommended for beginners learning C?"
Result:
<source>: In function 'main':
<source>:9:31: error: pointer targets in initialization of 'const t_utf8 *' {aka 'const unsigned char *'} from 'char *' differ in signedness [-Wpointer-sign]
9 | t_utf8 const* utf8str = "你好世界"; // Ok
| ^~~~~~~~~~
<source>:11:14: error: pointer targets in assignment from 'const t_utf8 *' {aka 'const unsigned char *'} to 'const t_ascii *' {aka 'const char *'} differ in signedness [-Wpointer-sign]
11 | asciistr = utf8str; // Warning: utf8 to ascii is not safe
| ^
<source>:12:13: error: pointer targets in assignment from 'const t_ascii *' {aka 'const char *'} to 'const t_utf8 *' {aka 'const unsigned char *'} differ in signedness [-Wpointer-sign]
12 | utf8str = asciistr; // Ok: ascii to utf8 is safe
| ^
"but not when implicitly casting t_ascii to t_utf8?"
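No, you can't have that in standard C: assigning between pointers to incompatible types without a cast is a constraint violation in either direction. You can silence the compiler with an explicit cast. Where uint_least8_t is unsigned char (the usual case), reading the string through the cast pointer is well defined, since character types may access any object; on an implementation where it is some other type, doing so could invoke undefined behavior.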
Apart from that, you could use C11 _Generic to find out which type uint_least8_t boils down to:
#include <stdint.h>
#include <stdio.h>

/* Print which of the three character types obj's type is. */
#define what_type(obj) printf("%s is same as %s\n", #obj, \
    _Generic ((obj),                                      \
        char:          "char",                            \
        unsigned char: "unsigned char",                   \
        signed char:   "signed char"))

int main (void)
{
    typedef char t_ascii;
    typedef uint_least8_t t_utf8;
    t_ascii ascii;
    t_utf8 utf8;
    what_type(ascii);
    what_type(utf8);
}
Output on gcc x86 Linux:
ascii is same as char
utf8 is same as unsigned char