I am trying to read non-printable characters from a text file, print out the characters' ASCII code, and finally write these non-printable characters into an output file.
However, I have noticed that for every non-printable character I read, there is always an extra character in front of the one I actually want.
For example, the character I want to read is "§". When I print out its ASCII code, instead of printing just "167", my program prints "194 167".
I looked it up in the debugger and saw "Â§" in the char array, but I don't have "Â" anywhere in my input file. [screenshot of debugger]
And after I write the non-printable character into my output file, I have noticed that it comes out as "Â§" there too, not just "§".
There is an extra character being attached to every single non-printable character I read. Why is this happening? How do I get rid of it?
Thanks!
Code as follows:
case 1:
    mode = 1;
    FILE *fp;
    fp = fopen("input2.txt", "r");
    int charCount = 0;
    while (!feof(fp)) {
        original_message[charCount] = fgetc(fp);
        charCount++;
    }
    original_message[charCount - 1] = '\0';
    fclose(fp);
    k = strlen(original_message); // split the original message into k input symbols
    printf("k: \n%lld\n", k);
    printf("ASCII code:\n");
    for (int i = 0; i < k; i++)
    {
        ASCII = original_message[i];
        printf("%d ", ASCII);
    }
CodePudding user response:
C's getchar (and getc and fgetc) functions are designed to read individual bytes. They won't directly handle "wide" or "multibyte" characters such as occur in the UTF-8 encoding of Unicode.
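To see concretely why you're getting "194 167": the UTF-8 encoding of "§" is the two-byte sequence 0xC2 0xA7, which is 194 167 in decimal, so fgetc hands you those two bytes one at a time. Here is a minimal standalone demonstration (not part of your program):

#include <stdio.h>

int main(void)
{
    /* "\xC2\xA7" is the UTF-8 encoding of "§" (U+00A7).
       Stepping through it a byte at a time, as fgetc does,
       yields two values, not one. */
    const char *s = "\xC2\xA7";
    for (const char *p = s; *p != '\0'; p++)
        printf("%d ", (unsigned char)*p);  /* prints: 194 167 */
    printf("\n");
    return 0;
}

(And 0xC2 rendered as Latin-1 is "Â", which is where the stray character in your debugger and your output file comes from.)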
But there are other functions which are specifically designed to deal with those extended characters. In particular, if you wish, you can replace your call to fgetc(fp) with fgetwc(fp), and then you should be able to start reading characters like § as themselves.
You will have to #include <wchar.h> to get the prototype for fgetwc. And you may have to add the call

setlocale(LC_CTYPE, "");

at the top of your program (its prototype is in <locale.h>) to synchronize your program's character set "locale" with that of your operating system.
Not your original code, but I wrote this little program:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    wint_t c;                    /* wint_t, not wchar_t, so it can hold WEOF */
    setlocale(LC_CTYPE, "");
    while ((c = fgetwc(stdin)) != WEOF)  /* wide streams signal end of file with WEOF, not EOF */
        printf("%lc %d\n", c, (int)c);
    return 0;
}
When I type "A", it prints A 65
.
When I type "§", it prints § 167
.
When I type "Ƶ", it prints Ƶ 437
.
When I type "†", it prints † 8224
.
Now, with all that said, reading wide characters using functions like fgetwc isn't the only, or necessarily even the best, way of dealing with extended characters. In your case, it carries a number of additional consequences:
- Your original_message array is going to have to be an array of wchar_t, not an array of char.
- Your original_message array isn't going to be an ordinary C string; it's a "wide character string". So you can't call strlen on it; you're going to have to call wcslen.
- Similarly, you can't print it using %s, or its characters using %c. You'll have to remember to use %ls or %lc. (A sketch putting all three changes together follows this list.)
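Here is a rough sketch of what your read loop might look like after those changes. The declaration of original_message isn't in the code you posted, so the MAX_CHARS buffer size and the bounds check are my own assumptions, and I've tested fgetwc's return value against WEOF rather than looping on feof:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

#define MAX_CHARS 1024   /* hypothetical size; use whatever your array really is */

int main(void)
{
    wchar_t original_message[MAX_CHARS];   /* wchar_t, not char */
    wint_t c;
    size_t charCount = 0;

    setlocale(LC_CTYPE, "");               /* pick up the system locale */

    FILE *fp = fopen("input2.txt", "r");
    if (fp == NULL)
        return 1;

    /* Read wide characters; stop on WEOF instead of testing feof(). */
    while (charCount < MAX_CHARS - 1 && (c = fgetwc(fp)) != WEOF)
        original_message[charCount++] = (wchar_t)c;
    original_message[charCount] = L'\0';   /* wide-string terminator */
    fclose(fp);

    printf("k: %zu\n", wcslen(original_message));   /* wcslen, not strlen */
    for (size_t i = 0; i < charCount; i++)
        printf("%lc %d\n", (wint_t)original_message[i], (int)original_message[i]);
    return 0;
}

Checking fgetwc's return value directly also fixes a subtle off-by-one in your original loop: feof only becomes true after a read has already failed, which is why you had to overwrite the last stored character with '\0'.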
So although you can convert your entire program to use "wide" strings and "w" functions everywhere, it's a ton of work. In many cases, and despite anomalies like the one you asked about, it's much easier to use UTF-8 everywhere, since it tends to Just Work. In particular, as long as you don't have to pick a string apart and work with its individual characters, or compute the on-screen display length of a string (in "characters") using strlen, you can just use plain C strings everywhere, and let the magic of UTF-8 sequences take care of any non-ASCII characters your users happen to enter.
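For example, even the "count the characters" case has a simple plain-char solution: in UTF-8, every character is one lead byte plus zero or more continuation bytes of the form 10xxxxxx, so you can count characters by counting only the bytes that are not continuation bytes. This helper is my own illustration (not a standard library function) and assumes its input is valid UTF-8:

#include <stdio.h>
#include <string.h>

/* Count code points in a valid UTF-8 string by skipping
   continuation bytes (those matching the bit pattern 10xxxxxx). */
static size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

int main(void)
{
    const char *msg = "A\xC2\xA7";             /* "A§" in UTF-8 */
    printf("bytes: %zu\n", strlen(msg));       /* prints 3 */
    printf("chars: %zu\n", utf8_count(msg));   /* prints 2 */
    return 0;
}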