I created a function in C++ to filter some characters, however it doesn't work with characters between 128 and 256 in ASCII.
string parseString(string str) {
string result = "";
string temp = "";
for (int i = 0; i < str.size(); i++) {
if ((str[i] >= 'a' && str[i] <= 'z') || (str[i] >= 'A' && str[i] <= 'Z') || (str[i] >= 'á' && str[i] <= 'û') || (str[i] >= 160 && str[i] <= 165) || (str[i] >= 198 && str[i] <= 199) || str[i] == 39) {
result += tolower(str[i]);
}
}
return result;
}
Some examples:
parseString('word@#$%¨$%#$@#%$'); // returns word
// however
parseString('Fréderic'); // returns Frederic; the function doesn't filter character 130
How can I use ASCII 256 in C++?
CodePudding user response:
The type char is used as the element type of std::string. Whether char is signed depends on the environment. You should cast the value to unsigned char before comparison.
string parseString(string str) {
string result = "";
string temp = "";
for (int i = 0; i < str.size(); i++) {
unsigned char c = static_cast<unsigned char>(str[i]);
if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= 'á' && c <= 'û') || (c >= 160 && c <= 165) || (c >= 198 && c <= 199) || c == 39) {
result += tolower(c);
}
}
return result;
}
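To see why the cast matters, here is a minimal standalone sketch (hypothetical values, assuming a platform where char is signed):
#include <iostream>

int main() {
    char c = '\xe9';  // the byte 0xe9, é in ISO/IEC 8859-1
    std::cout << (c >= 160) << '\n';                              // prints 0: c is -23, the range check fails
    std::cout << (static_cast<unsigned char>(c) >= 160) << '\n';  // prints 1: 233 is inside the range
}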
Then, single quotation marks '' are for character constants in C++. You should use double quotation marks "" to express strings.
parseString("word@#$%¨$%#$@#%$");
parseString("Fréderic");
Even after this change, your code (especially the part (c >= 'á' && c <= 'û')) may not work if you are using a character set that uses multiple bytes to express á and û (for example, UTF-8).
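You can check whether that is happening with a small sketch (assuming a UTF-8 encoded source file; the exact values are implementation-defined):
#include <iostream>
#include <string>

int main() {
    std::cout << sizeof('á') << '\n';             // typically 4: 'á' becomes an int-typed multicharacter literal
    std::cout << std::string("á").size() << '\n'; // prints 2: á occupies two bytes in UTF-8
}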
CodePudding user response:
The problem of comparing non-ASCII characters is challenging. You have to account for the following issues:
There are different encodings (ASCII is not one of them) that can encode the character 'é'. Possible are, for example:
- in ISO/IEC 8859-1 as 1 byte: 0xe9
- in Unicode UTF-16 as 2 bytes: 0x00 0xe9
- in Unicode UTF-8 as 2 bytes: 0xc3 0xa9
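A minimal sketch to make those byte sequences visible (assuming the compiler reads the source as UTF-8; note that since C++20 the u8 literal has type const char8_t* rather than const char*):
#include <iostream>
#include <string>

int main() {
    std::string utf8 = u8"é";    // two bytes in UTF-8
    std::u16string utf16 = u"é"; // one code unit in UTF-16
    for (unsigned char b : utf8)
        std::cout << std::hex << static_cast<int>(b) << ' ';  // prints: c3 a9
    std::cout << '\n';
    for (char16_t u : utf16)
        std::cout << std::hex << static_cast<int>(u) << '\n'; // prints: e9
}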
In Unicode, which is most likely what you are using, there is even more than one possible code point sequence for é:
- From the Latin-1 Supplement block: U+00E9 (0x00 0xe9 in UTF-16 or 0xc3 0xa9 in UTF-8)
- A normal e combined with a combining diacritical mark: U+0065 U+0301 (0x00 0x65 0x03 0x01 in UTF-16 or 0x65 0xcc 0x81 in UTF-8)
The first problem is easy to take care of. You need to find out what character encoding your text editor is using to write the é to the source file. And you have to adjust your algorithm to work with multibyte characters. In UTF-8, for example, each multibyte character starts with a 0b11xxxxxx lead byte followed by one or more 0b10xxxxxx continuation bytes; single-byte characters start with a 0 bit (0b0xxxxxxx).
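As a sketch of that byte pattern (assuming the input really is UTF-8, with the bytes of é spelled out explicitly so the example does not depend on the editor's encoding):
#include <iostream>
#include <string>

int main() {
    std::string s = "Fr\xc3\xa9" "deric"; // "Fréderic" in UTF-8; split so the hex escape doesn't swallow the 'd'
    for (unsigned char b : s) {
        if ((b & 0x80) == 0)
            std::cout << static_cast<char>(b) << ": single-byte character\n";
        else if ((b & 0xc0) == 0xc0)
            std::cout << "0x" << std::hex << static_cast<int>(b) << ": lead byte\n";
        else
            std::cout << "0x" << std::hex << static_cast<int>(b) << ": continuation byte\n";
    }
}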
The second problem is only a problem if you work with user input. Unicode defines a procedure known as normalization that transforms equivalent code point sequences (like both ways to represent 'é' above) into a canonical form that can be used for comparison.
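As an illustration, here is a sketch using the ICU library (an assumption on my part; any library with Unicode normalization would do, and the required build flags such as -licuuc depend on your setup):
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return 1;
    icu::UnicodeString composed   = icu::UnicodeString::fromUTF8("\xc3\xa9");  // U+00E9
    icu::UnicodeString decomposed = icu::UnicodeString::fromUTF8("e\xcc\x81"); // U+0065 U+0301
    std::cout << (composed == decomposed) << '\n';                          // prints 0: raw sequences differ
    std::cout << (nfc->normalize(decomposed, status) == composed) << '\n';  // prints 1: equal after NFC
}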
If you now think (like I would) that this is way too much complicated stuff, I would recommend using a string library that is able to deal with these kinds of things properly. A starting point could be the answers to this question.
CodePudding user response:
There is no é in ASCII. ASCII stops at 127.
But in Unicode é is code point 233 (U+00E9). That might explain why your filter fails. But to really understand what is happening you need to know what encoding your compiler is using.
Try this code
cout << (int)'é' << '\n';
and see what number it prints.
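If it prints a surprisingly large number (plus a multicharacter-literal warning), your editor probably saved é as several bytes. This sketch shows the individual bytes:
#include <iostream>
#include <string>

int main() {
    std::string s = "é"; // encoded however your editor saved this file
    std::cout << s.size() << " byte(s):";
    for (unsigned char b : s)
        std::cout << ' ' << static_cast<int>(b);
    std::cout << '\n';   // with UTF-8 this prints: 2 byte(s): 195 169
}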