I created a function in C++ to filter some characters, however it doesn't work with characters between 128 and 256 in ASCII.
string parseString(string str) {
string result = "";
string temp = "";
for (int i = 0; i < str.size(); i++) {
if ((str[i] >= 'a' && str[i] <= 'z') || (str[i] >= 'A' && str[i] <= 'Z') || (str[i] >= 'á' && str[i] <= 'û') || (str[i] >= 160 && str[i] <= 165) || (str[i] >= 198 && str[i] <= 199) || str[i] == 39) {
result += tolower(str[i]);
}
}
return result;
}
Some examples:
parseString('word@#$%¨$%#$@#%$'); // returns word
// however
parseString('Fréderic'); // returns Frederic; the function doesn't filter character 130
How can I use ASCII 256 in C++?
CodePudding user response:
The type char is used as the element type of std::string. Whether char is signed depends on the environment. You should cast the value to unsigned char before comparison.
string parseString(string str) {
string result = "";
string temp = "";
for (int i = 0; i < str.size(); i++) {
unsigned char c = static_cast<unsigned char>(str[i]);
if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= 'á' && c <= 'û') || (c >= 160 && c <= 165) || (c >= 198 && c <= 199) || c == 39) {
result += tolower(c);
}
}
return result;
}
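To see why the cast matters, here is a minimal standalone sketch (hypothetical values, assuming a platform where char is signed):
#include <iostream>

int main() {
    char c = '\xe9';  // the byte 0xe9, é in ISO/IEC 8859-1
    std::cout << (c >= 160) << '\n';                              // prints 0: c is -23, the range check fails
    std::cout << (static_cast<unsigned char>(c) >= 160) << '\n';  // prints 1: 233 is inside the range
}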
Then, single quotation marks '' are for character constants in C++. You should use double quotation marks "" to express strings.
parseString("word@#$%¨$%#$@#%$");
parseString("Fréderic");
Even after this change, your code (especially the part (c >= 'á' && c <= 'û')) may not work if you are using a character set that uses multiple bytes to express á and û (for example, UTF-8).
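You can check whether that is happening with a small sketch (assuming a UTF-8 encoded source file; the exact values are implementation-defined):
#include <iostream>
#include <string>

int main() {
    std::cout << sizeof('á') << '\n';             // typically 4: 'á' becomes an int-typed multicharacter literal
    std::cout << std::string("á").size() << '\n'; // prints 2: á occupies two bytes in UTF-8
}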
CodePudding user response:
The problem of comparing non-ASCII characters is challenging. You have to account for the following issues:
There are different encodings (ASCII is not one of them) that can encode the character 'é'. Possible are, for example:
- in ISO/IEC 8859-1 as 1 byte: 0xe9
- in Unicode UTF-16 as 2 bytes: 0x00 0xe9
- in Unicode UTF-8 as 2 bytes: 0xc3 0xa9
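A minimal sketch to make those byte sequences visible (assuming the compiler reads the source as UTF-8; note that since C++20 the u8 literal has type const char8_t* rather than const char*):
#include <iostream>
#include <string>

int main() {
    std::string utf8 = u8"é";    // two bytes in UTF-8
    std::u16string utf16 = u"é"; // one code unit in UTF-16
    for (unsigned char b : utf8)
        std::cout << std::hex << static_cast<int>(b) << ' ';  // prints: c3 a9
    std::cout << '\n';
    for (char16_t u : utf16)
        std::cout << std::hex << static_cast<int>(u) << '\n'; // prints: e9
}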
In Unicode, which is most likely what you are using, there is even more than one possible code point sequence for é:
- From the Latin-1 Supplement block: U+00E9 (0x00 0xe9 in UTF-16 or 0xc3 0xa9 in UTF-8)
- A normal e combined with a combining diacritical mark: U+0065 U+0301 (0x00 0x65 0x03 0x01 in UTF-16 or 0x65 0xcc 0x81 in UTF-8)
The first problem is easy to take care of. You need to find out what character encoding your text editor is using to write the é to the source file. And you have to adjust your algorithm to work with multibyte characters. In UTF-8, for example, each multibyte character starts with a 0b11xxxxxx lead byte followed by one or more 0b10xxxxxx continuation bytes; single-byte characters start with a 0 bit (0b0xxxxxxx).
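As a sketch of that byte pattern (assuming the input really is UTF-8, with the bytes of é spelled out explicitly so the example does not depend on the editor's encoding):
#include <iostream>
#include <string>

int main() {
    std::string s = "Fr\xc3\xa9" "deric"; // "Fréderic" in UTF-8; split so the hex escape doesn't swallow the 'd'
    for (unsigned char b : s) {
        if ((b & 0x80) == 0)
            std::cout << static_cast<char>(b) << ": single-byte character\n";
        else if ((b & 0xc0) == 0xc0)
            std::cout << "0x" << std::hex << static_cast<int>(b) << ": lead byte\n";
        else
            std::cout << "0x" << std::hex << static_cast<int>(b) << ": continuation byte\n";
    }
}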
The second problem is only a problem if you work with user input. Unicode defines a procedure known as normalization that transforms equivalent code point sequences (like both ways to represent 'é' above) into a canonical form that can be used for comparison.
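As an illustration, here is a sketch using the ICU library (an assumption on my part; any library with Unicode normalization would do, and the required build flags such as -licuuc depend on your setup):
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status)) return 1;
    icu::UnicodeString composed   = icu::UnicodeString::fromUTF8("\xc3\xa9");  // U+00E9
    icu::UnicodeString decomposed = icu::UnicodeString::fromUTF8("e\xcc\x81"); // U+0065 U+0301
    std::cout << (composed == decomposed) << '\n';                          // prints 0: raw sequences differ
    std::cout << (nfc->normalize(decomposed, status) == composed) << '\n';  // prints 1: equal after NFC
}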
If you now think (like I would) that this is way too much complicated stuff, I would recommend using a string library that is able to deal with these kinds of things properly. A starting point could be the answers to this question.
CodePudding user response:
There is no é in ASCII. ASCII stops at 127.
But in Unicode é is code point 233 (U+00E9). That might explain why your filter fails. But to really understand what is happening you need to know what encoding your compiler is using.
Try this code
cout << (int)'é' << '\n';
and see what number it prints.
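If it prints a surprisingly large number (plus a multicharacter-literal warning), your editor probably saved é as several bytes. This sketch shows the individual bytes:
#include <iostream>
#include <string>

int main() {
    std::string s = "é"; // encoded however your editor saved this file
    std::cout << s.size() << " byte(s):";
    for (unsigned char b : s)
        std::cout << ' ' << static_cast<int>(b);
    std::cout << '\n';   // with UTF-8 this prints: 2 byte(s): 195 169
}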