I was searching for a way to break Unicode characters down into their hex codes for a project, and I came across this piece of code on Stack Overflow, but I can't understand how it works.
The original question:
void print_characters(char const* s)
{
    std::cout << std::showbase << std::hex;
    for (char const* pc = s; *pc; ++pc) {
        if (*pc & 0x80)
            std::cout << (*pc & 0xff);
        else
            std::cout << *pc;
        std::cout << ' ';
    }
    std::cout << std::endl;
}
and the function was called like this:
char const* test = "đ";
print_characters(test);
CodePudding user response:
This was a bit of a trick by the author of that answer.
The first thing you need to know is that UTF-8 is a superset of ASCII: every Unicode character at or below 0x7f is encoded by UTF-8 exactly as it is in ASCII. Characters above that are split into multiple bytes. For example, the Euro sign is Unicode codepoint U+20AC, and its UTF-8 encoding is three separate bytes: e2 82 ac. You will notice that all of these bytes have their high bit (0x80) set; this is a design choice of UTF-8.
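If it helps to see that concretely, here is a minimal sketch (not from the original answer, and it assumes the compiler uses UTF-8 for the execution character set) that prints the three bytes of the Euro sign and checks their high bits:
#include <iostream>

int main()
{
    // With a UTF-8 execution character set, "\u20ac" (the Euro sign)
    // is stored as the three bytes e2 82 ac.
    char const* euro = "\u20ac";
    std::cout << std::showbase << std::hex << std::boolalpha;
    for (char const* pc = euro; *pc; ++pc)
        std::cout << (*pc & 0xff)                      // the byte value, printed in hex
                  << " high bit set: "
                  << static_cast<bool>(*pc & 0x80)     // true for every byte of a multi-byte character
                  << '\n';
}
// prints:
// 0xe2 high bit set: true
// 0x82 high bit set: true
// 0xac high bit set: true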
What does this all have to do with the question you linked? Well, the expression *pc & 0xff is promoted to int instead of char, and the implementation of operator<<(ostream&, int) will print the value as a hex number instead of a regular character. This makes the bytes >= 0x80 stand out clearly, as the author intended.
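A quick way to see the two overloads in action (this is just a sketch, not code from the answer):
#include <iostream>

int main()
{
    char c = '\xe2';                 // first byte of the Euro sign's UTF-8 encoding
    std::cout << std::showbase << std::hex;
    std::cout << c << '\n';          // char overload: writes the raw byte, ignoring std::hex
                                     // (likely shows up as garbage in a UTF-8 terminal)
    std::cout << (c & 0xff) << '\n'; // int after promotion: prints 0xe2
}
Note that the & 0xff does double duty: it forces the promotion to int, and it masks off the sign extension you would otherwise get on platforms where char is signed (a plain cast to int would print 0xffffffe2 for this byte).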
CodePudding user response:
If you break the expression down into its individual parts, it might be easier to understand.
//setup the parts
char value = 't';
char* pc = &value;
// 0x80 is the same as binary 1000 0000
unsigned mask = 0b10000000;
// break down "*pc & 0x80"
// dereference the pointer, read the value behind it
char dereferenced = *pc;
// does a bit-wise and with the mask to check if that one bit is 1 in the dereferenced value
bool has_bit_set = dereferenced & mask;
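Putting those pieces back together, the loop body of the original function does roughly this for a single character (again just a sketch, not the answer's code):
#include <iostream>

int main()
{
    char value = 't';              // ASCII 0x74, high bit clear
    char const* pc = &value;
    unsigned mask = 0b10000000;    // same as 0x80

    bool has_bit_set = *pc & mask; // false for 't'

    std::cout << std::showbase << std::hex;
    if (has_bit_set)
        std::cout << (*pc & 0xff); // part of a multi-byte UTF-8 sequence: print as a hex number
    else
        std::cout << *pc;          // plain ASCII: print the character itself
    std::cout << '\n';             // prints: t
}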