I was searching for a way to break Unicode characters down into their hex codes for a project, and I came across this piece of code on Stack Overflow, but I can't understand how it works.
The original question:
void print_characters(char const* s)
{
    std::cout << std::showbase << std::hex;
    for (char const* pc = s; *pc; ++pc) {
        if (*pc & 0x80)
            std::cout << (*pc & 0xff);
        else
            std::cout << *pc;
        std::cout << ' ';
    }
    std::cout << std::endl;
}
and the function was called like this:
char const* test = "đ";
print_characters(test);
CodePudding user response:
This was a bit of a trick by the author of that answer.
The first thing you need to know is that UTF-8 is a superset of ASCII: every Unicode character at or below 0x7f is encoded by UTF-8 exactly as it is in ASCII. Characters above that are split into multiple bytes. For example, the Euro sign is Unicode codepoint U+20AC, and its UTF-8 encoding is three separate bytes: e2 82 ac. You will notice that all of these bytes have their high bit (0x80) set; this is a design choice of UTF-8.
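If it helps to see that concretely, here is a minimal sketch (not from the original answer, and it assumes the compiler uses UTF-8 for the execution character set) that prints the three bytes of the Euro sign and checks their high bits:
#include <iostream>

int main()
{
    // With a UTF-8 execution character set, "\u20ac" (the Euro sign)
    // is stored as the three bytes e2 82 ac.
    char const* euro = "\u20ac";
    std::cout << std::showbase << std::hex << std::boolalpha;
    for (char const* pc = euro; *pc; ++pc)
        std::cout << (*pc & 0xff)                      // the byte value, printed in hex
                  << " high bit set: "
                  << static_cast<bool>(*pc & 0x80)     // true for every byte of a multi-byte character
                  << '\n';
}
// prints:
// 0xe2 high bit set: true
// 0x82 high bit set: true
// 0xac high bit set: true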
What does this all have to do with the question you linked? Well, the expression *pc & 0xff is promoted to int instead of char, and the implementation of operator<<(ostream&, int) will print the value as a hex number instead of a regular character. This makes the bytes >= 0x80 stand out clearly, as the author intended.
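A quick way to see the two overloads in action (this is just a sketch, not code from the answer):
#include <iostream>

int main()
{
    char c = '\xe2';                 // first byte of the Euro sign's UTF-8 encoding
    std::cout << std::showbase << std::hex;
    std::cout << c << '\n';          // char overload: writes the raw byte, ignoring std::hex
                                     // (likely shows up as garbage in a UTF-8 terminal)
    std::cout << (c & 0xff) << '\n'; // int after promotion: prints 0xe2
}
Note that the & 0xff does double duty: it forces the promotion to int, and it masks off the sign extension you would otherwise get on platforms where char is signed (a plain cast to int would print 0xffffffe2 for this byte).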
CodePudding user response:
If you break the expression down into its individual parts, it might be easier to understand.
//setup the parts
char value = 't';
char* pc = &value;
// 0x80 is the same as binary 1000 0000
unsigned mask = 0b10000000;
// break down "*pc & 0x80"
// dereference the pointer, read the value behind it
char dereferenced = *pc;
// does a bit-wise and with the mask to check if that one bit is 1 in the dereferenced value
bool has_bit_set = dereferenced & mask;
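Putting those pieces back together, the loop body of the original function does roughly this for a single character (again just a sketch, not the answer's code):
#include <iostream>

int main()
{
    char value = 't';              // ASCII 0x74, high bit clear
    char const* pc = &value;
    unsigned mask = 0b10000000;    // same as 0x80

    bool has_bit_set = *pc & mask; // false for 't'

    std::cout << std::showbase << std::hex;
    if (has_bit_set)
        std::cout << (*pc & 0xff); // part of a multi-byte UTF-8 sequence: print as a hex number
    else
        std::cout << *pc;          // plain ASCII: print the character itself
    std::cout << '\n';             // prints: t
}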