Why does the memcmp function in C cast to unsigned char instead of just char?

Time:10-28

int memcmp(const void *s1, const void *s2, size_t n);

Most implementations of the memcmp function in C cast the arguments s1 and s2 to pointers to unsigned char. I am aware that the two arguments are cast to a character type so that memcmp compares 1 byte at a time, but why use unsigned char instead of just char?

I tried to understand the function by reading its source code but I couldn't understand that specific part.

CodePudding user response:

When you look at a code page table for strcmp, or just plain bytes for memcmp, you would expect 0x80-0xff to compare higher than 0x00-0x7f, right?

This only works if the comparison is done unsigned. Otherwise, 0x80-0xff would (on two's-complement architectures) be negative, i.e. lower than 0x00-0x7f.

The signedness of plain char is implementation-defined. So in order to give the expected results on all platforms, the comparison cannot be done in plain char -- you need to work in unsigned char.
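To make the difference concrete, here is a small sketch that compares the same two bytes once as signed char and once as unsigned char. The helper names cmp_as_signed and cmp_as_unsigned are made up for this illustration and are not part of any library.

```c
/* Hypothetical helpers for illustration only.
 * Each returns <0, 0 or >0, like memcmp does for a single byte. */

/* Reads both bytes as signed char: on two's-complement targets,
 * 0x80..0xff become negative values. */
static int cmp_as_signed(const void *a, const void *b)
{
    signed char x = *(const signed char *)a;
    signed char y = *(const signed char *)b;
    return (x > y) - (x < y);
}

/* Reads both bytes as unsigned char: values are always 0..255. */
static int cmp_as_unsigned(const void *a, const void *b)
{
    unsigned char x = *(const unsigned char *)a;
    unsigned char y = *(const unsigned char *)b;
    return (x > y) - (x < y);
}
```

Given one byte holding 0x80 and another holding 0x7f, cmp_as_unsigned orders 0x80 above 0x7f, while cmp_as_signed on a two's-complement machine orders it below, because 0x80 is read as -128.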

CodePudding user response:

The general rule of thumb when dealing with raw binary data is to always use unsigned types. This is mostly because various forms of bitwise operations come with various forms of poorly-defined behavior when handed signed or negative operands.
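One common way this goes wrong is sign extension when bytes are combined with bitwise operators. The sketch below is only an illustration; the function names combine_signed and combine_unsigned are invented here, and it assumes unsigned int is at least 16 bits wide.

```c
#include <stdint.h>

/* Hypothetical example: build a 16-bit value from two bytes.
 * With signed bytes, a negative low byte is sign-extended when
 * converted, and the extra 1-bits overwrite the high byte. */
static uint16_t combine_signed(signed char hi, signed char lo)
{
    return (uint16_t)(((unsigned)hi << 8) | (unsigned)lo);
}

/* With unsigned bytes, each operand stays in 0..255, so the two
 * halves never overlap. */
static uint16_t combine_unsigned(unsigned char hi, unsigned char lo)
{
    return (uint16_t)(((unsigned)hi << 8) | lo);
}
```

For the byte pair 0x12, 0x80 the unsigned version yields 0x1280. If the low byte is held in a signed char as -128, the signed version yields 0xFF80 instead.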

char is the only type in C that can be either signed or unsigned depending on the compiler, so it is also a problematic type for that reason alone. A good rule of thumb (and also a MISRA C rule) is to never use char for any other purpose than string handling. Use unsigned char or uint8_t when dealing with raw binary.

For plain equality comparisons, the signedness is unlikely to matter. But you can turn the question around: why would we use a signed type to compare raw binary data? There is no known sign in the data. We only end up with the sign bit set if the conversion misinterprets the raw binary as a negative integer value. That is neither useful nor sensible behavior, so why allow it, even where it happens to be harmless?

For the specific case of memcmp, there's also a special rule for how to calculate the result, C17 7.24.4:

The sign of a nonzero value returned by the comparison functions memcmp, strcmp, and strncmp is determined by the sign of the difference between the values of the first pair of characters (both interpreted as unsigned char) that differ in the objects being compared.

So no matter how we implement memcmp, the returned result in case of differences must be calculated using arithmetic on unsigned char.
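A minimal byte-at-a-time sketch that follows this rule could look like the code below. The name my_memcmp is hypothetical, and this is not the source of any particular C library, just the simplest form that satisfies the quoted requirement.

```c
#include <stddef.h>

/* Hypothetical, simplified memcmp: compares n bytes and returns
 * <0, 0 or >0 according to the first pair of differing bytes,
 * both interpreted as unsigned char. */
int my_memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1;
    const unsigned char *p2 = s2;

    for (size_t i = 0; i < n; i++) {
        if (p1[i] != p2[i]) {
            /* Only the sign of the result is specified, so -1/1 suffices. */
            return p1[i] < p2[i] ? -1 : 1;
        }
    }
    return 0; /* all n bytes were equal (or n was 0) */
}
```

With plain char in place of unsigned char, the result for bytes in the range 0x80-0xff would depend on whether the compiler treats char as signed.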

As for how most library-quality memcmp implementations actually work, they are likely to use unsigned char only for the parts dealing with misaligned offsets of the data. The raw comparisons on aligned data are likely done with a large integer type suitable for the specific CPU, such as uint32_t or uint64_t.
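The wide-comparison idea can be sketched roughly as follows. This is an assumption-laden illustration, not real library code: wide_memcmp is a made-up name, and instead of handling misaligned leading bytes separately, it loads each word with memcpy so that alignment never becomes an issue. Real implementations are typically far more elaborate.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: skip over equal data 8 bytes at a time,
 * then settle the result byte by byte as unsigned char. */
int wide_memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1;
    const unsigned char *p2 = s2;

    /* Wide phase: only tests whole words for equality. */
    while (n >= sizeof(uint64_t)) {
        uint64_t w1, w2;
        memcpy(&w1, p1, sizeof w1); /* safe for any alignment */
        memcpy(&w2, p2, sizeof w2);
        if (w1 != w2)
            break; /* the first differing byte lies inside this word */
        p1 += sizeof w1;
        p2 += sizeof w2;
        n  -= sizeof w1;
    }

    /* Byte phase: handles the tail and locates the exact mismatch. */
    for (; n > 0; n--, p1++, p2++) {
        if (*p1 != *p2)
            return *p1 < *p2 ? -1 : 1;
    }
    return 0;
}
```

The words are only compared for equality, so byte order does not affect the wide phase. The ordering decision is always made on individual bytes read as unsigned char, which keeps the result consistent with the rule quoted from the standard.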
