Proper way to perform unsigned<->signed conversion

Time:03-22

Context

I have a char variable on which I need to apply a transformation (for example, add an offset). The result of the transformation may or may not overflow.
I don't really care about the actual value of the variable after the transformation is performed.
The only guarantee I want is that I can retrieve the original value if I perform the transformation again in the opposite direction (for example, subtract the offset).

Basically:

char a = 42;
a += 140; // overflows (undefined behaviour)
a -= 140; // must be equal to 42

Problem

I know that signed types overflow is undefined behaviour, but that is not the case for unsigned types. I have therefore chosen to add an intermediate step to the process to perform the conversion.

It would then become:

  1. char -> unsigned char conversion
  2. Apply the transformation (resp. the reversed transformation)
  3. unsigned char -> char conversion

This way, I have the guarantee that any potential overflow will only occur on an unsigned type.

Question

My question is: what is the proper way to perform such a conversion?

Three possibilities come to mind. I can use:

  • implicit conversion
  • static_cast
  • reinterpret_cast

Which one is valid (not undefined behaviour)? Which one should I use (correct behaviour)?

My guess is that I need to use reinterpret_cast since I don't care about the actual value; the only guarantee I want is that the value in memory remains the same (i.e. the bits don't change) so that the operation is reversible.

On the other hand, I'm not sure if the implicit conversion or the static_cast won't trigger undefined behaviour in the case where the value is not representable in the destination type (out of range).

I couldn't find anything explicitly stating whether it is or is not undefined behaviour. I only found this Microsoft documentation, where they did it with implicit conversions without any mention of undefined behaviour.


Here is an example, to illustrate:

char a = -4;                                             // out of unsigned char range
unsigned char b1 = a;                                    // (A)
unsigned char b2 = static_cast<unsigned char>(a);        // (B)
unsigned char b3 = reinterpret_cast<unsigned char&>(a);  // (C)

std::cout << std::boolalpha << (b1 == b2 && b2 == b3) << '\n';  // boolalpha stays set for later output

unsigned char c = 252;                                   // out of (signed) char range
char d1 = c;                                             // (A')
char d2 = static_cast<char>(c);                          // (B')
char d3 = reinterpret_cast<char&>(c);                    // (C')

std::cout << (d1 == d2 && d2 == d3) << '\n';

The output is:

true
true

Unless undefined behaviour is triggered, the three methods seem to work.

Are (A) and (B) (resp. (A') and (B')) undefined behaviour if the value is not representable in the destination type?

Is (C) (resp. (C')) well defined?

CodePudding user response:

I know that signed types overflow is undefined behaviour,

True, but does not apply here.

a += 140; is not signed integer overflow and not UB. It is like a = a + 140;, and a + 140 does not overflow when a is an 8-bit signed char or unsigned char, because a is promoted to int before the addition.

The issue is what happens when the sum a + 140 is out of char range and assigned to a char.

Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised. C17dr § 6.3.1.3 3

When char is signed and 8-bit, assigning a value outside the char range is implementation-defined behavior.

Usually the implementation-defined behavior is a wrap, which is fully defined, so a += 140; is fine as is.

Alternatively, the implementation-defined behavior might have been to cap the value to the char range when char is signed.

char a = 42;
a += 140;
// Might act as if
a = max(min(a + 140, CHAR_MAX), CHAR_MIN);
a = 127;

To avoid implementation-defined behavior, perform the + or - on a accessed as an unsigned char:

*((unsigned char *)&a) += small_offset;

Or just use unsigned char a and avoid all this. unsigned char is defined to wrap.

CodePudding user response:

For full portability, you do have a small problem insofar as (except for char [1]) signed data types have not been [2] required to have as many distinct values as their unsigned counterparts. Very few systems actually used sign-magnitude representation for integral types, but if you cannot rule them out, then simply doing the math in the unsigned counterpart does not actually guarantee round-tripping, even if you use numeric_limits<?>::min() to try to avoid conversion of unrepresentable values.

With that caveat out of the way, the direct answer to your question is that both implicit conversion and static_cast are correct (and equivalent) for converting a value between its signed and unsigned counterpart types. In the signed->unsigned direction, the behavior is well-defined by the Standard, while in the other direction the behavior is implementation-defined.


[1] char and signed char themselves are rescued from this possibility by their endorsement for access to the byte representation of any object, including unsigned objects, which are required not to have any missing values.

[2] Two's complement conversion behavior is required in the latest version of C++; see https://eel.is/c++draft/basic.fundamental#3
