I have observed that when a uint8_t buffer (not guaranteed to be null-terminated) is read into a stringstream with the << operator, as in ss << buff.data, and the contained std::string is returned to Python, Python throws an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte

But if I use ss.write(buff.data, buff.size) instead, the issue goes away.

I assume this happens because << causes a buffer overrun, so the data in ss may no longer be valid UTF-8, whereas write() is given an explicit size and so cannot pick up garbage data.

What surprises me is that ss.write(buff.data, buff.size + 1) always segfaults. So how can << overrun the buffer without crashing? Is there a fundamental difference in how the two work, such that one triggers a segfault on an illegal buffer access and the other does not? Or is << just getting lucky?
CodePudding user response:
uint8_t is an alias for unsigned char. When operator<< is given an unsigned char* pointer, it is treated as a null-terminated string, the same as a char* pointer. So if your data is not actually a null-terminated character string, writing it to the stream using operator<< is undefined behavior. The code may crash, or it may write garbage to the stream. There is no way to know.
write() doesn't care about a null terminator. It writes exactly as many bytes as you specify. That is why you don't have any trouble when using write() instead of operator<<.