How to reencode a UTF-16 byte array as UTF-8?


I have a UTF-16 byte array (&[u8]) and I want to decode and reencode it to UTF-8 in Rust.

In Python I can do this:

array.decode('UTF-16', errors='ignore').encode('UTF-8')

How can I do this in Rust?

CodePudding user response:

The problem here is that UTF-16 is defined in terms of 16-bit code units, and it does not specify how two 8-bit units (i.e. bytes) combine into one 16-bit unit; that depends on the byte order (endianness) of the data.

For that reason, I assume that your data uses network byte order (which is big endian). Note that this might be incorrect; x86 processors, for example, use little endian.

So the important first step is to convert the u8s into u16s. Here I iterate over the bytes in pairs, convert each pair via u16::from_be_bytes(), and collect the results into a vector.

Then, we can use String::from_utf16() or String::from_utf16_lossy() to convert the Vec<u16> into a String.

Strings are internally represented in Rust as UTF-8. So we can then directly pull out the UTF-8 representation via .as_bytes() or .into_bytes().

fn main() {
    let utf16_bytes: &[u8] = &[
        0x00, 0x48, 0x20, 0xAC, 0x00, 0x6c, 0x00, 0x6c, 0x00, 0x6f, 0x00, 0x20, 0x00, 0x77, 0x00,
        0x6f, 0x00, 0x72, 0x00, 0x6c, 0x00, 0x64, 0x00, 0x21,
    ];

    let utf16_packets = utf16_bytes
        .chunks(2)
        .map(|e| u16::from_be_bytes(e.try_into().unwrap()))
        .collect::<Vec<_>>();

    let s = String::from_utf16_lossy(&utf16_packets);
    println!("{:?}", s);

    let utf8_bytes = s.as_bytes();
    println!("{:?}", utf8_bytes);
}
"H€llo world!"
[72, 226, 130, 172, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33]

Note that we have to use .try_into().unwrap() in our map() closure. This is because .chunks() doesn't let the compiler know how big the chunks are that we iterate over, so each chunk is a slice of unknown length rather than a [u8; 2].

There is also the array_chunks() method, which does let the compiler know the chunk size and would make this code shorter and faster. Sadly, it is not yet stabilized and is only available on nightly right now.

#![feature(array_chunks)]

fn main() {
    let utf16_bytes: &[u8] = &[
        0x00, 0x48, 0x20, 0xAC, 0x00, 0x6c, 0x00, 0x6c, 0x00, 0x6f, 0x00, 0x20, 0x00, 0x77, 0x00,
        0x6f, 0x00, 0x72, 0x00, 0x6c, 0x00, 0x64, 0x00, 0x21,
    ];

    let utf16_packets = utf16_bytes
        .array_chunks::<2>()
        .cloned()
        .map(u16::from_be_bytes)
        .collect::<Vec<_>>();

    let s = String::from_utf16_lossy(&utf16_packets);
    println!("{:?}", s);

    let utf8_bytes = s.as_bytes();
    println!("{:?}", utf8_bytes);
}
> cargo +nightly run
"H€llo world!"
[72, 226, 130, 172, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33]

This assumes that our input splits cleanly into u16 units. In production code, a check for an odd number of bytes would be advisable.


To write this properly with error handling, I would extract the conversion into a function and propagate errors:

use thiserror::Error;

#[derive(Error, Debug)]
enum ParseUTF16Error {
    #[error("UTF-16 data needs to contain an even amount of bytes")]
    UnevenByteCount,
    #[error("The given data does not contain valid UTF16 data")]
    InvalidContent,
}

fn parse_utf16(data: &[u8]) -> Result<String, ParseUTF16Error> {
    let data16 = data
        .chunks(2)
        .map(|e| e.try_into().map(u16::from_be_bytes))
        .collect::<Result<Vec<_>, _>>()
        .map_err(|_| ParseUTF16Error::UnevenByteCount)?;

    String::from_utf16(&data16).map_err(|_| ParseUTF16Error::InvalidContent)
}

fn main() {
    let utf16_bytes: &[u8] = &[
        0x00, 0x48, 0x20, 0xAC, 0x00, 0x6c, 0x00, 0x6c, 0x00, 0x6f, 0x00, 0x20, 0x00, 0x77, 0x00,
        0x6f, 0x00, 0x72, 0x00, 0x6c, 0x00, 0x64, 0x00, 0x21,
    ];

    let s = parse_utf16(utf16_bytes).unwrap();
    println!("{:?}", s);

    let utf8_bytes = s.as_bytes();
    println!("{:?}", utf8_bytes);
}
"H€llo world!"
[72, 226, 130, 172, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33]