How to convert a string to file with UTF-32LE encoding in JS?

Time:11-10

Based on this thread, I tried to create a Blob with UTF-32 encoding and a BOM of FF FE 00 00 (the UTF-32LE representation), as follows:

var BOM = new Uint8Array([0xFF,0xFE,0x00,0x00]);
var b = new Blob([ BOM, "➀➁➂ Test" ]);
var url = URL.createObjectURL(b);
open(url);

But the file content comes out corrupted and appears as text in a different language. What is the correct way to convert a string to a file in UTF-32LE format?

Edit: I'm trying this in a browser environment.

CodePudding user response:

Note: I'm assuming you're doing this in a browser, since you used Blob and Node.js only recently got Blob support, and you referenced a question doing this in a browser.

You're just setting the BOM, you're not handling converting the data. As it says in MDN's documentation, any strings in the array will be encoded using UTF-8. So you have a UTF-32LE BOM followed by UTF-8 data.
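To make the mismatch concrete, here is a small sketch of the bytes the original attempt ends up with. It uses TextEncoder (which always produces UTF-8) as a stand-in for what Blob does with string parts; the `hex` helper and the variable names are made up for illustration.

```javascript
// Hypothetical helper for illustration: format bytes as uppercase hex pairs
const hex = (bytes) =>
    Array.from(bytes, (b) => b.toString(16).padStart(2, "0").toUpperCase()).join(" ");

// The UTF-32LE BOM from the question
const BOM = [0xFF, 0xFE, 0x00, 0x00];

// "➀➁➂ Test", encoded the way Blob encodes string parts (UTF-8)
const payload = new TextEncoder().encode("\u2780\u2781\u2782 Test");

// BOM followed by the UTF-8 payload, as in the original Blob
const mixed = Uint8Array.from([...BOM, ...payload]);

console.log(hex(payload.subarray(0, 3))); // E2 9E 80 (➀ as three UTF-8 bytes)
console.log(payload.length);              // 14
console.log(mixed.length % 4);            // 2, so this cannot be valid UTF-32
```

Valid UTF-32 uses exactly four bytes per code point, so a total length that isn't a multiple of four is one quick sign that the payload was never converted.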

Surprisingly (to me), the browser platform doesn't seem to have a general-purpose text encoder (TextEncoder just encodes UTF-8), but JavaScript strings provide a means of iterating through their code points (not just code units) and getting the actual Unicode code point value. (If those terms are unfamiliar, my blog post What is a string? may help.) So you can get that number and convert it into four little-endian bytes. DataView provides a convenient way to do that.
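A quick sketch of the difference between code units and code points, and of writing one code point as four little-endian bytes. The sample string and variable names are just for illustration.

```javascript
// "a" followed by U+1F600, which takes two UTF-16 code units (a surrogate pair)
const sample = "a\u{1F600}";

console.log(sample.length);             // 3: length counts UTF-16 code units
console.log(Array.from(sample).length); // 2: string iteration yields code points

// The code point values as numbers: 0x61 and 0x1F600
const codePoints = Array.from(sample, (ch) => ch.codePointAt(0));

// Write one code point as four little-endian bytes
const view = new DataView(new ArrayBuffer(4));
view.setUint32(0, codePoints[1], true); // true = little-endian
const leBytes = Array.from(new Uint8Array(view.buffer));
// leBytes is 00 F6 01 00 in hex: the least significant byte comes first
```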

Finally, you'll want to specify the charset in the blob's MIME type (the BOM itself isn't sufficient). I expected text/plain; charset=UTF-32LE to work, but it doesn't, at least not in Chromium browsers. There's probably some legacy reason for that. text/html works (on its own), but we may as well be specific and do text/html; charset=UTF-32LE.

See comments:

function getUTF32LEUrl(str) {
    // The UTF-32LE BOM
    const BOM = new Uint8Array([0xFF,0xFE,0x00,0x00]);
    // A byte array and DataView to use when converting 32-bit LE to bytes;
    // they share an underlying buffer
    const uint8 = new Uint8Array(4);
    const view = new DataView(uint8.buffer);
    // Convert the payload to UTF-32LE
    const utf32le = Array.from(str, (ch) => {
        // Get the code point
        const codePoint = ch.codePointAt(0);
        // Write it as a 32-bit LE value in the buffer
        view.setUint32(0, codePoint, true);
        // Read it as individual bytes and create a plain array of them
        return Array.from(uint8);
    }).flat(); // Flatten the array of arrays into one long byte sequence
    // Create the blob and URL
    const b = new Blob(
        [ BOM, new Uint8Array(utf32le)],
        { type: "text/html; charset=UTF-32LE"} // Set the MIME type
    );
    const url = URL.createObjectURL(b);
    return url;
}
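The function above builds a small array per character and then flattens the result. For longer strings, writing straight into one preallocated buffer may be preferable. The following is a sketch under that assumption, not part of the original answer; `encodeUTF32LE` and `decodeUTF32LE` are hypothetical names, and the decoder is only there to check the round trip.

```javascript
// Encode a string as UTF-32LE bytes, including the BOM
function encodeUTF32LE(str) {
    const codePoints = Array.from(str, (ch) => ch.codePointAt(0));
    const buffer = new ArrayBuffer(4 + codePoints.length * 4);
    const view = new DataView(buffer);
    // U+FEFF written little-endian gives the bytes FF FE 00 00
    view.setUint32(0, 0xFEFF, true);
    codePoints.forEach((cp, i) => view.setUint32(4 + i * 4, cp, true));
    return new Uint8Array(buffer);
}

// Decode UTF-32LE bytes back to a string, skipping a leading BOM if present
function decodeUTF32LE(bytes) {
    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
    let result = "";
    for (let offset = 0; offset + 4 <= bytes.byteLength; offset += 4) {
        const cp = view.getUint32(offset, true);
        if (offset === 0 && cp === 0xFEFF) continue; // skip the BOM
        result += String.fromCodePoint(cp);
    }
    return result;
}

const encoded = encodeUTF32LE("\u2780\u2781\u2782 Test"); // "➀➁➂ Test"
console.log(encoded.length);                 // 36: 4 BOM bytes + 8 code points * 4
console.log(decodeUTF32LE(encoded));         // ➀➁➂ Test
```

The resulting Uint8Array already contains the BOM, so it can be passed to the Blob constructor on its own. Note that a lone surrogate in the input is written through unchanged, which strictly speaking is not valid UTF-32.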

Beware, though, that the HTML specification "prohibits" browsers from supporting UTF-32 (either LE or BE) for HTML:

13.2.3.3 Character encodings

User agents must support the encodings defined in Encoding, including, but not limited to, UTF-8, ISO-8859-2, ISO-8859-7, ISO-8859-8, windows-874, windows-1250, windows-1251, windows-1252, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, GBK, Big5, ISO-2022-JP, Shift_JIS, EUC-KR, UTF-16BE, UTF-16LE, UTF-16BE/LE, and x-user-defined. User agents must not support other encodings.

Note: The above prohibits supporting, for example, CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC, and UTF-32. This specification does not make any attempt to support prohibited encodings in its algorithms; support and use of prohibited encodings would thus lead to unexpected behavior. [CESU8] [UTF7] [BOCU1] [SCSU]

You might be better off with one of the UTF-16s, given that you're using window.open to open the result. (For downloading, UTF-32 is fine if you really want a UTF-32 file.)
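If you go the UTF-16 route, the conversion is simpler, because JavaScript strings are already sequences of UTF-16 code units and each one can be written out directly. A minimal sketch, not from the original answer; `encodeUTF16LE` is a hypothetical name.

```javascript
// Encode a string as UTF-16LE bytes, including the BOM (FF FE)
function encodeUTF16LE(str) {
    const buffer = new ArrayBuffer(2 + str.length * 2);
    const view = new DataView(buffer);
    view.setUint16(0, 0xFEFF, true); // BOM, little-endian
    for (let i = 0; i < str.length; i++) {
        // charCodeAt returns the UTF-16 code unit, so surrogate pairs
        // are written as two consecutive 16-bit values
        view.setUint16(2 + i * 2, str.charCodeAt(i), true);
    }
    return new Uint8Array(buffer);
}

const utf16 = encodeUTF16LE("\u2780T"); // "➀T"
// utf16 holds FF FE 80 27 54 00 in hex
```

The bytes could then go into a Blob in place of the UTF-32 ones, with the charset parameter changed to match. I haven't tested how each browser treats that MIME type, so take it as a starting point.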


Sadly, Stack Snippets won't let you open a new window, but here's a full example you can copy and paste to run locally:

<!doctype html>
<html>
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <title>UTF-32 Conversion</title>
    <link rel="shortcut icon" href="favicon.png">
    <style>
    body, html {
        height: 100%;
        width: 100%;
        margin: 0;
        padding: 0;
        box-sizing: border-box;
    }
    *, *:before, *:after {
        box-sizing: inherit;
    }
    </style>
</head>
<body>
<input type="button" value="Open" id="open">
<input type="button" value="Download" id="download">
<script type="module">
function getUTF32LEUrl(str, mimeType) {
    // The UTF-32LE BOM
    const BOM = new Uint8Array([0xFF,0xFE,0x00,0x00]);
    // A byte array and DataView to use when converting 32-bit LE to bytes;
    // they share an underlying buffer
    const uint8 = new Uint8Array(4);
    const view = new DataView(uint8.buffer);
    // Convert the payload to UTF-32LE
    const utf32le = Array.from(str, (ch) => {
        // Get the code point
        const codePoint = ch.codePointAt(0);
        // Write it as a 32-bit LE value in the buffer
        view.setUint32(0, codePoint, true);
        // Read it as individual bytes and create a plain array of them
        return Array.from(uint8);
    }).flat(); // Flatten the array of arrays into one long byte sequence
    // Create the blob and URL
    const b = new Blob(
        [ BOM, new Uint8Array(utf32le)],
        mimeType // Blob options object carrying the MIME type, e.g. { type: "..." }
    );
    const url = URL.createObjectURL(b);
    return url;
}
document.getElementById("open").addEventListener("click", () => {
    const str = "➀➁➂ Test";
    const url = getUTF32LEUrl(str, { type: "text/html; charset=UTF-32LE" });
    window.open(url);
});
document.getElementById("download").addEventListener("click", () => {
    const str = "➀➁➂ Test";
    const url = getUTF32LEUrl(str, { type: "text/plain; charset=UTF-32LE" });
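    const a = document.createElement("a");
    a.download = "utf-32_file.txt";
    a.href = url;
    // Add the link to the document so the removeChild call below doesn't throw
    document.body.appendChild(a);
    a.click();
    document.body.removeChild(a);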
});
</script>
</body>
</html>

CodePudding user response:

I tried something like this in Node.js, using the iconv-lite package (it relies on fs, so it won't run in a browser as-is):

var fs = require('fs');
var iconv = require('iconv-lite');

var str = '你好,世界';
var buf = iconv.encode(str, 'utf-32le');
fs.writeFileSync('test.txt', buf);
