Based on this thread, I tried to create a blob with UTF-32 encoding and a BOM of FF FE 00 00 (the UTF-32LE representation) as follows:
var BOM = new Uint8Array([0xFF,0xFE,0x00,0x00]);
var b = new Blob([ BOM, "➀➁➂ Test" ]);
var url = URL.createObjectURL(b);
open(url);
But the file content gets corrupted and shows up as text in a different language. What is the correct way to convert a string to a file in UTF-32LE format?
Edit: I'm trying this in a browser environment.
CodePudding user response:
Note: I'm assuming you're doing this in a browser, since you used Blob and Node.js only recently got Blob support, and you referenced a question doing this in a browser.
You're just setting the BOM, you're not handling converting the data. As it says in MDN's documentation, any strings in the array will be encoded using UTF-8. So you have a UTF-32LE BOM followed by UTF-8 data.
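To make the mismatch concrete, here is a minimal sketch comparing the bytes a string part ends up as (UTF-8, which is also what TextEncoder produces) with what UTF-32LE would need. The helper name toHex is made up for this illustration.

```javascript
// Hypothetical helper: format bytes as space-separated uppercase hex
const toHex = (bytes) =>
    Array.from(bytes, (b) => b.toString(16).padStart(2, "0").toUpperCase()).join(" ");

// TextEncoder always produces UTF-8, the same encoding Blob uses for string parts
const utf8Bytes = new TextEncoder().encode("➀"); // U+2780

console.log(toHex(utf8Bytes)); // E2 9E 80
// UTF-32LE would instead store the code point as four little-endian bytes: 80 27 00 00
```

A viewer that trusts the UTF-32LE BOM reads those UTF-8 bytes four at a time, which is why the text comes out as unrelated characters.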
Surprisingly (to me), the browser platform doesn't seem to have a general-purpose text encoder (TextEncoder just encodes UTF-8), but JavaScript strings provide a means of iterating through their code points (not just code units) and getting the actual Unicode code point value. (If those terms are unfamiliar, my blog post What is a string? may help.) So you can get that number and convert it into four little-endian bytes. DataView provides a convenient way to do that.
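As a small illustration of the code unit versus code point distinction and of DataView's explicit byte order, here is a sketch using a character outside the Basic Multilingual Plane. The variable names are just for this example.

```javascript
const s = "a😀"; // "😀" is U+1F600, stored as two UTF-16 code units (a surrogate pair)

// .length counts UTF-16 code units; iterating the string yields whole code points
const codeUnits = s.length;
const codePoints = Array.from(s, (ch) => ch.codePointAt(0));

// DataView lets us pick the byte order explicitly (third argument true = little-endian)
const bytes = new Uint8Array(4);
new DataView(bytes.buffer).setUint32(0, codePoints[1], true);

console.log(codeUnits);         // 3
console.log(codePoints);        // [ 97, 128512 ]
console.log(Array.from(bytes)); // [ 0, 246, 1, 0 ]
```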
Finally, you'll want to specify the charset in the blob's MIME type (the BOM itself isn't sufficient). I expected text/plain; charset=UTF-32LE to work, but it doesn't, at least not in Chromium browsers. There's probably some legacy reason for that. text/html works (on its own), but we may as well be specific and do text/html; charset=UTF-32LE.
See comments:
function getUTF32LEUrl(str) {
    // The UTF-32LE BOM
    const BOM = new Uint8Array([0xFF, 0xFE, 0x00, 0x00]);
    // A byte array and DataView to use when converting 32-bit LE to bytes;
    // they share an underlying buffer
    const uint8 = new Uint8Array(4);
    const view = new DataView(uint8.buffer);
    // Convert the payload to UTF-32LE
    const utf32le = Array.from(str, (ch) => {
        // Get the code point
        const codePoint = ch.codePointAt(0);
        // Write it as a 32-bit LE value in the buffer
        view.setUint32(0, codePoint, true);
        // Read it as individual bytes and create a plain array of them
        return Array.from(uint8);
    }).flat(); // Flatten the array of arrays into one long byte sequence
    // Create the blob and URL
    const b = new Blob(
        [BOM, new Uint8Array(utf32le)],
        { type: "text/html; charset=UTF-32LE" } // Set the MIME type
    );
    const url = URL.createObjectURL(b);
    return url;
}
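One way to sanity-check the encoded bytes, independent of how a browser chooses to render them, is to decode them again. The following is only a sketch; decodeUTF32LE is a hypothetical helper, not part of any platform API.

```javascript
// Hypothetical helper: decode UTF-32LE bytes (with an optional BOM) back to a string
function decodeUTF32LE(bytes) {
    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
    let result = "";
    for (let offset = 0; offset + 4 <= view.byteLength; offset += 4) {
        const codePoint = view.getUint32(offset, true); // true = little-endian
        // Skip a leading BOM (U+FEFF) if present
        if (offset === 0 && codePoint === 0xFEFF) {
            continue;
        }
        // Throws a RangeError if the value is not a valid code point
        result += String.fromCodePoint(codePoint);
    }
    return result;
}

// BOM followed by "➀" (U+2780) and "T" (U+0054)
const sample = new Uint8Array([
    0xFF, 0xFE, 0x00, 0x00,
    0x80, 0x27, 0x00, 0x00,
    0x54, 0x00, 0x00, 0x00,
]);
console.log(decodeUTF32LE(sample)); // "➀T"
```

In a browser, the bytes could be obtained by fetching the object URL and wrapping the response's arrayBuffer() result in a Uint8Array.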
Beware, though, that the specification "prohibits" browsers from supporting UTF-32 (either LE or BE) for HTML:
13.2.3.3 Character encodings
User agents must support the encodings defined in Encoding, including, but not limited to, UTF-8, ISO-8859-2, ISO-8859-7, ISO-8859-8, windows-874, windows-1250, windows-1251, windows-1252, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, GBK, Big5, ISO-2022-JP, Shift_JIS, EUC-KR, UTF-16BE, UTF-16LE, UTF-16BE/LE, and x-user-defined. User agents must not support other encodings.
Note: The above prohibits supporting, for example, CESU-8, UTF-7, BOCU-1, SCSU, EBCDIC, and UTF-32. This specification does not make any attempt to support prohibited encodings in its algorithms; support and use of prohibited encodings would thus lead to unexpected behavior. [CESU8] [UTF7] [BOCU1] [SCSU]
You might be better off with one of the UTF-16s, given that you're using window.open to open the result. (For downloading, UTF-32 is fine if you really want a UTF-32 file.)
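If you go the UTF-16 route, the conversion is simpler because JavaScript strings are already sequences of UTF-16 code units. A minimal sketch, assuming little-endian output with a BOM (encodeUTF16LE is a name invented for this example):

```javascript
// Hypothetical helper: encode a string as UTF-16LE bytes, BOM (FF FE) included
function encodeUTF16LE(str) {
    // 2 bytes for the BOM plus 2 bytes per UTF-16 code unit
    const bytes = new Uint8Array(2 + str.length * 2);
    const view = new DataView(bytes.buffer);
    view.setUint16(0, 0xFEFF, true); // BOM, written little-endian as FF FE
    for (let i = 0; i < str.length; i++) {
        // charCodeAt returns the code unit, so surrogate pairs are written as-is
        view.setUint16(2 + i * 2, str.charCodeAt(i), true);
    }
    return bytes;
}

const utf16 = encodeUTF16LE("➀T");
console.log(Array.from(utf16)); // [ 255, 254, 128, 39, 84, 0 ]
```

The resulting array could be passed to new Blob([utf16], { type: "text/html; charset=UTF-16LE" }). I haven't checked how every browser treats that MIME type, so test it in the ones you care about.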
Sadly, Stack Snippets won't let you open a new window, but here's a full example you can copy and paste to run locally:
<!doctype html>
<html>
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <title>UTF-32 Conversion</title>
    <link rel="shortcut icon" href="favicon.png">
    <style>
        body, html {
            height: 100%;
            width: 100%;
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }
        *, *:before, *:after {
            box-sizing: inherit;
        }
    </style>
</head>
<body>
    <input type="button" value="Open" id="open">
    <input type="button" value="Download" id="download">
    <script type="module">
        function getUTF32LEUrl(str, mimeType) {
            // The UTF-32LE BOM
            const BOM = new Uint8Array([0xFF, 0xFE, 0x00, 0x00]);
            // A byte array and DataView to use when converting 32-bit LE to bytes;
            // they share an underlying buffer
            const uint8 = new Uint8Array(4);
            const view = new DataView(uint8.buffer);
            // Convert the payload to UTF-32LE
            const utf32le = Array.from(str, (ch) => {
                // Get the code point
                const codePoint = ch.codePointAt(0);
                // Write it as a 32-bit LE value in the buffer
                view.setUint32(0, codePoint, true);
                // Read it as individual bytes and create a plain array of them
                return Array.from(uint8);
            }).flat(); // Flatten the array of arrays into one long byte sequence
            // Create the blob and URL
            const b = new Blob(
                [BOM, new Uint8Array(utf32le)],
                mimeType // Blob options object carrying the MIME type
            );
            const url = URL.createObjectURL(b);
            return url;
        }
        document.getElementById("open").addEventListener("click", () => {
            const str = "➀➁➂ Test";
            const url = getUTF32LEUrl(str, { type: "text/html; charset=UTF-32LE" });
            window.open(url);
        });
        document.getElementById("download").addEventListener("click", () => {
            const str = "➀➁➂ Test";
            const url = getUTF32LEUrl(str, { type: "text/plain; charset=UTF-32LE" });
            const a = document.createElement("a");
            a.download = "utf-32_file.txt";
            a.href = url;
            // Attach the link before clicking it so removeChild below has
            // something to remove (it throws if the node isn't a child of body)
            document.body.appendChild(a);
            a.click();
            document.body.removeChild(a);
        });
    </script>
</body>
</html>
CodePudding user response:
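I tried something like this in Node.js with the iconv-lite package (it won't run in a browser as-is, since it relies on require and fs):
var fs = require('fs');
var iconv = require('iconv-lite');
var str = '你好,世界';
// Encode the string as UTF-32LE bytes
var buf = iconv.encode(str, 'utf-32le');
// Write the raw bytes to disk
fs.writeFileSync('test.txt', buf);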
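For comparison, the same thing can be sketched in Node.js without a third-party package, using the global Buffer and the code point iteration described in the other answer. The function name toUTF32LEBuffer is made up for this example, and it writes a BOM up front, which is an assumption about what you want in the file.

```javascript
// Hypothetical helper: encode a string as UTF-32LE in a Node.js Buffer, BOM included
function toUTF32LEBuffer(str) {
    // Iterating the string yields whole code points, not UTF-16 code units
    const codePoints = Array.from(str, (ch) => ch.codePointAt(0));
    const buf = Buffer.alloc(4 + codePoints.length * 4);
    buf.writeUInt32LE(0xFEFF, 0); // BOM: FF FE 00 00
    codePoints.forEach((codePoint, i) => {
        buf.writeUInt32LE(codePoint, 4 + i * 4);
    });
    return buf;
}

const utf32 = toUTF32LEBuffer("你好");
console.log(utf32.length);          // 12 (BOM plus two code points, 4 bytes each)
console.log(utf32.toString("hex")); // fffe0000604f00007d590000
```

The returned buffer can be handed to fs.writeFileSync('test.txt', utf32) just like the iconv-lite result above.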