I need a function that checks whether a File or Blob object is valid UTF-8. I can read the text and look for � (U+FFFD, the replacement character), but if the content legitimately contains that character, the function wrongly reports it as invalid.
async function isUTF8(blob) {
  // blob.text() decodes as UTF-8 and replaces malformed bytes with U+FFFD
  const text = await blob.text();
  return !text.includes("�");
}
// "�" is valid UTF-8, but the function resolves to false
isUTF8(new Blob(["�"])).then(console.log);
// resolves to true
isUTF8(new Blob(["example"])).then(console.log);
CodePudding user response:
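You can use the TextDecoder API with the fatal option, which makes decode() throw a TypeError on malformed input instead of substituting U+FFFD: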
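async function isUTF8(blob) {
  // fatal: true makes decode() throw instead of inserting U+FFFD
  const decoder = new TextDecoder('utf-8', { fatal: true });
  const buffer = await blob.arrayBuffer();
  try {
    decoder.decode(buffer);
  } catch (e) {
    // A TypeError signals malformed UTF-8; rethrow anything else
    if (e instanceof TypeError)
      return false;
    throw e;
  }
  return true;
}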
(async () => {
  console.log(await isUTF8(new Blob(
    [new Uint8Array([0x80])]))); // false
  console.log(await isUTF8(new Blob(
    [new Uint8Array([0xef, 0xbf, 0xbd])]))); // true
  console.log(await isUTF8(new Blob(
    ["\ufffd"]))); // true
  console.log(await isUTF8(new Blob(
    ["example"]))); // true
})().catch(e => console.warn(e));
The above loads the entire Blob into an ArrayBuffer for simplicity. If memory efficiency becomes an issue, you may look into using the .stream() method to process the Blob in parts, without holding it in memory in its entirety.
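As a rough sketch of that streaming approach (not part of the original answer): the function name isUTF8Streaming is made up for illustration, and the code assumes an environment where Blob.prototype.stream() and ReadableStream readers are available. Passing { stream: true } to decode() lets a multi-byte sequence span two chunks, and the final argument-less decode() call flushes the decoder so that a truncated sequence at the end is still rejected.

```javascript
// Sketch: validate UTF-8 chunk by chunk instead of buffering the whole Blob.
// isUTF8Streaming is a hypothetical name, not an existing API.
async function isUTF8Streaming(blob) {
  const decoder = new TextDecoder('utf-8', { fatal: true });
  const reader = blob.stream().getReader();
  try {
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps an incomplete trailing sequence buffered
      // until the next chunk arrives, rather than treating it as an error
      decoder.decode(value, { stream: true });
    }
    // Flush: throws if the input ended in the middle of a sequence
    decoder.decode();
  } catch (e) {
    // A TypeError signals malformed UTF-8; rethrow anything else
    if (e instanceof TypeError) return false;
    throw e;
  } finally {
    reader.releaseLock();
  }
  return true;
}
```

It should behave like the buffered version for the same inputs, for example isUTF8Streaming(new Blob(["example"])).then(console.log). The decoded strings are discarded, so only one chunk at a time needs to be held in memory.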