Check if file/blob object is valid UTF-8

Time:08-14

I need a function that checks whether a file or Blob object is valid UTF-8. I can read the text and look for � (U+FFFD, the replacement character), but if the original content already contains that character, the function wrongly reports it as invalid.

async function isUTF8(blob) {
  // blob.text() replaces malformed byte sequences with U+FFFD
  const text = await blob.text();
  return !text.includes("�");
}

// logs false, even though "�" is valid UTF-8
isUTF8(new Blob(["�"])).then(console.log);

// logs true
isUTF8(new Blob(["example"])).then(console.log);

CodePudding user response:

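You can use the TextDecoder API with the fatal option, which makes decode() throw a TypeError on malformed input instead of substituting U+FFFD: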

async function isUTF8(blob) {
  const decoder = new TextDecoder('utf-8', { fatal: true });
  const buffer = await blob.arrayBuffer();
  try {
    decoder.decode(buffer);
  } catch (e) {
    if (e instanceof TypeError)
      return false;
    throw e;
  }
  return true;
}

(async () => {
  console.log(await isUTF8(new Blob(
    [new Uint8Array([0x80])]))); // false
  console.log(await isUTF8(new Blob(
    [new Uint8Array([0xef, 0xbf, 0xbd])]))); // true
  console.log(await isUTF8(new Blob(
    ["\ufffd"]))); // true
  console.log(await isUTF8(new Blob(
    ["example"]))); // true
})().catch(e => console.warn(e));

The above loads the entire Blob into an ArrayBuffer for simplicity. If memory use becomes an issue, you can use the Blob's .stream() method to process it in chunks, without ever holding the whole content in memory.
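
As a rough sketch of that streaming approach (not part of the original answer), the version below reads the Blob chunk by chunk and feeds each chunk to the same fatal decoder. The name isUTF8Streaming is made up for illustration, and the code assumes an environment where Blob.prototype.stream() is available (current browsers, Node 18+).

```javascript
// Hypothetical streaming variant of isUTF8; the function name is an assumption.
async function isUTF8Streaming(blob) {
  const decoder = new TextDecoder('utf-8', { fatal: true });
  const reader = blob.stream().getReader();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true lets a multi-byte sequence span two chunks:
      // an incomplete tail is buffered instead of being rejected.
      decoder.decode(value, { stream: true });
    }
    // A final call without stream: true flushes the decoder; in fatal mode
    // it should throw if the data ended in the middle of a sequence.
    decoder.decode();
    return true;
  } catch (e) {
    if (e instanceof TypeError) return false;
    throw e;
  } finally {
    // Release the reader whether validation succeeded or failed.
    reader.releaseLock();
  }
}
```

The decoded strings are discarded right away, so only one chunk at a time needs to stay in memory. Passing { stream: true } matters here: without it, a character split across a chunk boundary would be reported as invalid.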
