Detect invalid UTF-8 character in a string

Time:11-07

I am reading a CSV file in Node.js that contains URLs. I want to detect when a string contains the character � or any other character that is not a valid UTF-8 symbol.

Wrong URL I want to be able to detect:

'https://example.com/v�hicules-de-location/france'

Right URL I want to ignore:

'https://example.com/véhicules-de-location/france'

Is there an easy way with JavaScript to do that?

CodePudding user response:

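Try a regex test (see "check if javascript string is valid UTF-8"). Note that a class such as [^\x20-\x7E] matches every character outside printable ASCII, so it would also flag the valid é. To detect only the replacement character, test for U+FFFD directly:

    // No `g` flag: with `g`, test() keeps state in lastIndex between calls
    const regex = /\uFFFD/;
    regex.test("https://example.com/v�hicules-de-location/france"); // true
    regex.test("https://example.com/véhicules-de-location/france"); // false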
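
If the raw bytes are still available (that is, before they are decoded into a string), another option is to validate them as UTF-8 instead of searching the decoded text. The following is only a minimal sketch, assuming Node.js with its global `TextDecoder` and `Buffer`; the helper name `isValidUtf8` is hypothetical and not part of any library.

```javascript
// Hypothetical helper: returns true when `bytes` decodes as well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    // With `fatal: true`, decode() throws a TypeError on malformed input
    // instead of silently substituting U+FFFD.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
const good = Buffer.from('véhicules', 'utf8');
// In Latin-1, "é" is the single byte 0xE9, which is malformed UTF-8
// when followed by a plain ASCII byte such as "h".
const bad = Buffer.from('véhicules', 'latin1');

console.log(isValidUtf8(good));
console.log(isValidUtf8(bad));
```

This only helps when the bytes themselves are malformed. If the file already stores a literal � character, the bytes are valid UTF-8 and a string-level check is still needed.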

CodePudding user response:

The file already contains the �; I can see it when I open the file in a text editor. So I do not understand all the talk about "how you read the file": once the � is there, nothing can be done to encode it properly.

In the end I am doing the following, because the 'corrupted' character already arrives like that and is not something I can control.

....

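import csv from 'csv-parser'
import { Readable } from 'stream'

// U+FFFD is the replacement character that stands in for undecodable bytes
const REPLACEMENT_CHAR = /\uFFFD/

export default async (buffer) => {
  const json = []

  return new Promise((resolve, reject) => {
    Readable.from(buffer)
      .pipe(csv())
      .on('data', (data) => {
        const corrupted = Object.values(data).some((entry) => REPLACEMENT_CHAR.test(entry))

        // Reject on the first corrupted row and do not collect it
        if (corrupted) return reject('encoding')

        json.push(data)
      })
      // Surface stream and parser failures instead of leaving the promise pending
      .on('error', reject)
      .on('end', async () => {
        resolve(writeFile ....)
      })
  })
}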
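
To confirm that the � is really stored in the file, rather than produced while decoding, the raw bytes can be searched for the UTF-8 encoding of U+FFFD, which is the three bytes EF BF BD. This is a rough sketch under the assumption that the file content is available as a Node.js `Buffer`; `hasStoredReplacementChar` is a hypothetical helper name.

```javascript
// U+FFFD encoded as UTF-8 is the three-byte sequence EF BF BD.
const REPLACEMENT_BYTES = Buffer.from([0xef, 0xbf, 0xbd]);

// Hypothetical helper: true when the raw bytes already contain an encoded U+FFFD,
// meaning the original character was lost before the file was written.
function hasStoredReplacementChar(buffer) {
  return buffer.includes(REPLACEMENT_BYTES);
}

const alreadyDamaged = Buffer.from('https://example.com/v\uFFFDhicules-de-location/france', 'utf8');
const intact = Buffer.from('https://example.com/véhicules-de-location/france', 'utf8');

console.log(hasStoredReplacementChar(alreadyDamaged));
console.log(hasStoredReplacementChar(intact));
```

When this check is positive, the original byte cannot be recovered from the file itself, so rejecting or flagging the row, as in the answer above, is about all that can be done on the reading side.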