I am reading a CSV file in Node.js that contains URLs.
I want to be able to detect when a string contains the character � or any other character that is not a proper UTF-8 symbol.
Wrong URL I want to be able to detect:
'https://example.com/v�hicules-de-location/france'
Right URL I want to ignore:
'https://example.com/véhicules-de-location/france'
Is there an easy way with JavaScript to do that?
CodePudding user response:
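Test for the Unicode replacement character U+FFFD (�), which is what a decoder typically substitutes for bytes that are not valid UTF-8. A range such as [^\x20-\x7E] would not work here, because it also matches legitimate non-ASCII characters like é.
const regex = /\uFFFD/; // no g flag, so test() keeps no state between calls
regex.test("https://example.com/v�hicules-de-location/france"); // true
regex.test("https://example.com/véhicules-de-location/france"); // false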
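If the raw bytes are still available (for example the Buffer read from disk), validity can also be checked before the text is decoded at all. The following is only a sketch: isValidUtf8 is a hypothetical helper name, and it assumes a Node.js build in which TextDecoder supports the fatal option, so that decoding throws instead of silently inserting �.

```javascript
// Hypothetical helper: returns true if the bytes decode cleanly as UTF-8.
// With { fatal: true }, decode() throws a TypeError on invalid sequences
// instead of replacing them with U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes)
    return true
  } catch {
    return false
  }
}

// The same URL encoded two ways: as UTF-8, and as Latin-1 (where é is the
// single byte 0xE9, which is not a valid UTF-8 sequence on its own).
const good = Buffer.from('https://example.com/véhicules-de-location/france', 'utf8')
const bad = Buffer.from('https://example.com/véhicules-de-location/france', 'latin1')

console.log(isValidUtf8(good)) // true
console.log(isValidUtf8(bad)) // false
```

Note that this only helps when the � is produced while decoding. If the file already contains the encoded replacement character itself, the bytes are valid UTF-8 and this check passes.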
CodePudding user response:
The file arrives with the � already in it. I can see it when I open the file in a text editor, so I do not understand all the talk about "how you read the file": once the � is there, there is nothing that can be done to encode it properly.
In the end I am doing this, because the 'corrupted' character arrives that way and is not something I can control.
....
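import csv from 'csv-parser'
import { Readable } from 'stream'

export default async (buffer) => {
  const json = []
  let corrupted = false
  return new Promise((resolve, reject) => {
    Readable.from(buffer)
      .pipe(csv())
      .on('data', (data) => {
        // U+FFFD (�) in any field means the row is corrupted
        if (Object.values(data).some((entry) => entry.includes('\uFFFD'))) {
          corrupted = true
          reject(new Error('encoding'))
          return // do not collect corrupted rows
        }
        json.push(data)
      })
      .on('error', reject) // surface parser errors as a rejection
      .on('end', () => {
        // skip the write if a corrupted row was seen
        if (!corrupted) resolve(writeFile ....)
      })
  })
}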
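Whether the � is really "already in the file" can be checked on the raw bytes, since a text editor may also display � for bytes it could not decode. The sketch below assumes the buffer holds the undecoded file contents; classifyBuffer and its three return labels are made-up names for illustration.

```javascript
// U+FFFD encoded as UTF-8 is the three bytes EF BF BD.
const REPLACEMENT_BYTES = Buffer.from([0xef, 0xbf, 0xbd])

// Hypothetical helper that tells the two situations apart.
function classifyBuffer(buffer) {
  // The replacement character was written into the file itself.
  if (buffer.includes(REPLACEMENT_BYTES)) return 'replacement-in-file'
  // The bytes are not valid UTF-8, so decoding is what produces the �.
  if (buffer.toString('utf8').includes('\uFFFD')) return 'invalid-utf8'
  return 'clean'
}

console.log(classifyBuffer(Buffer.from('v\uFFFDhicules', 'utf8'))) // replacement-in-file
console.log(classifyBuffer(Buffer.from('véhicules', 'latin1'))) // invalid-utf8
console.log(classifyBuffer(Buffer.from('véhicules', 'utf8'))) // clean
```

In the 'replacement-in-file' case the original character is lost and rejecting the row is about all that can be done. In the 'invalid-utf8' case the file is probably in another encoding, and decoding it with that encoding (for instance buffer.toString('latin1') for Latin-1) may recover the text.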