Java String contains/indexof fails due to wrong encoding from local file

Time:08-10

EDIT: I have a semi-working solution at the bottom. The original question follows.

I have a local CSV file encoded in UTF-16LE. I want to read the file into memory in Java, modify it, then write it out. I have been fighting incredibly strange problems for hours.

The file comes from Facebook lead generation. Every line of the CSV contains the text "2022-08-08". However, when I read a line with a BufferedReader, all String methods fail: contains("2022-08-08") returns false. I print the line directly after the check, and it visibly contains "2022-08-08", so the String methods appear to be failing completely.

I think it's possibly due to encoding, but I'm not sure. I tried pasting the code into this website for help, but any part of the code that includes strings copy-pasted from the CSV file refuses to paste into my browser.

int i = s.indexOf("2022");
if (i < 0) {
    // Reached only when "2022" was NOT found, yet the printed line visibly shows it.
    System.out.println(s.contains("2022") + ", " + s);
    continue;
}

Prints: false, 2022-08-08T19:57:51+07:00

There are tons of invisible characters in the CSV file and in my IDE everywhere I have copy pasted from the file. I know the characters are there because when I backspace them it deletes the invisible character instead of the actual character I would expect it to delete.

Please help me.

EDIT:

This code appears to fix the problem. I think the problem is partly Facebook's encoding of the file, and partly that the file comes from user-generated input and contains a few very strange entries. If anyone has more to add or a better solution, I will award it. I'm not sure exactly why this works; I combined it from several sources that had little explanation.

Is there a way to determine the encoding automatically? Windows Notepad is able to do it.

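// Decode the file explicitly as UTF-16LE instead of relying on the platform default charset.
BufferedReader fr = new BufferedReader(new InputStreamReader(new FileInputStream(new File("C:\\New folder\\form.csv")), "UTF-16LE"));
// Files.newBufferedWriter writes UTF-8 when no charset is given.
BufferedWriter fw = Files.newBufferedWriter(Paths.get("C:\\New folder", "form3.txt"));

String s;
while ((s = fr.readLine()) != null) {
    // Replace control/format characters with "?", drop any non-alphanumeric character
    // that directly precedes a comma, then drop everything outside ASCII.
    s = s.replaceAll("\\p{C}", "?").replaceAll("[^A-Za-z0-9],", "").replaceAll("[^\\x00-\\x7F]", "");
    // do stuff with s normally
}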
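
On the question of detecting the encoding automatically: one common approach, and possibly part of what editors such as Notepad do, is to look for a byte-order mark (BOM) in the first bytes of the file. The sketch below assumes the file actually starts with a BOM, which is not guaranteed; the class and method names (BomSniffer, detect, stripBom) are made up for illustration. It also ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16LE one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK: guesses a charset from a leading BOM.
final class BomSniffer {

    // Returns the charset implied by a BOM at the start of head, or fallback if none matches.
    static Charset detect(byte[] head, Charset fallback) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return fallback;
    }

    // Decoding with UTF_16LE keeps the BOM as a leading U+FEFF character; remove it if present.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the first bytes of a file: BOM FF FE followed by "20" in UTF-16LE.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x32, 0x00, 0x30, 0x00};
        Charset cs = detect(sample, StandardCharsets.UTF_8);
        String text = stripBom(new String(sample, cs));
        System.out.println(cs + " -> " + text);
    }
}
```

U+FEFF is an invisible format character, so it matches \p{C}. That may be one reason the replaceAll chain above seemed to help on the first line of the file.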

CodePudding user response:

You can verify what you're actually getting from the stream by dumping the bytes of the string you read:

byte[] b = s.getBytes(StandardCharsets.UTF_16BE);
System.out.println(Arrays.toString(b));
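
To show what such a dump can reveal, here is a minimal sketch that assumes the original read decoded the UTF-16LE bytes with a single-byte charset (ISO-8859-1 is used as a stand-in; the real platform default may differ). In that case every second character is an invisible NUL, which would explain why the printed line looks right while contains returns false. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the same bytes decoded with the wrong and the right charset.
final class EncodingMismatchDemo {

    // Renders each UTF-16 code unit as four hex digits so invisible characters show up.
    static String dumpCodeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The bytes as they would sit in a UTF-16LE file (no BOM in this sample).
        byte[] raw = "2022-08-08".getBytes(StandardCharsets.UTF_16LE);

        // Assumption: a single-byte charset stands in for the wrong decoder.
        String wrong = new String(raw, StandardCharsets.ISO_8859_1);
        String right = new String(raw, StandardCharsets.UTF_16LE);

        System.out.println(wrong.contains("2022")); // false: a NUL sits between the digits
        System.out.println(right.contains("2022")); // true
        System.out.println(dumpCodeUnits(wrong.substring(0, 4))); // 0032 0000 0030 0000
    }
}
```

If the dump of a real line shows 0000 between the visible characters, the reader is using the wrong charset and the fix belongs in the InputStreamReader, not in the string afterwards.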

CodePudding user response:

I think the search condition for indexOf could be wrong:

int i = s.indexOf("2022");
if (i < 0) {
    System.out.println(s.contains("2022") + ", " + s);
    continue;
}

Maybe the condition should be (i != -1), if I'm not mistaken.

It's a little tricky, because when (i < 0) is true the string does not contain "2022", so contains("2022") printing false inside that branch is expected.
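
To make that concrete, here is a small sketch (the class and method names are made up) showing that indexOf returns -1 for a missing substring, so (i < 0) marks the not-found case and contains agrees with it:

```java
// Hypothetical demo of how indexOf and contains relate.
final class IndexOfCondition {

    // True when needle is absent: String.indexOf returns -1 in that case.
    static boolean isMissing(String s, String needle) {
        return s.indexOf(needle) < 0;
    }

    public static void main(String[] args) {
        String line = "2022-08-08T19:57:51";
        System.out.println(line.indexOf("2022")); // 0: found at the start
        System.out.println(line.indexOf("1999")); // -1: not found
        System.out.println(isMissing(line, "1999") == !line.contains("1999")); // true
    }
}
```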
