How to trim file segment length, so when written as new filepath is not longer than 255 chars for ea

My Java code generates a new path name for an existing file. As part of this, I have to ensure that no path segment is longer than 255 characters, because longer segments are illegal on most operating systems.

// No path component can be longer than 255 chars
String[] pathComponents = splitPath(newPath);
for (int i = 0; i < pathComponents.length - 1; i++) {
    if (pathComponents[i].length() > MAX_FILELENGTH) {
        String shortened = pathComponents[i].substring(0, MAX_FILELENGTH - 1);
        shortened = shortened.trim();
        sb.append(shortened).append(File.separator);
    }
    else {
        sb.append(pathComponents[i]).append(File.separator);
    }
}

This works fine most of the time, but it fails when a segment has fewer than 255 Unicode characters yet some of those characters need more than one byte when written to the filesystem. The segment then ends up longer than 255 bytes, which the test above does not catch.

I can count bytes instead of characters with

if(pathComponents[i].getBytes(StandardCharsets.UTF_8).length > MAX_FILELENGTH)

However, I cannot work out a clean way to trim off just the right number of characters.

CodePudding user response：

As you stated, 255 characters sometimes take exactly 255 bytes and sometimes more. This simple test shows that:

String a = "a";
System.out.println(a);
System.out.println((int)a.charAt(0));  
System.out.println(Arrays.toString(a.getBytes(StandardCharsets.UTF_8)));
// a
// 97
// [97]


String aa = "ä";
System.out.println(aa);
System.out.println((int)aa.charAt(0));    
System.out.println(Arrays.toString(aa.getBytes(StandardCharsets.UTF_8)));
// ä
// 228
// [-61, -92]

As you can see, the letter ä has a value within the 0-255 range (it fits in 8 bits), yet in UTF-8 it is represented by a byte array of length 2.

What would I do? From the question, I see that you generate the string yourself. In that new-path generator, I would emit only characters from the ASCII range (0-127). Then you can be sure that every character of the generated string takes exactly one byte, and string.length() will be the same as string.getBytes(StandardCharsets.UTF_8).length.
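A minimal sketch of how that ASCII-only assumption could be checked before relying on length(). The class and method names are made up for illustration and are not part of the question's code:

```java
import java.nio.charset.StandardCharsets;

final class AsciiSegmentCheck {

    /** True when every char is in the ASCII range 0-127. */
    static boolean isAscii(String segment) {
        return segment.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        String plain = "report_2024";
        String accented = "caf\u00e9"; // ends with U+00E9, which needs 2 bytes in UTF-8

        // For an ASCII-only segment, the char count and the UTF-8 byte count agree.
        System.out.println(isAscii(plain) + " " + plain.length() + " "
                + plain.getBytes(StandardCharsets.UTF_8).length);

        // For a non-ASCII segment they can differ, so length() is not a safe test.
        System.out.println(isAscii(accented) + " " + accented.length() + " "
                + accented.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

If isAscii returns false for a generated segment, the byte count has to be checked explicitly instead.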

The following code snippet shows that every value up to 127 encodes to a single byte, while values from 128 onward encode to a two-byte array. You can also use that rule to shorten the string.

for (int i = 0; i < 255; i++) {
    char c = (char)i;
    String s = String.valueOf(c);
    System.out.println(i + "-> " + s + " ->" + Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
}

// ...
// ...
// 125-> } ->[125]
// 126-> ~ ->[126]
// 127->  ->[127]
// 128->  ->[-62, -128]
// 129->  ->[-62, -127]
// 130->  ->[-62, -126]
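One way that rule might be used to shorten a string, as a rough sketch rather than a tested solution. It assumes the filesystem stores names as UTF-8 and extends the rule with the standard UTF-8 ranges (1 byte below 0x80, 2 below 0x800, 3 below 0x10000, otherwise 4). The names Utf8Trim, utf8Length and trimToUtf8Bytes are hypothetical:

```java
final class Utf8Trim {

    /** Number of bytes UTF-8 needs for one code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /** Longest prefix of s whose UTF-8 encoding fits in maxBytes. */
    static String trimToUtf8Bytes(String s, int maxBytes) {
        int bytes = 0;
        int i = 0; // char index into s
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int len = utf8Length(cp);
            if (bytes + len > maxBytes) {
                break; // the next code point would exceed the budget
            }
            bytes += len;
            // Advance by 1 or 2 chars so a surrogate pair is never cut in half.
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        // Same characters as "a-ring 2 o-slash 4 ae", written with escapes.
        String s = "\u00e52\u00f84\u00e6";
        System.out.println(trimToUtf8Bytes(s, 5).length());
    }
}
```

With a budget of 5 bytes, the five-character input above is cut to its first three characters (2 + 1 + 2 bytes). An unpaired surrogate is counted as 3 bytes here, which can only overestimate the size Java's encoder produces, so the result still fits.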

CodePudding user response：

I agree with Mark Rotteveel's comment that assuming the OS filename limit is 255 bytes isn't safe, and that you need to know the charset the OS uses for filenames.

That said, to answer the question you asked: to split each String up according to a maximum byte length per component, you need to iterate through the characters or code points until the converted size exceeds the maximum.

If you don't handle code points, you could do this by writing and flushing one character at a time to an OutputStreamWriter backed by a ByteArrayOutputStream, and splitting the component when the next character would mean byteArrayOutput.size() > MAX.
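A rough sketch of that writer-based idea, shown here for trimming a single component rather than splitting it. It assumes the input contains no surrogate pairs, since it works char by char. WriterTrim and trimByWriter are made-up names:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WriterTrim {

    /** Prefix of s that encodes to at most maxBytes (char-based, no surrogate handling). */
    static String trimByWriter(String s, int maxBytes, Charset cs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (OutputStreamWriter writer = new OutputStreamWriter(bytes, cs)) {
            for (int i = 0; i < s.length(); i++) {
                writer.write(s.charAt(i));
                writer.flush(); // push the encoded bytes through so size() is current
                if (bytes.size() > maxBytes) {
                    return s.substring(0, i); // char i was the one that overflowed
                }
            }
            return s;
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked for brevity.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(trimByWriter("67890abcdef", 5, StandardCharsets.UTF_8));
    }
}
```

Flushing after every character is slow, so this is only reasonable for short strings such as path components.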

To handle code points, you could try this example, which iterates over the code points, finds the size of each in bytes, then assembles them into substrings that all stay within your byte length limit when converted to the chosen character set:

// Imports needed: java.io.File, java.nio.charset.Charset, java.nio.charset.StandardCharsets, java.util.ArrayList
public static void main(String ... args) {

    // Low value for tests:
    int maxSizeInBytes = 5;

    // You'd need to know what this is:
    Charset osPathCharset = StandardCharsets.UTF_8;

    ArrayList<String> split = new ArrayList<>();

    System.out.println("splitting " + String.join(File.separator, args));
    for (String s : args) {
        System.out.println("s=" + s + " chars#=" + s.length() + " bytes#=" + s.getBytes(osPathCharset).length);

        // Work on code points, not chars, so surrogate pairs are never split
        int[] codePoints = s.codePoints().toArray();
        byte[][] cpAsBytes = s.codePoints().mapToObj(Character::toString).map(c -> c.getBytes(osPathCharset)).toArray(byte[][]::new);

        StringBuilder b = new StringBuilder();
        int avail = maxSizeInBytes;
        for (int i = 0; i < cpAsBytes.length; i++) {
            // Start a new part when the next code point would not fit
            if (avail < cpAsBytes[i].length) {
                split.add(b.toString());
                avail = maxSizeInBytes;
                b.setLength(0);
            }
            avail -= cpAsBytes[i].length;
            // Use the code point array: s.codePointAt(i) expects a char index,
            // which drifts from the code point index after a surrogate pair
            b.appendCodePoint(codePoints[i]);
        }
        if (b.length() > 0)
            split.add(b.toString());
    }

    System.out.println("split as: " + String.join(File.separator, split.toArray(String[]::new)));
    for (String s : split) {
        System.out.println("Part s=" + s + " chars#=" + s.length() + " bytes#=" + s.getBytes(osPathCharset).length);
    }
}

Obviously this isn't memory friendly: it creates a stream of code points as Strings and then as byte[] arrays. It also isn't robustly tested, so I may delete this answer at some stage. I tried it with:

main("å2ø4æ","12345", "67890abcdef");

Which prints:

splitting å2ø4æ\12345\67890abcdef
s=å2ø4æ chars#=5 bytes#=8
s=12345 chars#=5 bytes#=5
s=67890abcdef chars#=11 bytes#=11
split as: å2ø\4æ\12345\67890\abcde\f
Part s=å2ø chars#=3 bytes#=5
Part s=4æ chars#=2 bytes#=3
Part s=12345 chars#=5 bytes#=5
Part s=67890 chars#=5 bytes#=5
Part s=abcde chars#=5 bytes#=5
Part s=f chars#=1 bytes#=1
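The example above splits over-long components rather than trimming them. For the trimming case the question asks about, one possible alternative (a different technique, not the code from this answer) is to let a CharsetEncoder fill a ByteBuffer of exactly the allowed size and see how many chars it consumed. The encoder stops before any character that does not fit completely. EncoderTrim and truncateToBytes are hypothetical names, and UTF-8 is only an assumed filename charset:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncoderTrim {

    /** Longest prefix of s that encodes to at most maxBytes in the given charset. */
    static String truncateToBytes(String s, int maxBytes, Charset cs) {
        CharsetEncoder encoder = cs.newEncoder()
                // Replace bad input instead of stopping, so only overflow ends the loop.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes); // the byte budget
        CharBuffer in = CharBuffer.wrap(s);
        // Encoding stops with an overflow result once the next character no longer fits.
        encoder.encode(in, out, true);
        // in.position() is the number of chars that were fully encoded.
        return s.substring(0, in.position());
    }

    public static void main(String[] args) {
        String trimmed = truncateToBytes("67890abcdef", 5, StandardCharsets.UTF_8);
        System.out.println(trimmed + " bytes#=" + trimmed.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

In the question's loop, a call such as truncateToBytes(pathComponents[i], MAX_FILELENGTH, charset) could stand in for the substring call, provided the charset really matches what the filesystem uses.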