Define StringBuilder capacity when number of characters is unknown

Time:11-14

I'm aware that, for good practice, StringBuilder should be initialised with a capacity value of the expected content. Otherwise, increasing the size after compilation is going to be an expensive operation.

My question is, if we don't know the expected size, how should one go about it? Is there a standard value/way to avoid expensive operations under the hood?

If not, is there a way to raise an alert or log a message from the code when the capacity grows beyond the value given at initialisation?

CodePudding user response:

I'm aware that, for good practice, StringBuilder should be initialised with a capacity value of the expected content. Otherwise, increasing the size after compilation is going to be an expensive operation.

This is a wildly incorrect statement. It is very bad practice to do this. Even if you know exactly how large it'll be.

If I see this code:

StringBuilder sb = new StringBuilder(in1.length() + in2.length() * 3 + (loaded ? suffixLen : 0));

Then this is an additional thing to worry about, test, and keep up to date. I would assume if all this is present that for whatever reason somebody did some performance testing and actually figured out that this saves a worthwhile chunk of cycles, and somehow, in a fit of idiocy, neglected to write an enlightening comment and link to the JMH or profiler result analysis to verify this conclusion.

So I'd either painstakingly attempt to analyse by hand whether the calculation is still correct after an update to this code, or I'd fix the problem and add the tests (and then be utterly befuddled when, of course, the profiler review shows this code is utterly inconsequential), or I'd go through the considerable trouble of writing an assert-based test case that runs the entire operation and then verifies at the end that the size calculation done at the top is, in fact, correct.
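For what it's worth, that kind of check (and the logging the question asks about) could look roughly like the following. This is a minimal sketch, not a recommendation: CapacityCheck and outgrewEstimate are hypothetical names made up for illustration, and the code relies only on StringBuilder.capacity() and java.util.logging from the JDK.

```java
import java.util.logging.Logger;

// Hypothetical helper, for illustration only.
final class CapacityCheck {
    private static final Logger LOG = Logger.getLogger(CapacityCheck.class.getName());

    // Appends all parts to a builder created with the given estimate and
    // reports (and logs) whether the builder had to grow beyond it.
    static boolean outgrewEstimate(int estimatedCapacity, String... parts) {
        StringBuilder sb = new StringBuilder(estimatedCapacity);
        int initialCapacity = sb.capacity();
        for (String part : parts) {
            sb.append(part);
        }
        boolean grew = sb.capacity() > initialCapacity;
        if (grew) {
            LOG.warning("StringBuilder grew from " + initialCapacity + " to "
                    + sb.capacity() + " chars (final length " + sb.length() + ")");
        }
        return grew;
    }

    public static void main(String[] args) {
        System.out.println(outgrewEstimate(16, "abc", "def")); // false: 6 chars fit in 16
        System.out.println(outgrewEstimate(4, "abc", "def"));  // true: 6 chars do not fit in 4
    }
}
```

Note that this is itself extra code to read, test, and maintain, which is exactly the cost being argued against here.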

I don't think you fully grasp why the hyperbolic "premature optimization is the root of all evil" statement is so popular.

Here's the problem. 99% of the system's resources are spent on 1% of the code. That's not an exaggeration; in fact, that is likely understating the issue.

Developer time is not infinite, and even if it were, a programmer's ability to comprehend code and focus on the relevant parts is limited because, in the end, programmers are human. Additional code that has to be parsed and understood by human eyeballs and brains is therefore bad if that code does something irrelevant. We're literally talking about the same order of magnitude as throwing a glass of water into the ocean from a beach in Europe and then watching for the water level to rise in Manhattan: beyond any and all ability to measure, and incomprehensibly small. Bold type does not do sufficient justice to how little it matters. Even if this code runs 100 million times a day for 15 years, it amounts to perhaps 5 cents in IaaS deployment costs in total over that entire decade and a half, and that's if there is a performance impact at all, which often there isn't, because modern VMs, GCs, OSes, and CPU architectures get up to some crazy shenanigans.

Furthermore, the system optimizes. Optimizers, such as the JVM's HotSpot engine, are in the end pattern-matching machines. They find commonly used patterns and recognize how to run them as efficiently as possible. If you write code in ways that nobody else does, it is highly unlikely to outperform the common (idiomatic) case: most likely because it just doesn't matter, and even if it does, because the idiomatic case gets optimized much more readily.

Here is a trivial example:

List<String> someList = new ArrayList<String>();
for (int i = 0; i < 10000; i++) someList.add(someRandomString);
String[] arrayForm = someList.toArray(new String[0]);

Here you may go: Huh, well, we can optimize this code a little bit and pass new String[10000] instead; this saves the system from having to allocate an admittedly small object (a 0-size string array).

You would be wrong. The above code, with the new String[0], is in fact faster. How can that be? Optimizers, with pattern matching. They recognize the pattern and realize that the system can create a new array of the requisite size, not zero it out, and then run the code that fills it. The JVM spec guarantees that a new array is zeroed, but that merely means you can never observe that it wasn't zeroed out; it doesn't mean the JVM must actually do it, and that is where the optimization of skipping it comes from. In theory the system could realize the same thing for the new String[reqSize] variant, but the optimization patterns do not include it: not common enough, and somewhat more complicated.
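Both call forms return the same contents, so which one to use is purely a performance question, and only a proper benchmark (JMH, for instance) on your own JVM can settle it. The sketch below shows only the equivalence, not the timing; ToArrayForms and its method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class, for illustration only.
final class ToArrayForms {
    // Idiomatic form: pass an empty array and let toArray allocate the result.
    static String[] withEmptyArray(List<String> list) {
        return list.toArray(new String[0]);
    }

    // "Hand-optimized" form: pre-size the array yourself.
    static String[] withPresizedArray(List<String> list) {
        return list.toArray(new String[list.size()]);
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("a", "b", "c");
        // Same elements in the same order either way.
        System.out.println(Arrays.equals(withEmptyArray(list), withPresizedArray(list))); // true
    }
}
```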

I'm not saying that new StringBuilder() is necessarily faster than new StringBuilder(knownSize). I'm saying:

  • 99.9% of the time it literally does not make one iota of difference. Not a single nanosecond. The speedup is entirely theoretical: no performance test of any stripe can detect the difference. If a tree falls in the forest, and all that.
  • You have no idea when that 0.1% of the time even is, or whether it's actually straight-up never: 0%. Between CPUs that cache and pipeline (did you know modern CPUs cannot access main memory directly? At all? I bet you didn't. The basic von Neumann model of how CPUs work totally misleads you if you use it to analyse the performance of machine code), VMs, and garbage collectors (did you know that garbage is free but live objects are expensive? Re-using an object is in fact more expensive than creating a ton of short-lived garbage, depending on many factors; of course this too is an oversimplification. That's the real point: this is an intractable thing, and you cannot just look at code and jump to conclusions about performance), you stand no chance of knowing what's 'faster'.

The only right move is to write code as simply and as cleanly as you can ('clean' defined as: when you look at it, you jump to conclusions, and these conclusions are correct; it is easy to adjust in the face of changing requirements, and flexible in how it connects to the rest of the codebase). IF (big if!) real-life situations result in a performance issue, you first run a profiler so you know which 1% of the code is in any way relevant, and then you go ham on that, with JMH benchmarks and all sorts of performance experiments to optimize the heck out of it. If your code is clean, that's great, because optimizing almost always requires adjusting the code that calls into the 'hot path', or the code that the 'hot path' flows out to, and the cleaner your code, the easier that will be.

Needless performance optimization almost invariably reduces flexibility, and makes code harder to understand.

Hence, objectively, micro-optimizing like this just makes your code slower and buggier for literally no benefit. Not even a tiny, almost immeasurable one.

Hence, the advice is silly. The only correct call is new StringBuilder() - no pre-configured size. The one and only excuse you have to write new StringBuilder(presetCapacity) is if there's a lengthy comment that immediately precedes it that lays out in a lot of detail, or links to a ticket, the exact performance study done to indicate this indeed fixes a real performance issue and how to recreate that study, and on what schedule it should be revisited.
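To make the question's "standard value" concrete: the no-argument constructor is documented to start at a capacity of 16 characters, and the builder grows on its own from there. A minimal sketch follows; DefaultBuilderDemo and repeat are hypothetical names used only for illustration.

```java
// Hypothetical class, for illustration only.
final class DefaultBuilderDemo {
    // Builds a string of unknown final length with a default-capacity builder.
    static String repeat(String piece, int times) {
        StringBuilder sb = new StringBuilder(); // no capacity guess needed
        for (int i = 0; i < times; i++) {
            sb.append(piece); // the builder grows automatically as required
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(new StringBuilder().capacity()); // 16 (documented default)
        System.out.println(repeat("ab", 50).length());      // 100
    }
}
```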
