Monday, July 20, 2015

Java Strings No Longer Share Storage

Heinz M. Kabutz (via Hacker News):

From Java 1.0 up to 1.6, String tried to avoid creating new char[]’s. The substring() method would share the same underlying char[], with a different offset and length. For example, in StringChars we have two Strings, with "hello" a substring of "hello_world". However, they share the same char[].
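The sharing described above can be modelled roughly as follows. This is a minimal sketch, not the JDK source: `SharedString` and its field names are hypothetical stand-ins for the old `String` layout of a backing array plus an offset and a length.

```java
// Simplified model of the pre-Java-7 String layout. SharedString is a
// hypothetical class for illustration only, not the real java.lang.String.
final class SharedString {
    final char[] value;  // backing array, possibly shared with other instances
    final int offset;    // index of this string's first character in value
    final int count;     // number of characters in this string

    SharedString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    // Old behaviour: constant time, no copying. The result keeps the
    // whole backing array reachable for as long as it lives.
    SharedString substring(int begin, int end) {
        if (begin < 0 || end > count || begin > end) {
            throw new StringIndexOutOfBoundsException();
        }
        return new SharedString(value, offset + begin, end - begin);
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        SharedString helloWorld = new SharedString("hello_world");
        SharedString hello = helloWorld.substring(0, 5);
        System.out.println(hello);                            // hello
        System.out.println(hello.value == helloWorld.value);  // true
    }
}
```

Both objects point at the same array, which is why a tiny substring could pin a very large one in memory.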

This is no longer the case with Java 7:

“Why this change?”, you may ask. It turns out that too many programmers used substring() as a memory saving method. Let’s say that you have a 1 MB String, but you actually only need the first 5 KB. You could then create a substring, expecting the rest of that 1 MB String to be thrown away. Except it didn’t. Since the new String would share the same underlying char[], you would not save any memory at all. The correct code idiom was therefore to append the substring to an empty String, which would have the side effect of always producing a new unshared char[] in the case that the String length did not correspond to the char[] length.
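A common variant of that defensive-copy idiom used the `String(String)` constructor instead of concatenation. The sketch below assumes that variant; the helper name `firstChars` is made up for illustration. On Java 6 the constructor trimmed an oversized shared array, while on newer JVMs `substring()` already copies, so the extra `new String` is redundant but harmless.

```java
import java.util.Arrays;

final class DefensiveCopy {
    // Returns the first `length` characters of `big` as a String that does
    // not depend on big's backing storage. firstChars is a hypothetical helper.
    static String firstChars(String big, int length) {
        return new String(big.substring(0, length));
    }

    public static void main(String[] args) {
        char[] chars = new char[1 << 20];  // roughly one million characters
        Arrays.fill(chars, 'x');
        String big = new String(chars);

        String small = firstChars(big, 5 * 1024);
        System.out.println(small.length());  // 5120
    }
}
```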

Apple’s GCD actually employs the opposite optimization: sometimes a larger data object shares the storage of two smaller ones, unless you specifically ask for contiguous storage.


So, it’s a trade-off between the original and new behaviour; the original way more or less caps the memory usage at the size of the original string, but at the expense of not being able to GC it if even a single substring exists, while the new way increases memory usage for each substring generated but does not prevent any of the strings from being GC’d.

This is why the article’s code example yields such a huge difference in memory usage in Java 6 vs Java 7; iterating through a large string and generating lots of substrings is effectively a sort of "anti-pattern" when used against the new substring() method.
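To put a rough number on that pattern, here is a small sketch that adds up how many characters get copied when every proper suffix of a string is taken, assuming each `substring()` call copies its result as the new implementation does. The class and method names are hypothetical.

```java
final class SubstringCost {
    // Sums the lengths of all proper suffixes s.substring(1), s.substring(2), ...
    // Under the copying implementation this is the number of characters copied.
    // The loop starts at 1 because substring(0) returns the string itself.
    static long charsCopiedForProperSuffixes(String s) {
        long copied = 0;
        for (int i = 1; i < s.length(); i++) {
            copied += s.substring(i).length();
        }
        return copied;
    }

    public static void main(String[] args) {
        String s = new String(new char[1000]);  // 1000 characters
        System.out.println(charsCopiedForProperSuffixes(s));  // 499500
    }
}
```

For a string of length n the total is n(n-1)/2, so the work grows quadratically, whereas with shared storage each substring cost a constant amount.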


The thing I mind is that there is now no way to get the old behavior. String is a final class, so you cannot override it and add a field, even. You can roll your own - if there is no code you do not control that takes a string. (And if you don’t mind having to write your own string class!)
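A roll-your-own version might look something like this sketch: a `CharSequence` view that shares the original string instead of copying it. `StringSlice` is a hypothetical name, and as the commenter notes it only helps where the receiving code accepts a `CharSequence` rather than a `String`.

```java
// A non-copying view over part of a String. StringSlice is a hypothetical
// class for illustration; it keeps the whole source string reachable.
final class StringSlice implements CharSequence {
    private final String source;
    private final int start;  // inclusive index into source
    private final int end;    // exclusive index into source

    StringSlice(String source, int start, int end) {
        if (start < 0 || end > source.length() || start > end) {
            throw new IndexOutOfBoundsException();
        }
        this.source = source;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException();
        }
        return source.charAt(start + index);
    }

    // Constant time: the new slice refers to the same source string.
    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException();
        }
        return new StringSlice(source, start + from, start + to);
    }

    // Copies only when a real String is finally needed.
    @Override
    public String toString() {
        return source.substring(start, end);
    }

    public static void main(String[] args) {
        CharSequence hello = new StringSlice("hello_world", 0, 5);
        System.out.println(hello);                    // hello
        System.out.println(hello.subSequence(1, 3));  // el
    }
}
```

The JDK's `java.nio.CharBuffer.wrap(CharSequence)` gives a similar read-only view without a custom class.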


I would not have objected to releasing this change in Java 1.7. But releasing it in a BUG FIX RELEASE causes me to lose confidence in the maintainers of the JVM. One of the reasons that large companies like mine build in Java is because of Sun’s long history of extremely careful attention to backward compatibility. Oracle is no Sun.


I agree - echoing the sentiments of another commentator here, I feel like one of the tenets of Java is backwards compatibility. While the change doesn’t affect functionality, it can turn code that previously had a space complexity of O(1) into one that is O(n). This is probably a Bad Thing.


I’m the author of the substring() change, though in total disclosure the work and analysis on this began long before I took on the task. As has been suggested in the analysis here, there were two motivations for the change:

  • reduce the size of String instances. Strings are typically 20-40% of common apps’ footprint. Any change which increases the size of String instances would dramatically increase memory pressure. This change to String came in at the same time as the alternative String hash code and we needed another field to cache the additional hash code. The offset/count removal afforded us the space we needed for the added hash code cache. This was the trigger.
  • avoid memory leakage caused by retained substrings holding the entire character array. This was a longstanding problem with many apps and was quite significant in many cases. Over the years many libraries and parsers have specifically avoided returning substring results to avoid creating leaked Strings.


The comments about the substring operation becoming O(n) assume that the substring result is allocated in the general heap. This is not commonly the case and allocation in the TLAB is very much like malloca()--allocation merely bumps a pointer.


We investigated the regressions to see if performance was still acceptable and correctness was maintained. The most significant performance drop turned out to be in an obsolete benchmark which did hundreds of random substrings on a 1MB string and put the substrings into a map. It then later compared the map contents to verify correctness. We concluded that this case was not representative of common usage. Most other applications saw positive footprint and performance improvements or no significant change at all. A few apps, generally older parsers, had minor footprint growth.


Our importer went from a few minutes to parse a couple gigabytes of data to literally centuries. In the context of theoretical computer science that means correctness is preserved. In the real world, however, this means that the program stops progressing until a frustrated user presses the cancel button and calls our hotline.


It could have been so easy. Introduce a new function called something like subcopy(). Make substring() deprecated. In the deprecation comment, explain the memory leak problem and announce that substring() is scheduled for removal in java 2.0. Port the jdk and glassfish and your other applications which might have a problem to use subcopy() everywhere when available. Check for performance regressions. Once java 2.0 is released, you can reclaim the memory for the offset and index variables.

And here is the crux of the problem: there is no java 2.0. The optimal time frame for making a set of major changes to the language has already passed, and nobody dares to propose it now. What you do instead is to release backwards incompatible changes anyway, as we see here, because you cannot fix all the old problems in any other way. This was already bad when upgrading between minor versions. Now we get the same in bugfix releases, and additionally, we need to look up some new bizarre numbering scheme to see which bugfix release is actually just fixing bugs and which isn’t.

1 Comment

Can't really blame Oracle for this. Apple does the same thing now. Linux has been doing it for years. There is no such thing as a "BUG FIX RELEASE" anymore. The industry standard is now continuous integration. Major version numbers have little meaning outside of the marketing department. Any change can go into any new build. Sun didn't use to do it that way, but Sun isn't in business anymore, are they? Vendors don't care about 3rd party software developers anymore. They are marketing their own software. It is our responsibility to test our software with each and every minor release. If it is broken, then we should file a bug report. We should not, however, assume that it will get fixed. Those beta builds are there to highlight bugs in the vendor layer so that we can patch our apps to work around them. Vendors no longer guarantee fixes. Vendors don't even guarantee testing anymore. That has been crowdsourced and is now our responsibility.
