Java Strings No Longer Share Storage
Heinz M. Kabutz (via Hacker News):
From Java 1.0 up to 1.6, String tried to avoid creating new char[]’s. The substring() method would share the same underlying char[], with a different offset and length. For example, in StringChars we have two Strings, with "hello" a substring of "hello_world". However, they share the same char[].
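To make the sharing concrete, here is a minimal sketch of the old layout as described above. SharedString and its methods are hypothetical names used for illustration; this is a simplified model, not the JDK source.

```java
import java.util.Arrays;

/** Simplified, hypothetical model of a String that can share its backing array. */
final class SharedString {
    private final char[] value; // backing array, possibly shared with other instances
    private final int offset;   // index in value where this string starts
    private final int count;    // number of characters in this string

    SharedString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    /** Old-style substring: constant time, reuses the backing array. */
    SharedString substring(int begin, int end) {
        checkRange(begin, end);
        return new SharedString(value, offset + begin, end - begin);
    }

    /** New-style substring: copies only the requested characters. */
    SharedString copyingSubstring(int begin, int end) {
        checkRange(begin, end);
        char[] copy = Arrays.copyOfRange(value, offset + begin, offset + end);
        return new SharedString(copy, 0, copy.length);
    }

    private void checkRange(int begin, int end) {
        if (begin < 0 || end > count || begin > end) {
            throw new StringIndexOutOfBoundsException();
        }
    }

    /** True if both instances point at the very same array object. */
    boolean sharesStorageWith(SharedString other) {
        return value == other.value;
    }

    /** Length of the array this instance keeps reachable. */
    int retainedChars() {
        return value.length;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }
}
```

In this model, taking "hello" out of "hello_world" with substring() keeps all 11 characters reachable, while copyingSubstring() keeps only 5.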
This is no longer the case with Java 7:
“Why this change?”, you may ask. It turns out that too many programmers used substring() as a memory saving method. Let’s say that you have a 1 MB String, but you actually only need the first 5 KB. You could then create a substring, expecting the rest of that 1 MB String to be thrown away. Except it didn’t. Since the new String would share the same underlying char[], you would not save any memory at all. The correct code idiom was therefore to append the substring to an empty String, which would have the side effect of always producing a new unshared char[] in the case that the String length did not correspond to the char[] length.
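A sketch of that kind of defensive copy follows. It uses the String(String) copy constructor, which is one commonly cited way to force a fresh array under the old behavior; that choice, and the names SubstringCopy and firstChars, are assumptions for illustration rather than the article's own code.

```java
/** Hypothetical helper showing a defensive copy of a substring. */
final class SubstringCopy {
    /**
     * Returns the first n characters of big as a separate String.
     * Under the old sharing behavior, substring() alone would keep the whole
     * backing array of big reachable; wrapping it in new String(...) was meant
     * to detach the result. Where substring() already copies, the extra
     * constructor call is redundant but harmless.
     */
    static String firstChars(String big, int n) {
        return new String(big.substring(0, n));
    }
}
```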
Apple’s GCD actually employs the opposite optimization: sometimes a larger data object shares the storage of two smaller ones, unless you specifically ask for contiguous storage.
So, it’s a trade-off between the original and new behaviour; the original way more or less caps the memory usage at the size of the original string, but at the expense of not being able to GC it if even a single substring exists, while the new way increases memory usage for each substring generated but does not prevent any of the strings from being GC’d.
This is why the article’s code example yields such a huge difference in memory usage in Java 6 vs Java 7; it is effectively a sort of "anti-pattern" when used against the new substring() method (i.e. iterating through a large string and generating lots of sub-strings).
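As a rough illustration of that pattern (this is not the article's example, and the class and method names are made up), the sketch below walks a string and takes every proper suffix. With a copying substring(), each suffix gets its own storage, so the total grows quadratically with the input length; with shared storage, no characters would be copied at all.

```java
import java.util.Arrays;

/** Hypothetical demo: how many characters the substrings of one string hold in total. */
final class SubstringCopyCost {
    /** Builds a string of n identical characters. */
    static String filled(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'x');
        return new String(chars);
    }

    /**
     * Sums the lengths of s.substring(i) for i = 1 .. length-1.
     * For a string of length n this is n * (n - 1) / 2 characters.
     */
    static long charsInProperSuffixes(String s) {
        long total = 0;
        for (int i = 1; i < s.length(); i++) {
            total += s.substring(i).length();
        }
        return total;
    }
}
```

For a 1,000-character input the suffixes add up to 499,500 characters, which is why keeping many substrings of one large string alive behaves so differently under the two implementations.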
The thing I mind is that there is now no way to get the old behavior. String is a final class, so you cannot override it and add a field, even. You can roll your own - if there is no code you do not control that takes a string. (And if you don’t mind having to write your own string class!)
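One way to "roll your own" is a small view class that implements CharSequence and refers back to the original String. The Slice class below is a hypothetical sketch; as the comment notes, it only helps with code that accepts a CharSequence rather than a String.

```java
/** Hypothetical read-only view of part of a String; no characters are copied. */
final class Slice implements CharSequence {
    private final String source; // the original string, kept alive by this view
    private final int start;     // inclusive start index in source
    private final int end;       // exclusive end index in source

    Slice(String source, int start, int end) {
        if (start < 0 || end > source.length() || start > end) {
            throw new IndexOutOfBoundsException();
        }
        this.source = source;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException();
        }
        return source.charAt(start + index);
    }

    /** Returns another view over the same source; still no copying. */
    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException();
        }
        return new Slice(source, start + from, start + to);
    }

    /** Materializes the view as a real String (this is where a copy may happen). */
    @Override
    public String toString() {
        return source.substring(start, end);
    }
}
```

Like the old substring(), such a view keeps the whole source string reachable for as long as the view exists.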
I would not have objected to releasing this change in Java 1.7. But releasing it in a BUG FIX RELEASE causes me to lose confidence in the maintainers of the JVM. One of the reasons that large companies like mine build in Java is because of Sun’s long history of extremely careful attention to backward compatibility. Oracle is no Sun.
I agree - echoing the sentiments of another commentator here, I feel like one of the tenets of Java is backwards compatibility. While the change doesn’t affect functionality, it can turn code that previously had a space complexity of O(1) into one that is O(n). This is probably a Bad Thing.
I’m the author of the substring() change though in total disclosure the work and analysis on this began long before I took on the task. As has been suggested in the analysis here there were two motivations for the change;
- reduce the size of String instances. Strings are typically 20-40% of common apps’ footprint. Any change which increases the size of String instances would dramatically increase memory pressure. This change to String came in at the same time as the alternative String hash code and we needed another field to cache the additional hash code. The offset/count removal afforded us the space we needed for the added hash code cache. This was the trigger.
- avoid memory leakage caused by retained substrings holding the entire character array. This was a longstanding problem with many apps and was quite significant in many cases. Over the years many libraries and parsers have specifically avoided returning substring results to avoid creating leaked Strings.
[…]
The comments about the substring operation becoming O(n) assume that the substring result is allocated in the general heap. This is not commonly the case and allocation in the TLAB is very much like malloca() --allocation merely bumps a pointer.
[…]
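For readers unfamiliar with the idea, "bumping a pointer" can be sketched with a toy allocator over a fixed-size region. This is only an illustration of the technique, with hypothetical names; it is not how HotSpot implements its thread-local allocation buffers.

```java
/** Toy bump-pointer allocator over a region of a fixed number of bytes. */
final class BumpAllocator {
    private final int capacity; // total size of the region
    private int top = 0;        // offset of the first free byte

    BumpAllocator(int capacity) {
        this.capacity = capacity;
    }

    /**
     * Reserves size bytes and returns the offset of the reserved block,
     * or -1 if the region does not have enough room left.
     * The whole operation is a bounds check plus one addition.
     */
    int allocate(int size) {
        if (size < 0 || size > capacity - top) {
            return -1;
        }
        int start = top;
        top += size; // the "bump"
        return start;
    }

    /** Number of bytes handed out so far. */
    int used() {
        return top;
    }
}
```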
We investigated the regressions to see if performance was still acceptable and correctness was maintained. The most significant performance drop turned out to be in an obsolete benchmark which did hundreds of random substrings on a 1MB string and put the substrings into a map. It then later compared the map contents to verify correctness. We concluded that this case was not representative of common usage. Most other applications saw positive footprint and performance improvements or no significant change at all. A few apps, generally older parsers, had minor footprint growth.
Our importer went from a few minutes to parse a couple gigabytes of data to literally centuries. In the context of theoretical computer science that means correctness is preserved. In the real world, however, this means that the program stops progressing until a frustrated user presses the cancel button and calls our hotline.
[…]
It could have been so easy. Introduce a new function called something like subcopy(). Make substring() deprecated. In the deprecation comment, explain the memory leak problem and announce that substring() is scheduled for removal in java 2.0. Port the jdk and glassfish and your other applications which might have a problem to use subcopy() everywhere when available. Check for performance regressions. Once java 2.0 is released, you can reclaim the memory for the offset and index variables.
And here is the crux of the problem: there is no java 2.0. The optimal time frame for making a set of major changes to the language has already passed, and nobody dares to propose it now. What you do instead is to release backwards incompatible changes anyway, as we see here, because you cannot fix all the old problems in any other way. This was already bad when upgrading between minor versions. Now we get the same in bugfix releases, and additionally, we need to look up some new bizarre numbering scheme to see which bugfix release is actually just fixing bugs and which isn’t.