Thursday, June 24, 2010

Java theory and practice: Garbage collection in the HotSpot JVM

Summary:  In last month's Java theory and practice, columnist Brian Goetz reviewed the basic algorithms for garbage collection. This month, he goes a step further and examines how the 1.4.1 JVM actually handles garbage collection, including some of the new garbage collection options for multiprocessor systems.





Last month, we looked at the classic garbage collection techniques of reference counting, copying, mark-sweep, and mark-compact. Each of these approaches has advantages and disadvantages in certain situations. For example, copying does well when a large proportion of objects are garbage, but does poorly with many long-lived objects (copying them repeatedly). Conversely, mark-compact does quite well with long-lived objects (copying them only once), but not so well with many short-lived objects. The technique used by the 1.2 and later JVMs, called generational garbage collection, combines these two techniques to get the best of both worlds, and as a bonus provides very low object allocation overhead.
In any application heap, some objects become garbage shortly after their creation, some survive for a long time and then become garbage, and others can remain live for the entirety of the program's run. Empirical studies have shown that for most object-oriented languages, the Java language included, the vast majority of objects -- as much as 98 percent, depending on your metric for object youth -- die young. One can measure an object's age in wall-clock seconds, in total bytes allocated by the memory management subsystem since the object was allocated, or the number of garbage collections since the object was allocated. But no matter how you measure, the studies show the same thing -- most objects die young. The fact that most objects die young has significance for the choice of collector. In particular, copying collectors perform quite well when the majority of objects die young, since copying collectors do not visit dead objects at all; they simply copy the live objects to another heap region, then reclaim the whole of the remaining space in one fell swoop.
Of the objects that survive past their first collection, a significant portion of those will become long-lived or permanent. The various garbage collection strategies perform very differently depending on the mix of short-lived and long-lived objects. Copying collectors work very well when most objects die young, because objects that die young never need to be copied at all. However, the copying collector deals poorly with long-lived objects, repeatedly copying them back and forth from one semi-space to another. Conversely, mark-compact collectors do very well with long-lived objects, because long-lived objects tend to accumulate at the bottom of the heap and then do not need to be copied again. Mark-sweep and mark-compact collectors, however, expend considerably more effort examining dead objects, because they must examine every object in the heap during the sweep phase.
A generational collector divides the heap into multiple generations. Objects are created in the young generation, and objects that meet some promotion criteria, such as having survived a certain number of collections, are then promoted to the next older generation. A generational collector is free to use a different collection strategy for different generations and perform garbage collection on the generations separately.
One of the advantages of generational collection is that it can make garbage collection pauses shorter by not collecting all generations at once. When the allocator is unable to fulfill an allocation request, it first triggers a minor collection, which only collects the youngest generation. Since many of the objects in the young generation will already be dead and the copying collector does not need to examine dead objects at all, minor collection pauses can be quite short and can often reclaim significant heap space. If the minor collection frees enough heap space, the user program can resume immediately. If it does not free enough heap space, it proceeds to collect higher generations until enough memory has been reclaimed. (In the event the garbage collector cannot reclaim enough memory after a full collection, it will either expand the heap or it will throw anOutOfMemoryError.)
Tracing garbage collectors, such as copying, mark-sweep, and mark-compact, all start scanning from the root set, traversing references between objects, until all live objects have been visited.
A generational tracing collector starts from the root set, but does not traverse references that lead to objects in the older generation, which reduces the size of the object graph to be traced. But this creates a problem -- what if an object in the older generation references a younger object, which is not reachable through any other chain of references from a root?
To address this problem, generational collectors must explicitly track references from older objects to younger objects and add these old-to-young references into the root set of the minor collection. There are two ways to create a reference from an old object to a young object. Either one of the references contained in an old object is modified to refer to a young object, or a young object that refers to other young objects is promoted into the older generation.
Whether an old-to-young reference is created by promotion or a pointer modification, the garbage collector needs to have a comprehensive set of old-to-young references when it wants to perform a minor collection. One way to do this would be to trace the old generation, but this clearly has significant overhead. Somewhat better would be to linearly scan the old generation looking for references to young objects. This approach is faster than tracing and has better locality, but is still considerable work.
The mutator and garbage collector can work together to maintain a comprehensive list of old-to-young references as they are created. When objects are promoted into the older generation, the garbage collector can note any old-to-young references that are created as a result of the promotion, which leaves only the tracking of intergenerational references created by pointer modifications.
The garbage collector can track old-to-young references that arise through modifying references held within existing objects in several ways. It could track them in the same manner as maintaining reference counts in reference-counting collectors (the compiler could generate additional instructions surrounding pointer assignments) or could use virtual memory protection on the old generation heap to trap writes to older objects. Another potentially more efficient virtual memory approach would be to use the page modification dirty bits in the old generation heap to identify blocks to scan for objects containing old-to-young pointers.
With a little cleverness, it is possible to avoid the overhead of tracking every pointer modification and inspecting it to see if it crosses generational boundaries. For example, it is not necessary to track stores to local or static variables, because they will already be part of the root set. It may also be possible to avoid tracking pointer stores within constructors that simply initialize fields of newly created objects (so-called initializing stores), as (almost) all objects are allocated in the young generation. In any case, the runtime must maintain an explicit set of references from old objects to young ones and add these references to the root set when collecting the young generation.
In Figure 1, the arrows represent references between objects in the heap. The red arrows represent old-to-young references that must be added to the root set for a minor collection. The blue arrows represent references to old objects, either from the root set or from the young generation, which don't need to be traced when collecting only the young generation.

Figure 1. Intergenerational references
Intergenerational references
The Sun JDKs use an optimized variant of an algorithm called card marking to identify modifications to pointers held in fields of old-generation objects. In this approach, the heap is divided into a set of cards, each of which is usually smaller than a memory page. The JVM maintains a card map, with one bit (or byte, in some implementations) corresponding to each card in the heap. Each time a pointer field in an object in the heap is modified, the corresponding bit in the card map for that card is set. At garbage collection time, the mark bits associated with cards in the old generation are examined, and dirty cards are scanned for objects containing references into the younger generation. Then the mark bits are cleared. There are several costs to card marking – additional space for the card map, additional work to be done on each pointer store, and additional work to be done at garbage collection time. Card marking algorithms can add as little as two or three machine instructions per non-initializing heap pointer store, and entails scanning any objects on dirty cards at minor collection time.
By default, the 1.4.1 JDK divides the heap into two sections, a young generation and an old generation. (Actually, there is also a third section, the permanent space, which is used for storing loaded class and method objects.) The young generation is divided into a creation space, often called Eden, and two survivor semi-spaces, using a copying collector.
The old generation uses a mark-compact collector. Objects are promoted from the young generation to the old generation after they have survived copying a certain number of times. A minor collection will copy live objects from Eden and one of the survivor semi-spaces into the other survivor space, potentially promoting some objects to the older generation. A major collection will collect both the young and old generation. The System.gc() method always triggers a major collection, which is one of the reasons you should use System.gc() sparingly, if at all, because major collections can take much longer than a minor collection. There is no way to programmatically trigger a minor collection.
In addition to the copying and mark-compact collectors used by default, the 1.4.1 JDK also contains four other garbage collection algorithms, each of which is suited to a different purpose. JDK 1.4.1 includes an incremental collector (which has been around since JDK 1.2) and three new collectors for more efficient collection on multiprocessor systems -- the parallel copying collector, the parallel scavenging collector, and the concurrent mark-sweep collector. These new collectors address the problem of the garbage collector being a scalability bottleneck on multiprocessor systems. Figure 2 shows some guidelines on when to choose alternate collection options.

Figure 2. 1.4.1 garbage collection options (courtesy of Folgmann IT-Consulting)
Garbage collection options
The incremental collection option has been part of the JDK since 1.2. Incremental collection reduces garbage collection pauses at the expense of throughput, making it desirable only in cases where shorter collection pauses are very important, such as near-realtime systems.
The algorithm used by the JDK for incremental collection, the Train algorithm, creates a new section of the heap between the old and young generations. These heap sections are divided into "trains," each of which is divided into a sequence of "cars." Each car can be collected separately. Effectively, each train car constitutes a separate generation, which means that not only must old-to-young references be tracked, but also references from older trains to younger trains and older cars to younger cars. This creates significant extra work for the mutator and garbage collector, but permits much shorter collection pauses.
The new collectors in JDK 1.4.1 are all designed to address the issue of garbage collection on multiprocessor systems. Because most garbage collection algorithms stop the world for some period of time, a single-threaded garbage collector can quickly become a scalability bottleneck, because all but one processor are idle while the garbage collector has suspended the user program threads. Two of the new collectors are designed to reduce collection pause time -- the parallel copying collector and the concurrent mark-sweep collector. The other, the parallel scavenge collector, is designed for higher throughput on large heaps.
The parallel copying collector, enabled by the -XX:+UseParNewGC JVM option, is a young-generation copying collector that divides the work of garbage collection among as many threads as there are CPUs. The concurrent mark-sweep collector, enabled by the -XX:+UseConcMarkSweepGC option, is an old-generation mark-sweep collector that stops the world briefly for an initial mark phase (and again later for a brief re-mark phase) and then allows the user program to resume while the garbage collector threads execute concurrently with the user program. The parallel copying collector and concurrent mark-sweep collector are basically concurrent versions of the default copying and mark-compact collectors. The parallel scavenging collector, enabled by -XX:+UseParallelGC, is a young-generation collector optimized for very large (gigabyte and larger) heaps on multiprocessor systems.
With six algorithms to choose from, you're probably wondering which one to use. Figure 2 offers some guidance, dividing collectors into single-threaded and concurrent, and into low-pause and high-throughput. Given what you know about your application and deployment environment, this may be enough to select the appropriate algorithm. For many applications, the default collectors work just fine -- so if you don't have a performance problem, there's no point in inviting more complexity. However, if your application is deployed on a multiprocessor system or uses a very large heap, you may get some performance boost from changing collector options.
JDK 1.4.1 also includes numerous options for tuning garbage collection. You can spend considerable time tweaking these options and measuring their effect, so before you try and tune the garbage collector, you'll probably get a better return on your tuning investment by making sure your application has been thoroughly profiled and optimized first.
The first place to start with tuning garbage collection is to examine the verbose GC output. This will give you information on the frequency, timing, and duration of garbage collection operations. The simplest form of garbage collection tuning is simply expanding the maximum heap size (-Xmx). Copying collection becomes more efficient as the heap grows, so by expanding the heap, you reduce the cost of collection on a per-object basis. In addition to increasing the maximum heap size, you can also use the -XX:NewRatio option to increase the proportion of space assigned to the young generation. You also can specify the size of the young generation explicitly with the -Xmn option. See Resources for a number of articles offering more detailed garbage collection tuning advice.
As the JVM has evolved, the default garbage collector has gotten better and better. The generational collector employed by JDK 1.2 and later offers far better allocation and collection performance than the mark-sweep-compact collector used by earlier JDKs. The 1.4.1 JDK further improves the effectiveness of garbage collection by adding new multithreaded collection options for multiprocessor systems and very large heaps.
Next month, we'll wrap up our exploration of garbage collection by looking at some performance hints and myths about garbage collection, including the true cost of object allocation, the costs and benefits of explicit nulling, and the cost of finalization.



Java theory and practice: Garbage collection and performance

Summary:  The past two installments of Java theory and practice have discussed various techniques for garbage collection and the basics of the JDK 1.4.1 garbage collectors. This month, columnist Brian Goetz looks at the performance impact of the choice of collector, how various coding idioms interact with the garbage collector, and how allocation and other related costs have changed in Java virtual machines over the past several years.





In the early days of Java technology, allocating objects got a pretty bad rap. There were lots of articles (including some by this author) advising developers to avoid creating temporary objects unnecessarily because allocation (and the corresponding garbage-collection overhead) was expensive. While this used to be good advice (in situations where performance was significant), it is no longer generally applicable to all but the most performance-critical situations.
The 1.0 and 1.1 JDKs used a mark-sweep collector, which did compaction on some -- but not all -- collections, meaning that the heap might be fragmented after a garbage collection. Accordingly, memory allocation costs in the 1.0 and 1.1 JVMs were comparable to that in C or C++, where the allocator uses heuristics such as "first-first" or "best-fit" to manage the free heap space. Deallocation costs were also high, since the mark-sweep collector had to sweep the entire heap at every collection. No wonder we were advised to go easy on the allocator.
In HotSpot JVMs (Sun JDK 1.2 and later), things got a lot better -- the Sun JDKs moved to a generational collector. Because a copying collector is used for the young generation, the free space in the heap is always contiguous so that allocation of a new object from the heap can be done through a simple pointer addition, as shown in Listing 1. This makes object allocation in Java applications significantly cheaper than it is in C, a possibility that many developers at first have difficulty imagining. Similarly, because copying collectors do not visit dead objects, a heap with a large number of temporary objects, which is a common situation in Java applications, costs very little to collect; simply trace and copy the live objects to a survivor space and reclaim the entire heap in one fell swoop. No free lists, no block coalescing, no compacting -- just wipe the heap clean and start over. So both allocation and deallocation costs per object went way down in JDK 1.2.

Listing 1. Fast allocation in a contiguous heap
void *malloc(int n) { 
  synchronized (heapLock) {
    if (heapTop - heapStart > n)
      doGarbageCollection();

    void *wasStart = heapStart;
    heapStart += n;
    return wasStart;
  }
}

Performance advice often has a short shelf life; while it was once true that allocation was expensive, it is now no longer the case. In fact, it is downright cheap, and with a few very compute-intensive exceptions, performance considerations are generally no longer a good reason to avoid allocation. Sun estimates allocation costs at approximately ten machine instructions. That's pretty much free -- certainly no reason to complicate the structure of your program or incur additional maintenance risks for the sake of eliminating a few object creations.
Of course, allocation is only half the story -- most objects that are allocated are eventually garbage collected, which also has costs. But there's good news there, too. The vast majority of objects in most Java applications become garbage before the next collection. The cost of a minor garbage collection is proportional to the number of live objects in the young generation, not the number of objects allocated since the last collection. Because so few young generation objects survive to the next collection, the amortized cost of collection per allocation is fairly small (and can be made even smaller by simply increasing the heap size, subject to the availability of enough memory).
The JIT compiler can perform additional optimizations that can reduce the cost of object allocation to zero. Consider the code in Listing 2, where the getPosition() method creates a temporary object to hold the coordinates of a point, and the calling method uses the Point object briefly and then discards it. The JIT will likely inline the call to getPosition() and, using a technique called escape analysis, can recognize that no reference to the Point object leaves the doSomething() method. Knowing this, the JIT can then allocate the object on the stack instead of the heap or, even better, optimize the allocation away completely and simply hoist the fields of the Point into registers. While the current Sun JVMs do not yet perform this optimization, future JVMs probably will. The fact that allocation can get even cheaper in the future, with no changes to your code, is just one more reason not to compromise the correctness or maintainability of your program for the sake of avoiding a few extra allocations.

Listing 2. Escape analysis can eliminate many temporary allocations entirely
void doSomething() { 
  Point p = someObject.getPosition();
  System.out.println("Object is at (" + p.x, + ", " + p.y + ")");
}

...

Point getPosition() { 
  return new Point(myX, myY);
}

Listing 1 shows that while allocation itself is fast, access to the heap structure must be synchronized across threads. So doesn't that make the allocator a scalability hazard? There are several clever tricks JVMs use to reduce this cost significantly. IBM JVMs use a technique called thread-local heaps, by which each thread requests a small block of memory (on the order of 1K) from the allocator, and small object allocations are satisfied out of that block. If the program requests a larger block than can be satisfied using the small thread-local heap, then the global allocator is used to either satisfy the request directly or to allocate a new thread-local heap. By this technique, a large percentage of allocations can be satisfied without contending for the shared heap lock. (Sun JVMs use a similar technique, instead using the term "Local Allocation Blocks.")
Objects with finalizers (those that have a non-trivial finalize() method) have significant overhead compared to objects without finalizers, and should be used sparingly. Finalizeable objects are both slower to allocate and slower to collect. At allocation time, the JVM must register any finalizeable objects with the garbage collector, and (at least in the HotSpot JVM implementation) finalizeable objects must follow a slower allocation path than most other objects. Similarly, finalizeable objects are slower to collect, too. It takes at least two garbage collection cycles (in the best case) before a finalizeable object can be reclaimed, and the garbage collector has to do extra work to invoke the finalizer. The result is more time spent allocating and collecting objects and more pressure on the garbage collector, because the memory used by unreachable finalizeable objects is retained longer. Combine that with the fact that finalizers are not guaranteed to run in any predictable timeframe, or even at all, and you can see that there are relatively few situations for which finalization is the right tool to use.
If you must use finalizers, there are a few guidelines you can follow that will help contain the damage. Limit the number of finalizeable objects, which will minimize the number of objects that have to incur the allocation and collection costs of finalization. Organize your classes so that finalizeable objects hold no other data, which will minimize the amount of memory tied up in finalizeable objects after they become unreachable, as there can be a long delay before they are actually reclaimed. In particular, beware when extending finalizeable classes from standard libraries.
Because allocation and garbage collection at one time imposed significant performance costs on Java programs, many clever tricks were developed to reduce these costs, such as object pooling and nulling. Unfortunately, in many cases these techniques can do more harm than good to your program's performance.
Object pooling is a straightforward concept -- maintain a pool of frequently used objects and grab one from the pool instead of creating a new one whenever needed. The theory is that pooling spreads out the allocation costs over many more uses. When the object creation cost is high, such as with database connections or threads, or the pooled object represents a limited and costly resource, such as with database connections, this makes sense. However, the number of situations where these conditions apply is fairly small.
In addition, object pooling has some serious downsides. Because the object pool is generally shared across all threads, allocation from the object pool can be a synchronization bottleneck. Pooling also forces you to manage deallocation explicitly, which reintroduces the risks of dangling pointers. Also, the pool size must be properly tuned to get the desired performance result. If it is too small, it will not prevent allocation; and if it is too large, resources that could get reclaimed will instead sit idle in the pool. By tying up memory that could be reclaimed, the use of object pools places additional pressure on the garbage collector. Writing an effective pool implementation is not simple.
In his "Performance Myths Exposed" talk at JavaOne 2003, Dr. Cliff Click offered concrete benchmarking data showing that object pooling is a performance loss for all but the most heavyweight objects on modern JVMs. Add in the serialization of allocation and the dangling-pointer risks, and it's clear that pooling should be avoided in all but the most extreme cases.
Explicit nulling is simply the practice of setting reference objects to null when you are finished with them. The idea behind nulling is that it assists the garbage collector by making objects unreachable earlier. Or at least that's the theory.
There is one case where the use of explicit nulling is not only helpful, but virtually required, and that is where a reference to an object is scoped more broadly than it is used or considered valid by the program's specification. This includes cases such as using a static or instance field to store a reference to a temporary buffer, rather than a local variable, or using an array to store references that may remain reachable by the runtime but not by the implied semantics of the program. Consider the class in Listing 3, which is an implementation of a simple bounded stack backed by an array. When pop() is called, without the explicit nulling in the example, the class could cause a memory leak (more properly called "unintentional object retention," or sometimes called "object loitering") because the reference stored in stack[top+1] is no longer reachable by the program, but still considered reachable by the garbage collector.

Listing 3. Avoiding object loitering in a stack implementation
public class SimpleBoundedStack {
  private static final int MAXLEN = 100;
  private Object stack[] = new Object[MAXLEN];
  private int top = -1;

  public void push(Object p) { stack [++top] = p;}

  public Object pop() {
    Object p = stack [top];
    stack [top--] = null;  // explicit null
    return p;
  }
}

In the September 1997 "Java Developer Connection Tech Tips" column (see Resources), Sun warned of this risk and explained how explicit nulling was needed in cases like the pop() example above. Unfortunately, programmers often take this advice too far, using explicit nulling in the hope of helping the garbage collector. But in most cases, it doesn't help the garbage collector at all, and in some cases, it can actually hurt your program's performance.
Consider the code in Listing 4, which combines several really bad ideas. The listing is a linked list implementation that uses a finalizer to walk the list and null out all the forward links. We've already discussed why finalizers are bad. This case is even worse because now the class is doing extra work, ostensibly to help the garbage collector, but that will not actually help -- and might even hurt. Walking the list takes CPU cycles and will have the effect of visiting all those dead objects and pulling them into the cache -- work that the garbage collector might be able to avoid entirely, because copying collectors do not visit dead objects at all. Nulling the references doesn't help a tracing garbage collector anyway; if the head of the list is unreachable, the rest of the list won't be traced anyway.

Listing 4. Combining finalizers and explicit nulling for a total performance disaster -- don't do this!
public class LinkedList {

  private static class ListElement {
    private ListElement nextElement;
    private Object value;
  }

  private ListElement head;

  ...

  public void finalize() { 
    try {
      ListElement p = head;
      while (p != null) {
        p.value = null;
        ListElement q = p.nextElement;
        p.nextElement = null;
        p = q;
      }
      head = null;
    }
    finally {
      super.finalize();
    }
  }
}

Explicit nulling should be saved for cases where your program is subverting normal scoping rules for performance reasons, such as the stack example in Listing 3 (a more correct -- but poorly performing -- implementation would be to reallocate and copy the stack array each time it is changed).
A third category where developers often mistakenly think they are helping the garbage collector is the use of System.gc(), which triggers a garbage collection (actually, it merely suggests that this might be a good time for a garbage collection). Unfortunately, System.gc() triggers a full collection, which includes tracing all live objects in the heap and sweeping and compacting the old generation. This can be a lot of work. In general, it is better to let the system decide when it needs to collect the heap, and whether or not to do a full collection. Most of the time, a minor collection will do the job. Worse, calls toSystem.gc() are often deeply buried where developers may be unaware of their presence, and where they might get triggered far more often than necessary. If you are concerned that your application might have hidden calls to System.gc() buried in libraries, you can invoke the JVM with the -XX:+DisableExplicitGC option to prevent calls to System.gc() and triggering a garbage collection.
No installment of Java theory and practice would be complete without some sort of plug for immutability. Making objects immutable eliminates entire classes of programming errors. One of the most common reasons given for not making a class immutable is the belief that doing so would compromise performance. While this is true sometimes, it is often not -- and sometimes the use of immutable objects has significant, and perhaps surprising, performance advantages.
Many objects function as containers for references to other objects. When the referenced object needs to change, we have two choices: update the reference (as we would in a mutable container class) or re-create the container to hold a new reference (as we would in an immutable container class). Listing 5 shows two ways to implement a simple holder class. Assuming the containing object is small, which is often the case (such as a Map.Entry element in a Map or a linked list element), allocating a new immutable object has some hidden performance advantages that come from the way generational garbage collectors work, having to do with the relative age of objects.

Listing 5. Mutable and immutable object holders
public class MutableHolder {
  private Object value;
  public Object getValue() { return value; }
  public void setValue(Object o) { value = o; }
}

public class ImmutableHolder {
  private final Object value;
  public ImmutableHolder(Object o) { value = o; }
  public Object getValue() { return value; }
}

In most cases, when a holder object is updated to reference a different object, the new referent is a young object. If we update aMutableHolder by calling setValue(), we have created a situation where an older object references a younger one. On the other hand, by creating a new ImmutableHolder object instead, a younger object is referencing an older one. The latter situation, where most objects point to older objects, is much more gentle on a generational garbage collector. If aMutableHolder that lives in the old generation is mutated, all the objects on the card that contain the MutableHolder must be scanned for old-to-young references at the next minor collection. The use of mutable references for long-lived container objects increases the work done to track old-to-young references at collection time. (See last month's article and this month'sResources, which explain the card-marking algorithm used to implement the write barrier in the generational collector used by current Sun JVMs).
A cover story in the July 2003 Java Developer's Journal illustrates how easy it is for good performance advice to become bad performance advice by simply failing to adequately identify the conditions under which the advice should be applied or the problem it was intended to solve. While the article contains some useful analysis, it will likely do more harm than good (and, unfortunately, far too much performance-oriented advice falls into this same trap).
The article opens with a set of requirements from a realtime environment, where unpredictable garbage collection pauses are unacceptable and there are strict operational requirements on how long a pause can be tolerated. The authors then recommend nulling references, object pooling, and scheduling explicit garbage collection to meet the performance goals. So far, so good -- they had a problem and they figured out what they had to do to solve it (although they appear to have failed to identify what the costs of these practices were or explore some less intrusive alternatives, such as concurrent collection). Unfortunately, the article's title ("Avoid Bothersome Garbage Collection Pauses") and presentation suggest that this advice would be useful for a wide range of applications -- perhaps all Java applications. This is terrible, dangerous performance advice!
For most applications, explicit nulling, object pooling, and explicit garbage collection will harm the throughput of your application, not improve it -- not to mention the intrusiveness of these techniques on your program design. In certain situations, it may be acceptable to trade throughput for predictability -- such as real-time or embedded applications. But for many Java applications, including most server-side applications, you probably would rather have the throughput.
The moral of the story is that performance advice is highly situational (and has a short shelf life). Performance advice is by definition reactive -- it is designed to address a particular problem that occurred in a particular set of circumstances. If the underlying circumstances change, or they are simply not applicable to your situation, the advice may not be applicable, either. Before you muck up your program's design to improve its performance, first make sure you have a performance problem and that following the advice will solve that problem.
Garbage collection has come a long way in the last several years. Modern JVMs offer fast allocation and do their job fairly well on their own, with shorter garbage collection pauses than in previous JVMs. Tricks such as object pooling or explicit nulling, which were once considered sensible techniques for improving performance, are no longer necessary or helpful (and may even be harmful) as the cost of allocation and garbage collection has been reduced considerably.


---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Java theory and practice: A brief history of garbage collection

Summary:  The Java language may be the most widely used programming language to rely on garbage collection, but it is by no means the first. Garbage collection has been an integral part of many programming languages, including Lisp, Smalltalk, Eiffel, Haskell, ML, Scheme, and Modula-3, and has been in use since the early 1960s. In this installment ofJava theory and practice, Brian Goetz describes the most common techniques for garbage collection. Over the next several months, he'll look at the garbage collection strategies employed by the 1.4 JVM, some performance implications of various garbage collection strategies, and how (as well as how not) to assist the garbage collector to yield better performance.





The benefits of garbage collection are indisputable -- increased reliability, decoupling of memory management from class interface design, and less developer time spent chasing memory management errors. The well-known problems of dangling pointers and memory leaks simply do not occur in Java programs. (Java programs can exhibit a form of memory leak, more accurately called unintentional object retention, but this is a different problem.) However, garbage collection is not without its costs -- among them performance impact, pauses, configuration complexity, and nondeterministic finalization.
An ideal garbage collection implementation would be totally invisible -- there would be no garbage collection pauses, no CPU time would be lost to garbage collection, the garbage collector wouldn't interact negatively with virtual memory or the cache, and the heap wouldn't need to be any larger than the residency (heap occupancy) of the application. Of course, there are no perfect garbage collectors, but garbage collectors have improved significantly over the past ten years.
The 1.3 JDK includes three different garbage collection strategies; the 1.4.1 JDK includes six, and over a dozen command-line options for configuring and tuning garbage collection. How do they differ? Why do we need so many?
The various garbage collection implementations use different strategies for identification and reclamation of unreachable objects, and they interact differently with the user program and scheduler. Different sorts of applications will have different requirements for garbage collection -- real-time applications will demand short and bounded-duration collection pauses, whereas enterprise applications may tolerate longer or less predictable pauses in favor of higher throughput.
There are several basic strategies for garbage collection: reference counting, mark-sweep, mark-compact, and copying. In addition, some algorithms can do their job incrementally (the entire heap need not be collected at once, resulting in shorter collection pauses), and some can run while the user program runs (concurrent collectors). Others must perform an entire collection at once while the user program is suspended (so-called stop-the-world collectors). Finally, there are hybrid collectors, such as the generational collector employed by the 1.2 and later JDKs, which use different collection algorithms on different areas of the heap.
When evaluating a garbage collection algorithm, we might consider any or all of the following criteria:
  • Pause time. Does the collector stop the world to perform collection? For how long? Can pauses be bounded in time?
  • Pause predictability. Can garbage collection pauses be scheduled at times that are convenient for the user program, rather than for the garbage collector?
  • CPU usage. What percentage of the total available CPU time is spent in garbage collection?
  • Memory footprint. Many garbage collection algorithms require dividing the heap into separate memory spaces, some of which may be inaccessible to the user program at certain times. This means that the actual size of the heap may be several times bigger than the maximum heap residency of the user program.
  • Virtual memory interaction. On systems with limited physical memory, a full garbage collection may fault nonresident pages into memory to examine them during the collection process. Because the cost of a page fault is high, it is desirable that a garbage collector properly manage locality of reference.
  • Cache interaction. Even on systems where the entire heap can fit into main memory, which is true of virtually all Java applications, garbage collection will often have the effect of flushing data used by the user program out of the cache, imposing a performance cost on the user program.
  • Effects on program locality. While some believe that the job of the garbage collector is simply to reclaim unreachable memory, others believe that the garbage collector should also attempt to improve the reference locality of the user program. Compacting and copying collectors relocate objects during collection, which has the potential to improve locality.
  • Compiler and runtime impact. Some garbage collection algorithms require significant cooperation from the compiler or runtime environment, such as updating reference counts whenever a pointer assignment is performed. This creates both work for the compiler, which must generate these bookkeeping instructions, and overhead for the runtime environment, which must execute these additional instructions. What is the performance impact of these requirements? Does it interfere with compile-time optimizations?
Regardless of the algorithm chosen, trends in hardware and software have made garbage collection far more practical. Empirical studies from the 1970s and 1980s show garbage collection consuming between 25 percent and 40 percent of the runtime in large Lisp programs. While garbage collection may not yet be totally invisible, it sure has come a long way.
The problem faced by all garbage collection algorithms is the same -- identify blocks of memory that have been dispensed by the allocator, but are unreachable by the user program. What do we mean by unreachable? Memory blocks can be reached in one of two ways -- if the user program holds a reference to that block in a root, or if there is a reference to that block held in another reachable block. In a Java program, a root is a reference to an object held in a static variable or in a local variable on an active stack frame. The set of reachable objects is the transitive closure of the root set under the points-to relation.
The most straightforward garbage collection strategy is reference counting. Reference counting is simple, but requires significant assistance from the compiler and imposes overhead on the mutator (the term for the user program, from the perspective of the garbage collector). Each object has an associated reference count -- the number of active references to that object. If an object's reference count is zero, it is garbage (unreachable from the user program) and can be recycled. Every time a pointer reference is modified, such as through an assignment statement, or when a reference goes out of scope, the compiler must generate code to update the referenced object's reference count. If an object's reference count goes to zero, the runtime can reclaim the block immediately (and decrement the reference counts of any blocks that the reclaimed block references), or place it on a queue for deferred collection.
Many ANSI C++ library classes, such as string, employ reference counting to provide the appearance of garbage collection. By overloading the assignment operator and exploiting the deterministic finalization provided by C++ scoping, C++ programs can use the string class as if it were garbage collected. Reference counting is simple, lends itself well to incremental collection, and the collection process tends to have good locality of reference, but it is rarely used in production garbage collectors for a number of reasons, such as its inability to reclaim unreachable cyclic structures (objects that reference each other directly or indirectly, like a circularly linked list or a tree that contains back-pointers to the parent node).
None of the standard garbage collectors in the JDK uses reference counting; instead, they all use some form of tracing collector. A tracing collector stops the world (although not necessarily for the entire duration of the collection) and starts tracing objects, starting at the root set and following references until all reachable objects have been examined. Roots can be found in program registers, in local (stack-based) variables in each thread's stack, and in static variables.
The most basic form of tracing collector, first proposed by Lisp inventor John McCarthy in 1960, is the mark-sweep collector, in which the world is stopped and the collector visits each live node, starting from the roots, and marks each node it visits. When there are no more references to follow, collection is complete, and then the heap is swept (that is, every object in the heap is examined), and any object not marked is reclaimed as garbage and returned to the free list. Figure 1 illustrates a heap prior to garbage collection; the shaded blocks are garbage because they are unreachable by the user program:

Figure 1. Reachable and unreachable objects
Reachable and unreachable objects
Mark-sweep is simple to implement, can reclaim cyclic structures easily, and doesn't place any burden on the compiler or mutator like reference counting does. But it has deficiencies -- collection pauses can be long, and the entire heap is visited in the sweep phase, which can have very negative performance consequences on virtual memory systems where the heap may be paged.
The big problem with mark-sweep is that every active (that is, allocated) object, whether reachable or not, is visited during the sweep phase. Because a significant percentage of objects are likely to be garbage, this means that the collector is spending considerable effort examining and handling garbage. Mark-sweep collectors also tend to leave the heap fragmented, which can cause locality issues and can also cause allocation failures even when sufficient free memory appears to be available.
In a copying collector, another form of tracing collector, the heap is divided into two equally sized semi-spaces, one of which contains active data and the other is unused. When the active space fills up, the world is stopped and live objects are copied from the active space into the inactive space. The roles of the spaces are then flipped, with the old inactive space becoming the new active space.
Copying collection has the advantage of only visiting live objects, which means garbage objects will not be examined, nor will they need to be paged into memory or brought into the cache. The duration of collection cycles in a copying collector is driven by the number of live objects. However, copying collectors have the added cost of copying the data from one space to another, adjusting all references to point to the new copy. In particular, long-lived objects will be copied back and forth on every collection.
Copying collectors have another benefit, which is that the set of live objects are compacted into the bottom of the heap. This not only improves locality of reference of the user program and eliminates heap fragmentation, but also greatly reduces the cost of object allocation -- object allocation becomes a simple pointer addition on the top-of-heap pointer. There is no need to maintain free lists or look-aside lists, or perform best-fit or first-fit algorithms -- allocating N bytes is as simple as adding N to the top-of-heap pointer and returning its previous value, as suggested in Listing 1:

Listing 1. Inexpensive memory allocation in a copying collector
void *malloc(int n) { 
    if (heapTop - heapStart < n)
        doGarbageCollection();

    void *wasStart = heapStart;
    heapStart += n;
    return wasStart;
}

Developers who have implemented sophisticated memory management schemes for non-garbage-collected languages may be surprised at how inexpensive allocation is -- a simple pointer addition -- in a copying collector. This may be one of the reasons for the pervasive belief that object allocation is expensive -- earlier JVM implementations did not use copying collectors, and developers are still implicitly assuming allocation cost is similar to other languages, like C, when in fact it may be significantly cheaper in the Java runtime. Not only is the cost of allocation smaller, but for objects that become garbage before the next collection cycle, the deallocation cost is zero, as the garbage object will be neither visited nor copied.
The copying algorithm has excellent performance characteristics, but it has the drawback of requiring twice as much memory as a mark-sweep collector. The mark-compact algorithm combines mark-sweep and copying in a way that avoids this problem, at the cost of some increased collection complexity. Like mark-sweep, mark-compact is a two-phase process, where each live object is visited and marked in the marking phase. Then, marked objects are copied such that all the live objects are compacted at the bottom of the heap. If a complete compaction is performed at every collection, the resulting heap is similar to the result of a copying collector -- there is a clear demarcation between the active portion of the heap and the free area, so that allocation costs are comparable to a copying collector. Long-lived objects tend to accumulate at the bottom of the heap, so they are not copied repeatedly as they are in a copying collector.
OK, so which of these approaches does the JDK take for garbage collection? In some sense, all of them. Early JDKs used a single-threaded mark-sweep or mark-sweep-compact collector. JDKs 1.2 and later employ a hybrid approach, calledgenerational collection, where the heap is divided into several sections based on an object's age, and different generations are collected separately using different collection algorithms.
Generational garbage collection turns out to be very effective, although it introduces several additional bookkeeping requirements at runtime. In next month's Java theory and practice, we'll explore how generational garbage collection works and how it is employed by the 1.4.1 JVM, in addition to all the other garbage collection options offered by the 1.4.1 JVM. In the installment following that, we'll look at the performance impact of garbage collection, including debunking some performance myths related to memory management.


---------------------------------------------------------------------------------------------------------------------------------------------------------------------