Object instantiation and heap size in ColdFusion
I have been experimenting lately with an indexing routine for Apache Solr. The application I am using to run the routine is written in Model-Glue 2 and heavily leverages a component model for data access and business logic.
Solr allows a user to add documents to its index via HTTP POST of an XML document in a specified format. The engine allows batch adds by concatenating multiple documents into a single XML document, with a <doc> element for each document to be added to the index.
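To make the batch format concrete, here is a minimal sketch in Java (not the CFML from the application) that assembles an `<add>` payload with one `<doc>` per record. The class name `SolrBatchXml`, the helper names, and the field names `id` and `title` are assumptions for illustration; a real index defines its own fields in its schema.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: builds the <add><doc>...</doc></add> payload used for batch adds.
// Field names below are hypothetical; they must match the Solr schema in practice.
class SolrBatchXml {

    // Escape characters that are unsafe in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Each map is one document; each entry becomes <field name="key">value</field>.
    static String buildAddXml(List<Map<String, String>> docs) {
        StringBuilder xml = new StringBuilder("<add>");
        for (Map<String, String> doc : docs) {
            xml.append("<doc>");
            for (Map.Entry<String, String> field : doc.entrySet()) {
                xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
                   .append(escape(field.getValue()))
                   .append("</field>");
            }
            xml.append("</doc>");
        }
        return xml.append("</add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "1");
        doc.put("title", "Heap & GC");
        System.out.println(buildAddXml(Arrays.asList(doc)));
    }
}
```

The resulting string is what would be sent as the body of the HTTP POST to Solr's update handler.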
When I first wrote the indexing routine, I created an object for each document record from my database and passed the objects in an array to the manager code for assembly into an XML document for indexing. I wrote the routine so that when the array hit 100 elements it would fire the manager to assemble the document. I believed that I could limit the heap size this way, by allowing Java to garbage collect objects that had been used and discarded.
What I found instead was that none of the objects were garbage collected during the request, and my application eventually hit the heap size wall and died. I can only assume that objects instantiated during a request are not garbage collected until after that request has processed, even if the objects are discarded and never used again. I even tried explicit gc() calls, but that only made the application die faster with a "gc overhead limit reached" message. In those early tests, my document indexer never made it past 800 documents on an application with a 512 MB heap.
Today I decided to change the processing of documents. First, I created a new XML document with the root node required for indexing. I then created a new object for each document record from the database and called a routine to add each document to the XML document in serial fashion. I used the same variable name for the object, but I instantiated it for each document. At each 100 documents, I posted the XML document to the search engine for indexing. Using this routine, the heap maxed out at 424 MB while indexing 5,000 documents, then dropped back to 67 MB after indexing finished. The process worked, but it still experienced a big jump in the heap size during processing.
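The append-then-post-every-100 pattern described above can be sketched as follows. This is an illustration in Java, not the CFML routine itself: `BatchIndexer` is a hypothetical name, and the `Consumer<String>` passed to it is a stand-in for whatever code performs the HTTP POST to Solr, so the example runs without a search engine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch only: accumulates <doc> fragments and hands off a payload every batchSize documents.
class BatchIndexer {
    private final int batchSize;
    private final Consumer<String> poster; // hypothetical stand-in for the POST to Solr
    private StringBuilder xml = new StringBuilder("<add>");
    private int pending = 0;

    BatchIndexer(int batchSize, Consumer<String> poster) {
        this.batchSize = batchSize;
        this.poster = poster;
    }

    // docXml is an already-built <doc>...</doc> fragment.
    void add(String docXml) {
        xml.append(docXml);
        pending++;
        if (pending == batchSize) {
            flush();
        }
    }

    // Post whatever has accumulated, then start over with a fresh buffer.
    void flush() {
        if (pending == 0) {
            return;
        }
        poster.accept(xml.append("</add>").toString());
        xml = new StringBuilder("<add>"); // the old buffer is no longer referenced from here
        pending = 0;
    }

    public static void main(String[] args) {
        List<String> posted = new ArrayList<>();
        BatchIndexer indexer = new BatchIndexer(100, posted::add);
        for (int i = 1; i <= 250; i++) {
            indexer.add("<doc><field name=\"id\">" + i + "</field></doc>");
        }
        indexer.flush(); // send the final partial batch
        System.out.println("batches posted: " + posted.size());
    }
}
```

Replacing the buffer after each post only makes the old one eligible for collection; when the JVM actually reclaims it is up to the collector, which is consistent with the heap still climbing during the run.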
Next, I instantiated a single object to hold my documents, inited the object with the document id and retrieved the document from the database, then passed it to the manager to add to the XML document. My hypothesis was that by re-using the same object over and over, I could minimize the growth in the heap.
Turns out my hypothesis was dead wrong. Not only did re-using the object not help, it seems to have hurt. The heap maxed out in the CF Server Monitor at 440 MB instead of 424 MB, and the system actually generated a 500 out of heap space error, although it did finish processing all 5,000 documents. Even worse, the heap didn't drop after indexing finished, and I had to manually run gc() from the Server Monitor to drop the heap back down.
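The heap figures above came from the CF Server Monitor. As a rough cross-check, the JVM's own counters can be read from code through `java.lang.Runtime`. The sketch below is an assumption about how one might log them, with `HeapProbe` as a hypothetical name; note that `System.gc()` is only a request, and the JVM is free to ignore it.

```java
// Sketch only: reads the JVM's heap counters via java.lang.Runtime.
class HeapProbe {

    // Bytes currently in use: heap the JVM has claimed minus the part of it that is free.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long maxMb = toMb(Runtime.getRuntime().maxMemory()); // upper bound, e.g. set by -Xmx
        System.out.println("used before: " + toMb(usedBytes()) + " MB of " + maxMb + " MB");

        byte[][] junk = new byte[64][];
        for (int i = 0; i < junk.length; i++) {
            junk[i] = new byte[1024 * 1024]; // allocate roughly 64 MB in total
        }
        System.out.println("used after allocating: " + toMb(usedBytes()) + " MB");

        junk = null;  // drop the only reference so the arrays become unreachable
        System.gc();  // a hint, not a guarantee
        System.out.println("used after gc request: " + toMb(usedBytes()) + " MB");
    }
}
```

Logging these numbers at each batch boundary would show whether memory is being reclaimed mid-request or only after the request ends.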
The image below shows the heap size graph for the test using a new object for each document on the left and the test re-using the same object for each document on the right. Notice how the heap size drops periodically in the first test. I can only surmise that gc was able to remove objects that had been processed and were no longer needed, although that is pure speculation.


thread would need to be dropped before the memory could be collected - right? Have you tried experimenting with Cfthread?