Object instantiation and heap size in ColdFusion

I have been experimenting lately with an indexing routine for Apache Solr. The application I am using to run the routine is written in Model-Glue 2 and heavily leverages a component model for data access and business logic.

Solr allows a user to add documents to its index via HTTP POST of an XML document in a specified format. The engine allows batch adds by concatenating multiple documents into a single XML document, with a <doc> element for each document to be added to the index.
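
For illustration, here is a minimal sketch of that batch format, written in Python rather than ColdFusion. The `<add>` wrapper and `<field name="...">` elements follow Solr's XML update format. The helper name `build_add_xml` and the field names `id` and `title` are assumptions made up for the example; real field names have to match the Solr schema in use.

```python
import xml.etree.ElementTree as ET


def build_add_xml(records):
    """Build one Solr <add> message containing a <doc> per record."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            # Each field becomes <field name="...">value</field>.
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(add, encoding="unicode")


# Hypothetical records; these field names are not from the original routine.
sample = [
    {"id": 1, "title": "First document"},
    {"id": 2, "title": "Second & last"},
]
xml_payload = build_add_xml(sample)
```

The resulting string is what would be sent as the body of the HTTP POST. ElementTree escapes special characters such as `&` in field values.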

When I first wrote the indexing routine, I created an object for each document record from my database and passed the objects in an array to the manager code for assembly into an XML document for indexing. I wrote the routine so that when the array hit 100 elements it would fire the manager to assemble the document. I believed that I could limit the heap size this way, by allowing Java to garbage collect objects that had been used and discarded.

What I found instead was that none of the objects were garbage collected during the request, and my application eventually hit the heap size wall and died. I can only assume that objects instantiated during a request are not garbage collected until after that request has finished processing, even if the objects are discarded and never used again. I even tried explicit gc() calls, but that only made the application die faster with a "GC overhead limit exceeded" message. In those early tests, my document indexer never made it past 800 documents on an application with a 512 MB heap.

Today I decided to change the processing of documents. First, I created a new XML document with the root node required for indexing. I then created a new object for each document record from the database and called a routine to add each document to the XML document in serial fashion. I used the same variable name for the object, but I instantiated it anew for each document. After every 100 documents, I posted the XML document to the search engine for indexing. Using this routine, the heap maxed out at 424 MB while indexing 5,000 documents, then dropped back to 67 MB after indexing finished. The process worked, but it still experienced a big jump in the heap size during processing.
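
The post-every-100-documents pattern can be sketched roughly as follows. This is a Python illustration, not the ColdFusion routine described above. The names `batches`, `index_all`, and `post_batch` are hypothetical, and memory behavior in Python says nothing about how the JVM heap behaves under ColdFusion.

```python
from itertools import islice


def batches(records, size=100):
    """Yield lists of at most `size` records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def index_all(records, post_batch, size=100):
    """Send records in batches and return the number of batches posted."""
    posted = 0
    for batch in batches(records, size):
        # post_batch is a hypothetical callback that would build the XML
        # document for this batch and POST it to the search engine.
        post_batch(batch)
        posted += 1
    return posted


# Stand-in callback that only records batch sizes instead of posting.
sent = []
count = index_all(range(250), lambda batch: sent.append(len(batch)))
```

Only one batch is referenced at a time, so earlier batches become eligible for collection once they have been posted. Whether the runtime actually reclaims them during the request is a separate question.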

Next, I instantiated a single object to hold my documents. For each document, I initialized the object with the document id, retrieved the document from the database, then passed the object to the manager to add to the XML document. My hypothesis was that by re-using the same object over and over, I could minimize the growth in the heap.

Turns out my hypothesis was dead wrong. Not only did re-using the object not help, it seems to have hurt. The heap maxed out in the CF Server Monitor at 440 MB instead of 424 MB, and the system actually generated a 500 out of heap space error, although it did finish processing all 5,000 documents. Even worse, the heap didn't drop after indexing finished, and I had to manually run gc() from the Server Monitor to drop the heap back down. 

The image below shows the heap size graph for the test using a new object for each document on the left and the test re-using the same object for each document on the right. Notice how the heap size drops periodically in the first test. I can only surmise that the garbage collector was able to remove objects that had been processed and were no longer needed, although that is purely speculation.

 

BEA Event-driven application server

I have spent the last three days at the Adobe MAX developer and designer conference in Chicago. In the sponsor exhibition pavilion I visited the BEA booth and was surprised to see BEA's real-time application server for complex event processing (CEP) on offer. I didn't see a demo of the product, but I talked with the BEA reps and came away with the impression that they have a very interesting product offering.

CEP applications have a relatively narrow market at the moment, but as more data flows through critical real-time systems, CEP looks set to play a more important role in the world of enterprise IT. CEP already plays a critical role in areas like stock trading analysis. I expect to see big ecommerce sites and health care companies adopt CEP to manage analysis for large scale deployments. Very large scale enterprise search deployments could also benefit from CEP by using real-time analytics on search transactions to manage search system optimization.

Next Generation Web Caching Servers

Tom's Hardware Guide just posted a review of two flash-based hard drives. While the drives have some flaws, such as relatively slow write performance, the read performance in the Web Server benchmark is nothing short of stunning. Though the reviewers recommend not using the drives in servers due to serious limitations in write performance, these drives could provide an incredible boost to performance for Web caching servers running Squid or other proxy caching software. The RAID 0 configuration tops out at just under 5,500 read operations per second, which is amazing. While the size of the drives tested, 32 GB, will limit their use with very large volumes of data, Web sites and applications that serve huge amounts of traffic could replace entire racks of caching servers built on traditional drive arrays with just a couple of servers running these flash-based drives.

Another great use of such high-performance disks would be to store production indexes for enterprise search systems. Applications like FAST ESP and Apache Solr rely on very large indexes to drive superior performance. While deploying the index to flash-based drives would be more time-consuming than with traditional drives, the performance of subsequent reads from the index should blow away that of traditional drives, even very expensive SAN systems.

Open source enterprise search with Apache Solr

If you are interested in enterprise search (massive scaling, faceted searching, and other such goodies), more than likely you have looked at commercial products like the FAST ESP platform from FAST Search and Transfer. ESP is a remarkably deep and broad product, but as with other enterprise-class products, license costs can be considerable. If you would like to take advantage of some of the basic features of enterprise search, but you can't afford to license a commercial product like FAST ESP, you should explore the open source Apache Solr platform.

Solr started as a development project at CNet to enable some of the search/browse features you see on the CNet web site. In 2006, Solr became an incubator sub-project of the Apache Lucene project. Lucene is a search engine toolkit originally written in Java (now with ports to C and C#). On June 6th, 2007, the Solr project team released version 1.2, which adds significant functionality to the core application.

Solr is written in Java and can be run inside any standard Java servlet container such as Tomcat. 

IT Ecosystems - Part I

A little while ago I opened a discussion about IT ecosystems on a technical mailing list to which I belong. I posed four questions and asked for thoughts on the discussion. What follows is the first question, plus commentary from several other people, and my additional thoughts at the end. I will post the next three questions from the discussion in the following three days. I originally posted the questions  all at once.  I have pulled together the answers by question and edited them for brevity and relevance.


1. What role does enterprise search play in the ecosystem? [My previous employer] implemented FAST ESP (www.fastsearch.com) for our main customer web site, but the system is capable of much more. FAST is trying to position their product as a way to consolidate lookups across your business, e.g. if you want to look up a customer record, don't go into your CRM system, go to FAST and search. In part the functionality is database offloading, but it is something else as well. I am trying to find the useful limits of that metaphor. Is anyone else looking at that kind of functionality?

Dave Watts from Fig Leaf Software (www.figleaf.com):

Google has the same kind of approach, using their OneBox functionality. With
a GSA, you can write and upload OneBox modules, which let you integrate
custom application functionality into your standard search interface. For
examples of this, you can do searches on the public Google interface like
this:

movie: xxxxx [zipcode]

and you'll find movie theaters in or near that zip code.

I'm working on OneBox modules right now for things like employee directory
search for our Google enterprise clients. This doesn't really offload
anything from your database in many cases, it just gives the user a quick
way to get to data without having to go to the standard interface for that
data. It's quite limited, and often you'd want your OneBox to give you links
to the actual system, rather than trying to fetch all the results you need
to see, or whatever.

But Google is very big on the idea of search being the user's primary
interface to a lot of things which don't normally fall into the common
conception of search. In a way, search becomes analogous to a command prompt
(!), where the user's ability to remember short key strings lets him do lots
of operations without going through the normal GUI. In some cases, the user
might not have to know anything at all; if you can figure out from the data
pattern what OneBox you want to invoke, that's ideal. For example, if you
type your 10-digit US phone number (with or without dashes) into Google, and
your number is publicly listed, you'll see your phonebook listing, and a
link to your address on Google Maps.

 

Rob Munn:

We're actually moving away from the Google solution. I like what Google does, it just doesn't do some things that we need in the ecommerce arena. The best thing about Google is that the appliances are damn near toasters: plug them in, configure IPs, etc., add URLs and other configuration data, and they just go. FAST is more like a Ferrari: it has 600 hp and more torque than you can believe, but it takes a small team of engineers to keep it running.

Brian Meloche:

For what it's worth, my company bought a Google appliance.  It sits there, basically unused, as our infrastructure team's had all sorts of issues getting it to work with our systems.  They might actually put it up on eBay.  I am hoping that they won't, as it may come in handy during our new intranet project.  I don't know if they've ever tried to use the search against the JDE backend to see what kind of results/performance we get.  I haven't been involved with any of the R&D efforts.

Robi Sen:

Enterprise search can be huge. I have been working lately with some organizations that have rather large systems and databases and are trying to make access to structured and unstructured data not only easier for users but also trying to figure out what data users need to be aware of before they ask. A lot of enterprise search now is not simply just indexing and retrieval of data but also machine learning via concepts like Bayesian analysis, link analysis, etc. I think something I am seeing a lot where I play is not only the concept of being able to search for information I want in a reactive mode but tools that not only guess what else might be useful now but also alert me to data or information that I might need to be more efficient at what I am doing before I ask. A good place to look into how search is being used in more novel ways is Carnegie Mellon University, which has a top-flight machine learning program.

Rob Brooks-Bilson:

Like Dave W, we're using the GSA here.  I know you had it, but decided to move away for various reasons.  Google's OneBox and Feeds approach offers a lot more functionality than was there a year ago.  Add to that adapters for various off-the-shelf CMS systems like SharePoint, and the GSA has a lot more to offer.  In our case, it's currently fitting the sweet spot.

 

Search is huge for the next generation of enterprise IT. The average enterprise today is just scratching the surface when it comes to exposing useful data inside the enterprise.  Various vendors have developed different strategies for dealing with the enterprise data question. Google has leveraged their expertise in Internet search algorithms to create dedicated appliances with additional features to connect to enterprise data repositories. FAST, Autonomy, Endeca, Verity and other players have developed enterprise software packages with features aimed squarely at reaching data embedded in proprietary systems like ERP and CRM packages from Oracle and SAP, among others.

According to Gartner Group, FAST, Endeca, and Autonomy are solidly out in the lead in the enterprise search space. (By way of disclosure, I have experience implementing FAST, Google, and Verity, although not with the newest Verity solutions.) Google, of course, has more money than all of the other players combined and could potentially leverage this advantage into a killer app for enterprise search, though none has emerged as yet. As Robi Sen notes, search can go way beyond pure index and retrieval to provide more intelligence for users. Google clearly understands this power on the public Internet, as is reflected in their strategy to record more and more information about users in an attempt to provide more added value for Internet searches. You don’t have to think too far ahead to imagine that Google might just be planning to add such features to their enterprise search offering in the near future.

The other players are hardly sitting still. FAST already has enough power and flexibility to allow a creative architect to build virtually any solution that can be imagined, though the tools are more complex than the less feature-rich Google offering. Autonomy, Endeca, and Verity lean in the direction of FAST. There are also a host of other search appliance players in the market. I have included only Google here because of their market presence in public Internet search and their ability, like Microsoft, to enter virtually any market without regard to profitability.
