Colony on Railo - Initial Testing Looks Good

Today we did initial testing of the Colony platform on the open source Railo 3.1 application server. I am pleased to report that so far Colony runs without any issues on Railo. While we expect most customers that deploy CFML applications to continue to use the award-winning Adobe ColdFusion 9 Application Server, we are testing Colony on Railo in order to provide a fully free and open source distribution of the Colony application platform.

Who Owns Your Data?

News broke today that Amazon remotely erased copies of some George Orwell books from Kindle devices. Putting aside the irony of Orwell's books being erased, the question we must all ask ourselves is what the ramifications are of putting our data on devices and systems controlled by others. In our increasingly connected world of always-on, wired and wireless access to everything from our books to our personal information, we should all ask a very central philosophical question - who owns our data?

Whether the issue is Amazon deleting a book you paid for on a device you own, government workers improperly accessing private records, private industry workers improperly accessing records of public figures, companies attempting to assert ownership over data you place in their hands, or hackers stealing data from public and private databases, the twin issues of data ownership and security have become central themes in an emerging threat to the success of the Internet as a trusted medium. We believe that users own their data, and we are working on an exciting new product set that will address these fundamental issues in ways that put data ownership and security in the hands of users.

Microsoft submits virtualization drivers to Linux kernel

In a small but great piece of news for cooperation between the world's largest software company and the Linux community, Microsoft has submitted drivers for the Linux kernel that will help Linux virtual machines running on top of a Windows Hyper-V host to run more efficiently.

Kudos to Microsoft for the move.

Why Wolfram Alpha will not change the world

In case you haven't heard the story about the next "greatest thing since sliced bread", there is a new search engine, of sorts, called Wolfram Alpha. Wolfram is basically a big computation engine with a vast store of data to compute against. The press has been agog with the possibilities of the system, calling it a "Google killer" and more. Pardon me for my skepticism, but I don't think Wolfram will change the world.

Sure, it's a handy thing, a very handy thing, but it has significant limits. Getting data into the system is a manual affair that requires human curation, so from the very start of the project, its owners have a pipeline problem - the breadth of the engine's knowledge is limited by the amount of information their human agents can put into it. Unless they change that model, the pipeline problem never goes away. In fact, the problem only becomes bigger over time as humans generate more and more data for potential inclusion into the system.

Wolfram, in essence, has returned to the original Yahoo model of curated content, only using structured data and a more sophisticated search system. Yahoo gave up on curating data manually - it was just too inefficient and costly compared to automated indexing. While Wolfram will excel at answering the kind of complicated mathematical equations that it specializes in, I don't see it outshining Google.

Google's advantage (and disadvantage, as I have discussed previously) is the sheer volume of information in its index. I can search on Google and generally find an answer to a question within a few clicks. Why is that? In short, Google and other automated indexing engines rely on the millions upon millions of people around the world who contribute content to the Internet in the form of web pages, wikis, blogs, and many others. Wolfram, on the other hand, relies on its own staff of curators to add data to the system.

And let's not forget about Twitter and other social media as the newest form of content contribution. If Twitter embodies Web 2.0, Wolfram Alpha embodies yet another take on Web 1.0. Don't get me wrong, Wolfram will be a boon to people doing some forms of research, but it will never live up to the hype that has been created around it.

Gartner Praises ColdFusion

Kristen Webb Schofield writes that a new Gartner report praises Adobe ColdFusion and recommends that agencies continue their investment in the platform, which they see as enjoying a bright future with Adobe as its steward. You can buy the report from the Gartner web site.

Introducing the Colony application platform

For the past three years, I have been working on and off on an open-source CFML-based Web application. It started out as a simple system to store arbitrary structured and unstructured content. I started using it to build more and more complex Web applications, and over time it grew in size and scope. I thought about it for a while as a content management system, but content management is not what I was aiming for, and not where the platform has really evolved.

After struggling with terminology and purpose, I started thinking about the application as an application platform. What is that? I see it as an implementation of typical application patterns in an integrated package that allows a developer to use it in whole or in part, building on the core libraries to create a new solution.

Once I had the concept clear in my head, I started casting about for a name. After lots of pondering and brainstorming sessions with my colleagues on the CF-Community list, I decided to call the platform Colony. To me, Colony is all about staking out new territory on the Web, building compelling new services, and advancing the state of software.

Colony is also about shared effort and shared reward. To that end, we have just released the platform in alpha under the Apache Software License 2.0. You can get the alpha code and see more about the platform at www.cfcolony.org. The site is graphically challenged and light on content at the moment, but that will change soon.

PulseAudio in Ubuntu 8.10

As of the last couple of releases, popular Linux distro Ubuntu has switched the default sound system for the desktop to the PulseAudio Sound Server. PulseAudio offers the promise of unified, abstracted access to a computer's sound capabilities, and from that standpoint it is a huge advance over previous solutions. However, the move to PulseAudio has been accompanied by a huge amount of frustration among end users with compatibility problems, crashes, and various conflicts with onboard sound, USB sound, and Adobe Flash and AIR.

Recently, I have experienced a number of these issues myself. (See the related posts for background on my transition to using Ubuntu on my desktop). In November I installed Ubuntu 8.10, and I have had some sound issues ever since. In the last few days, I have spent some time debugging these issues on my system, and I wanted to share my challenges and solutions.

Sound Hardware

I have two separate sound systems - onboard sound on my motherboard and a Bose Companion 3 USB sound system. For this type of configuration, the sound troubleshooting guides I have read recommend deactivating the onboard sound and using USB sound only. I tried that for a while, but I had an issue. The Bose system has a microphone jack on the volume control, but Ubuntu seems not to see it; it only sees the output device. I need a microphone, so I enabled the onboard sound for that purpose. I have a headset/mic combo attached to the onboard sound system.

Skype

If you work remotely or travel on a regular basis like me, Skype is a great solution for keeping in touch with people. It offers chat, voice over IP (VOIP) telephony, even video-conferencing. The latest Ubuntu client for Skype works reasonably well, but it tends to seize control of the pulseaudio process and eliminate sound for all other applications. My solution to the problem is to enable onboard sound (by removing any /etc/modprobe.d/blacklist entries and enabling the onboard sound in BIOS), assign USB sound to the first sound source in ALSA:

/etc/modprobe.d/alsa-base

install sound-slot-0 /sbin/modprobe snd_usb_audio

and set USB sound as the default output sink in pulseaudio:

/etc/pulse/default.pa

 .nofail
set-default-sink alsa_output.usb_device_5a7_1020_noserial_if0_sound_card_0_alsa_playback_0 

 Note that your exact settings will depend on your particular hardware.
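
If you need to find the equivalent names for your own hardware, something like the following may help. This is only a sketch and does not come from the guides mentioned here: it assumes that /proc/asound/cards and the output of pacmd list-sinks use their usual layouts, and the function names alsa_cards and sink_names are made up for illustration.

```shell
#!/bin/bash
# Helpers for finding hardware-specific names.
# The function names are illustrative, not part of ALSA or PulseAudio.

# Print "index id" for each ALSA card, reading /proc/asound/cards-style
# text on stdin. Card lines look like (example values):
#   " 0 [Audio          ]: USB-Audio - Some USB Audio Device"
alsa_cards() {
    sed -n 's/^ *\([0-9][0-9]*\) *\[\([^] ]*\) *]:.*/\1 \2/p'
}

# Print sink names, reading `pacmd list-sinks` output on stdin.
# Name lines look like: "    name: <alsa_output.some_device>"
sink_names() {
    sed -n 's/^[[:space:]]*name: <\(.*\)>[[:space:]]*$/\1/p'
}

# Usage on a live system:
#   alsa_cards < /proc/asound/cards
#   pacmd list-sinks | sink_names
```

The card that prints with index 0 should be the USB device if the sound-slot-0 line has taken effect, and one of the printed sink names is what goes after set-default-sink.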

After that, in Skype -> Options -> Sound Devices I set my onboard sound hardware (mine is defined as HDA NVidia (hw:NVidia,0)) as the device for Sound In, Sound Out, and Ringing. The ring will sound in my headset, but that's OK for now. Maybe I'll get one of those Skype phone sets from eBay. This solution works fine and it is the only solution I have found for my system that gets Skype working without taking over the pulseaudio system.

Amarok

I use Amarok for my music library. I have Amarok configured to use the xine engine in Settings -> Configure Amarok -> Engine. This setup works well, though I had to experiment with the settings for PulseAudio (making USB sound the default output sink) to get it to default to USB sound.

Adobe Flash/AIR

I do a lot of development with Adobe tools - ColdFusion and Flex in particular. I use YouTube, etc., so sound support for Flash tends to be important for me. As of right now, I have no sound support in Flash. I have installed the latest Flash 10 plugin (10.0.22.87) from Adobe and followed lots of recommendations from various guides about configuring PulseAudio and troubleshooting Flash sound problems. My best guess right now is that activating the onboard sound has caused a problem with Flash sound support. Since I need Skype more than I need YouTube, I'll be keeping my current configuration for now, but it would be great if Flash could use the PulseAudio system without any problems. 

Even when Flash sound support was working (with a previous Flash plugin), sometimes Flash would have trouble after playing a video, and at that point PulseAudio would lose the USB sound device completely. Killing Firefox (and any AIR applications like Twhirl) and restarting pulseaudio (pulseaudio -k; pulseaudio -D) enabled PulseAudio to find the USB device again.
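
The restart can be wrapped in a small function so it is easy to repeat. This is a minimal sketch: the function name restart_pulseaudio and the one-second pause are my own assumptions, and it expects Firefox and any AIR applications to have been closed first.

```shell
#!/bin/bash
# Restart the per-user PulseAudio daemon.
# The function name and the pause length are illustrative choices.
restart_pulseaudio() {
    # -k kills the running daemon; ignore the error if none is running.
    pulseaudio -k 2>/dev/null || true
    # Give the old process a moment to exit before starting a new one.
    sleep 1
    # -D starts a new daemon in the background.
    pulseaudio -D
}

# Usage: restart_pulseaudio
```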

Miscellaneous Settings

Here are a few things I have set according to the various guides:

/etc/asound.conf

pcm.pulse {
type pulse
}

ctl.pulse {
type pulse
}

~/.asoundrc

pcm.!default {
        type asym
        playback.pcm {
                type plug
                slave.pcm "hw:0,0"
        }
        capture.pcm {
                type plug
                slave.pcm "hw:0,0"
        }
}

I am not 100% sure exactly how all of these settings interact with each other, but it seems for the most part to be a successful setup. Based on the number of reported issues I have seen with Flash sound support, I am going to wait until Ubuntu 9.04 is released in April to see if the problems are resolved.

Ubuntu continues to improve as a desktop OS, but from these experiences, you can see it still has a ways to go to be considered as an easy alternative to OS X or Windows (although the huge number of issues with Vista has certainly provided Ubuntu an opportunity to show its capabilities).

Don't let my experiences discourage you from using Ubuntu. You can run the Ubuntu LiveCD and give Ubuntu a try on your hardware without actually having to wipe out your current OS. If you are seriously considering switching, my advice for now is to either buy a new computer that comes pre-installed with Ubuntu or do your homework and install Ubuntu with supported hardware. 

Google's Mission to Penetrate the Deep Web

Google is building a software program that will conduct searches of public databases on the Web to try to ascertain their contents. The goal behind this move is to index and make available information that is not currently available - like flight schedules and fares, to use an example from the CNet article. This development raises two important questions for consideration. First, are there any legal issues for Google to conduct data mining from public databases? Second, who will pay for the bandwidth and CPU charges for Google's activities?

On the first question, it remains to be seen whether anyone will object on legal grounds to the searches. Google can certainly provide a way for companies to opt out of the searches using standard robot/user agent techniques currently employed to manage search engine crawlers, which may make the legal issues moot. 
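
For illustration, an opt-out of this kind is usually expressed in a robots.txt file at the root of the site. The sketch below simply prints one. Whether Google's database searches will honor the same rules as its regular crawler is an assumption on my part, and the user agent name, the /search path, and the function name robots_txt are examples only.

```shell
#!/bin/bash
# Print a robots.txt that asks one crawler to stay out of one path.
# The defaults "Googlebot" and "/search" are assumptions for illustration.
robots_txt() {
    local agent="${1:-Googlebot}"
    local path="${2:-/search}"
    printf 'User-agent: %s\nDisallow: %s\n' "$agent" "$path"
}

# Usage (the document root shown is hypothetical):
#   robots_txt Googlebot /search > /var/www/robots.txt
```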

On the second question, there is a very real prospect that Google will add significant traffic to a site's search system, potentially costing the company maintaining the site both in bandwidth and server charges. For sites hosted in a cloud environment, those costs could be precisely quantified. So who will pay for the additional traffic? If Google provides an opt-out solution that companies can easily deploy, one could argue that any company that neglects to opt out of the searches is by inference allowing Google to conduct the searches and so agreeing to incur the costs associated with the searches.

On the other hand, one could argue that Google has an obligation to proactively notify companies if it plans to change the way it indexes their systems in a way that may force them to incur additional costs, which effectively takes us back to the first question of legal issues. 

In the bigger picture, Google's move is just a first step in what will inevitably be industry attempts to better expose and share data buried in databases around the world. Though the Semantic Web has so far failed to attract a huge following, we can reasonably expect that either it or some other technology will take hold and begin to shape the next generation of knowledge sharing on the Internet.

Google Mis-step Shows Dangers of Market Dominance

As related by various news outlets, Google today experienced a one hour outage because an employee inadvertently marked the entire Internet as "malware". While the immediate story may be the inconvenience to users everywhere who were unable to use the Google search engine to view web sites during the outage, the bigger - and far more troubling - issue is the power a single company has over such a central component of the Internet, and how that power, in the wrong hands, could lead (intentionally or otherwise) to Very Bad Things happening on the Internet.

 How many people were aware, prior to today, that a single Google employee could label the entire Internet malware with such ease - or at all, for that matter? If there are such poor controls in place at Google that something so damaging could be done by accident, what kind of damage could malicious employees or hackers do on purpose? 

This issue raises the very real and very worrying prospect of the potential for corruption at Google and other companies with such market power. Want to get ahead of your competitors? Forget about buying ads, that's too expensive. Why not try to bribe a Google engineer to finesse your search ranking to the top? Is that possible? What safeguards are in place to prevent it? Does anyone outside Google know? Does anyone inside Google know?

Lastly, this story makes me even more wary of cloud computing. Sure, cloud computing offers some very cool upside - virtually unlimited scaling, relatively simple management of applications (as system hardware and services like clustering are abstracted away from the user), and good pricing. Still, I worry about the downside - that's part of my job, after all. What happens if an employee of the cloud provider makes a mistake that removes a bunch of applications from the cloud? Who is responsible for economic losses sustained because of such a mistake? CIOs might want to read the fine print on that cloud computing contract before jumping in with both feet.

Spamassassin - automating sa-learn with IMAP folders

Among the useful things we have found for our clients is a methodology for building a learning spam filter using Spamassassin and a mail server that supports IMAP folders, such as Dovecot. Simply adding Spamassassin with a standard configuration on incoming mail on a mail server can dramatically decrease the amount of spam users receive, but it will not catch nearly all spam sent to the server.

The reason for the lack of complete filtering is clear. Spammers play a cat-and-mouse game with spam filters, always attempting to modify messages in such a way as to avoid filtering. As filters change, spammers experiment until they find ways through, and they change their tactics as each new technique is detected. 

Because of this uncertainty with spam, Emergent Path recommends that clients who maintain their own mail servers implement the Bayesian filtering engine in Spamassassin and automate the learning process through the sa-learn script.

sa-learn is a command-line program that can be called and passed various arguments to classify messages as either ham (real messages) or spam (fake messages). Because it is a command-line program, it can be easily automated using cron on Unix/Linux systems. We recommend running a daily process on the mail server (depending on volume of mail and number of mail servers involved) that scans user-classified spam and ham using sa-learn to train Spamassassin.

A sample script might look something like the script below. This is a simple example and not necessarily a final production script:

 

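#!/bin/bash

# Train on user-classified spam and ham. --no-sync defers the database update.
sa-learn --showdots --no-sync --spam /var/mail/domains/*/*/Maildir/.MakeSpam/cur/
sa-learn --showdots --no-sync --ham /var/mail/domains/*/*/Maildir/.MakeHam/cur/

# Write the deferred changes to the Bayes database.
sa-learn --sync

# Remove the processed messages; -f avoids an error when a folder is empty.
rm -f /var/mail/domains/*/*/Maildir/.MakeSpam/cur/*
rm -f /var/mail/domains/*/*/Maildir/.MakeHam/cur/*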

In this example script, each user who wants to tag spam creates an IMAP folder in the root of their account called MakeSpam. (The example assumes a typical mail directory structure of /var/mail/domains/<domain_name>/<account>/Maildir/ for the root location of each user's mail folders.) For any spam messages that got through filtering to the inbox, the user drags those messages to the MakeSpam folder and leaves them. When the script above runs (via cron on the server), the messages will be classified as spam and then deleted. Over time this system will help Spamassassin improve its hit rate on spam messages.
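
One possible refinement, sketched here under some assumptions, is to delete messages only from folders where sa-learn reported success, so that a failed training run does not throw away the users' classifications. The MAIL_ROOT and SA_LEARN variables, the function names, and the --run switch are all hypothetical and exist only for this illustration; the sa-learn options are the same ones used in the sample script, plus --sync.

```shell
#!/bin/bash
# A more defensive variant of the training script. A sketch, not a
# production script. MAIL_ROOT and SA_LEARN are illustrative variables.
MAIL_ROOT="${MAIL_ROOT:-/var/mail/domains}"
SA_LEARN="${SA_LEARN:-sa-learn}"

# learn_folder <spam|ham> <folder name>
# Train on each user's folder, deleting messages only if sa-learn succeeded.
learn_folder() {
    local kind="$1" folder="$2" dir
    for dir in "$MAIL_ROOT"/*/*/Maildir/."$folder"/cur; do
        # Skip unmatched globs and empty folders.
        [ -d "$dir" ] || continue
        [ -n "$(ls -A "$dir")" ] || continue
        if "$SA_LEARN" --no-sync "--$kind" "$dir"; then
            find "$dir" -type f -delete
        else
            echo "sa-learn failed for $dir; messages kept" >&2
        fi
    done
}

train_all() {
    learn_folder spam MakeSpam
    learn_folder ham MakeHam
    # --no-sync defers the database update, so sync once at the end.
    "$SA_LEARN" --sync
}

# Only train when asked to, so the file can be read or sourced safely.
if [ "${1:-}" = "--run" ]; then
    train_all
fi
```

Saved under a name of your choosing, the script could then be called from a daily cron job with the --run argument.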

Manually marking messages as ham that have been previously classified as spam may vary slightly depending on your Spamassassin configuration. If Spamassassin is set up to move spam to a Junk or Spam folder and simply add a header to the message, the user can simply move the message to the MakeHam folder, and when the script runs it will identify those messages as ham (good) and remember those settings for the future. If Spamassassin is set to create a new message and forward the original message as an attachment, the user may need to extract the original message from the attachment and place it in the MakeHam folder.

Automated systems like this one can take time to develop and are sometimes tedious and error-prone to get right and keep right. We always recommend starting small with minimal functionality, proving that functionality over time, and adding to the functionality at a later date.
