A simple plan: open source provisioning software


The only real problem with being unemployed (aside from the whole money thing) is that you have far too much time to think up new, time-consuming projects. If you're in this industry and your mind isn't twisting and crunching every time some fresh topic flies by your eyes, then you're probably doomed to writing COBOL accounting apps or Visual Basic tools for grandma.

So I'm sitting here, working on my book, articles for two magazines, and trying to port my open source apps to GTK+ 2.6, all while I send out a thousand resumes, when a new idea strikes me. You see, at my past two jobs I got the opportunity to work on blade systems – those next-gen pizza boxes that take up less room and energy and produce less heat than conventional servers. So they say. They are the heir apparent to those rooms of PCs crunching frames of Titanic. They're cool to play with, but they're still just PCs on a stick. A bunch of EPIA-Ms stacked and turned on their side could probably do the same thing, albeit with a lot more wiring.

Anyway, two jobs ago I built a test harness for factory burn-in testing of blade systems. It was a fun project and the first that let me build a serious server. It consisted of four parts: standalone tests, a client that ran on the blade (on Linux), a server, and a remote GUI. Tests were run as “flows”, with each flow consisting of one or more standalone tests or test groups. A test group was a set of standalone tests run in parallel. The GUI was used to set up, start, and monitor the test flows. A test flow could be set up for individual blades or a complete chassis (aka shelf) of blades. The test configurations were sent to the server, which doled out the work assignments to each client. The client was pretty dumb. It only knew how to register itself with the server at boot time and then do a few simple things like start and stop tests and forward status information from the tests back to the server.
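
To make the flow/group distinction concrete, here's a rough sketch in Python. The names (Test, TestGroup, Flow) are made up for illustration – the real harness didn't look like this – but the semantics match: a flow runs its steps in sequence, and a group runs its tests in parallel.

    # A rough sketch of the flow/group model. Names are invented for
    # illustration; a flow runs in sequence, a group runs in parallel.
    import threading

    class Test:
        def __init__(self, name, func):
            self.name = name
            self.func = func              # the standalone test to execute

        def run(self):
            self.func()

    class TestGroup:
        """A set of standalone tests run in parallel."""
        def __init__(self, tests):
            self.tests = tests

        def run(self):
            threads = [threading.Thread(target=t.run) for t in self.tests]
            for t in threads:
                t.start()
            for t in threads:
                t.join()                  # done when every member test is

    class Flow:
        """One or more tests or test groups, run in sequence."""
        def __init__(self, steps):
            self.steps = steps

        def run(self):
            for step in self.steps:       # each step: Test or TestGroup
                step.run()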

The whole design was pretty cool. Certainly one of the more complex designs I've done. But it had lots of problems. The worst was that the server was monolithic and could only handle about 70 blades total before it started losing track of them. Had I seen Dan Kegel's article on the C10K problem first, my design would have been very different. Live and learn.
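
The C10K lesson, in a nutshell: don't dedicate a thread or process to every blade; multiplex all the connections through one event loop. A minimal sketch using Python's standard selectors module – purely illustrative, nothing like my original server:

    # One event loop multiplexing every blade connection – the C10K-aware
    # shape my server should have had. Illustrative only.
    import selectors
    import socket

    sel = selectors.DefaultSelector()

    def accept(server_sock):
        conn, _addr = server_sock.accept()
        conn.setblocking(False)
        sel.register(conn, selectors.EVENT_READ, read)

    def read(conn):
        data = conn.recv(4096)
        if data:
            pass                          # hand status to the app layer
        else:
            sel.unregister(conn)          # blade went away
            conn.close()

    server = socket.socket()
    server.bind(("", 9000))
    server.listen(1024)
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    while True:
        for key, _events in sel.select():
            key.data(key.fileobj)         # dispatch: accept() or read()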

At my next job we did something slightly different. We built a management system for the blades. It provided a ton of features: provisioning new software to the blades, monitoring system status, configuring scheduled tasks, etc. Cool idea. Bad design. First, it was tied to the hardware. There were all kinds of hooks to handle company-specific features. And those hooks, while extremely useful, got in the way of long-term general support. Customers were demanding that the system manage not just the blades but also the other hardware in their labs and production rooms.

A bigger problem was that it didn't scale well. Written primarily in PHP, it choked after a few hundred blades. It could actually get close to 1,000 blades, but not simultaneously. And the problem was entirely due to a monolithic design. Everything came through and was managed by a single server (a web server, granted, with multiple instances running, but that didn't help). The original software was never architected; it was designed essentially on the fly. That worked for early prototypes, but no one ever sat down and said “This is what's wrong, and here are some architectures to fix it.” Instead, they patched it until the next version of the software got rewritten from scratch – and also designed on the fly.

So my idea was to combine these two projects. Take the lessons learned from my test harness regarding the C10K problem and integrate them with the lessons learned from the blade management project. The result is a generic system for large-scale system management, scalable to 10,000 blades. The trick to all of this will be delegation of duties. And I don't mean to project members, I mean to software. The central manager does very little other than act as a traffic cop for messaging, and even that only at a high level. Lower-level processing of specific duties (provisioning, monitoring, etc.) is delegated to servers that handle that specific task. The entire system has to be modularized in a way that lets any component move to a different piece of hardware and still function. That means when provisioning is getting hammered it can be moved to a dedicated server or even a group of servers.
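
To sketch what I mean by traffic cop (all names invented): the central manager keeps little more than a table mapping each duty to the server currently handling it, so moving a hammered duty to dedicated hardware is just a table update.

    # Hypothetical traffic-cop manager: it routes messages by duty and
    # knows nothing about how any duty is actually performed.
    TASK_SERVERS = {
        "provision": "provision-1.lab:7001",
        "monitor":   "monitor-1.lab:7002",
        "schedule":  "monitor-1.lab:7003",   # duties can share a box...
    }

    def route(message):
        """Return the address of the server that owns this message's duty."""
        return TASK_SERVERS[message["duty"]]

    # ...and when provisioning gets hammered, move it to dedicated iron:
    TASK_SERVERS["provision"] = "provision-farm.lab:7001"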

Scaling. It's what both of those projects were lacking. And now I understand how to do it.

Another important fact here is that my test harness didn't really care that it was testing blades. And it shouldn't. The blade is just another computer. Does it matter if you're testing 3,000 blades or 1,000 ProLiant servers or 100 CLIÉs or 250 Nokia handsets? Not really. One end is the target, the other the management system. Hardware specifics should be handled by software that either lives in the images provisioned to the targets or is provisioned to the target after a generic image is loaded (which is probably preferred). The design has to be generic so any bit of hardware can be added, and the functionality has to be modularized so new features can be dropped in later.
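
One way to keep the core generic, sketched here with made-up names: the management system only ever talks to a Target interface, and the hardware-specific pieces register themselves as drop-in modules.

    # Illustrative plugin registry (all names invented): the core only sees
    # the Target interface; hardware-specific modules register themselves.
    DRIVERS = {}

    def register(hw_type):
        def wrap(cls):
            DRIVERS[hw_type] = cls
            return cls
        return wrap

    class Target:
        def provision(self, image):
            raise NotImplementedError

    @register("blade")
    class BladeTarget(Target):
        def provision(self, image):
            print("PXE-booting blade with %s" % image)

    @register("handset")
    class HandsetTarget(Target):
        def provision(self, image):
            print("flashing handset with %s" % image)

    # New hardware means one new module, zero changes to the core:
    target = DRIVERS["handset"]()
    target.provision("generic-arm-image")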

And a test is just another task. It doesn't matter if you're testing hardware or running a mail server. From the management perspective they're just tasks to be doled out to targets. This is actually something both of those systems got right. They just both tried to manage it all monolithically. If that's even a word.
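
In code terms, the manager's whole view of the world could be this small (hypothetical, of course):

    # From the manager's perspective, a burn-in test and a mail server are
    # the same thing: a task it can start, stop, and poll on some target.
    # Hypothetical interface, not from either real system.
    class Task:
        def start(self, target): ...
        def stop(self, target): ...
        def status(self, target): ...

    class BurnInTest(Task): ...     # factory burn-in
    class MailServer(Task): ...     # production service – same interface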

The high-level view

A custom registration server and client are required, but those are pretty easy. They don't need to be incredibly bright. Blades PXE boot into a registration image that tells the server “I'm here – tell me what to do.” The registration server (itself a series of servers, like Apache) will hand over duties to chassis-assigned servers, which then assign task servers for individual blades. Blades feed status to task servers, which feed it to chassis servers, which feed it to the GUIs – and a GUI finds out where to retrieve the status it wants by asking the registration server.
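
The registration exchange might look something like this – an invented wire format, just to pin the idea down:

    # An invented registration exchange. The blade announces itself; the
    # registration server answers with its chassis and task server.
    import json

    hello = json.dumps({
        "msg":     "register",
        "mac":     "00:0e:0c:12:34:56",   # hypothetical blade identity
        "chassis": 4,
        "slot":    11,
    })

    assignment = json.dumps({
        "msg":         "assigned",
        "chassis_srv": "chassis-4.lab:8000",   # who this blade reports to
        "task_srv":    "task-17.lab:8100",     # who hands it work
    })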

Registration and downstream messaging (servers to blades to tests) are request/reply oriented. Upstream messaging is connectionless and unsolicited, up to the server that holds the status information. GUIs make downstream requests to get status for display. Pretty basic two-way communications. There is just going to be a ton of it. Yeah, yeah – the types of messages will get complex. That's protocol stuff. But it's application-layer stuff, and easy to architect. Wrap it in whatever you like – XML, SNMP, or even bubble wrap for all I care. Hell, I'd probably invent my own protocol. What fun is it if you can't invent something new for it?
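
The unsolicited upstream path could be as dumb as fire-and-forget UDP datagrams. A sketch, with an invented message format:

    # Sketch of unsolicited upstream status: a fire-and-forget UDP datagram
    # from a running test to its task server. Format invented.
    import json
    import socket

    def send_status(task_srv, test_id, state):
        msg = json.dumps({"msg": "status", "test": test_id, "state": state})
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.sendto(msg.encode(), task_srv)   # no connection, no reply
        sock.close()

    send_status(("127.0.0.1", 8100), "memtest-3", "running")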

And all this messaging has to go to the right places. Keeping all the destination IDs organized will be left to a hierarchical design. The master server knows about the next-level chassis servers, which know about the next-level task servers, ad nauseam. Sound like DNS? Just like it, but kind of on steroids, with all the extra non-routing data being passed around.
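
The DNS analogy in miniature (all the data here is invented): each level only knows about its children, so a lookup just walks down the tree.

    # Each level knows only its children; a lookup walks the tree.
    # All names and addresses invented for illustration.
    MASTER  = {"chassis-4": "chassis-4.lab:8000"}
    CHASSIS = {"chassis-4": {"blade-11": "task-17.lab:8100"}}

    def resolve(chassis_id, blade_id):
        chassis_srv = MASTER[chassis_id]           # master -> chassis server
        task_srv = CHASSIS[chassis_id][blade_id]   # chassis -> task server
        return chassis_srv, task_srv

    print(resolve("chassis-4", "blade-11"))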

GUI display is another issue – displaying 1,000 servers at a time is meaningless because people can't discern what that means unless the display is huge. We could get into a huge discussion on GUIs, but that's just a detailed-design issue, not a high-level architecture problem.

A database backend is essential. But it's questionable whether a single database should hold all the data or whether task-specific DBs should exist. I'd prefer the latter, because if a task needs to be migrated to its own server, the DB can come with it (or move to a server of its own). Modular and distributed. That's the only way to approach this from the highest point of view.
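
Concretely, “the DB can come with it” could be as simple as each task owning its own connection string (values invented):

    # Invented config: each task owns its own database, so task and DB can
    # migrate to dedicated hardware together.
    TASK_DBS = {
        "provision": "postgresql://db-1.lab/provision",
        "monitor":   "postgresql://db-1.lab/monitor",
    }

    # Moving monitoring (and its data) to its own box is one entry:
    TASK_DBS["monitor"] = "postgresql://db-monitor.lab/monitor"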

So why hasn't anyone written software to deal with this? I can't be the only person who's ever thought of it. It's not rocket science. It's pretty much software integration – lots of pieces already written that just need some custom components. There is a lot of work involved here, but nothing that a team of 5 or 6 couldn't do in a year's time (at least for the core systems and a limited feature set).

Well, somebody has thought about it. A fairly decent article I found on InformIT.com says pretty much the same thing (with more specifics) about managing blade systems. In truth, the things it talks about are applicable to managing any distributed set of computing resources. But it's all just requirements specification. There isn't any detail on how it should work.

But maybe it's just that no one has had time to work on it yet. Maybe everyone is too hung up on legacy network tools to start writing new tools for more generic services. Maybe I should start an open source project to work on it.

Yeah. That's what I need. Another project to work on. I'll just give up food and sleep. I can fit it in.