Many moons ago I was involved in a project where I had to gather information from a number of remote offices, process it, and present it in a fixed format for another IT system to consume. The company I was consulting with paid someone to do it manually. Some parts were scripted, but the scripts had a lot of hard-coded variables that made the system brittle. I asked them if they wanted an automated system, and after I described what I had in mind, they gave me the green light.
The first thing I noticed was the hard-coded variables. This company was a national retail chain, and their stores were mostly the same, but like any large IT shop they had different versions of software at different locations, and sometimes the software directories varied. Lots of little things. This whole mess was laid out in the source code, and it was nasty. The person running the system spent more time managing the mess than actually doing any work. I focused on three things:
- Replace as many fixed variables as I could with logic that would fill them in dynamically.
- Assess and deal with common variations like product versions and locations.
- Determine common failures and either handle them in code or generate a *useful* error message. (A short sketch of these ideas follows the list.)
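Here is a minimal sketch of what that looked like in spirit, written in Python with purely hypothetical directory names (the real system and its paths belonged to the company). Instead of hard-coding one location, the script checks the known variations and, if none match, fails with a message that says exactly what it looked for:

```python
import os

# Hypothetical install locations seen across stores; in the original system
# these were hard-coded into each script rather than discovered at run time.
KNOWN_REPORT_DIRS = [
    r"C:\POS\reports",                # current software version
    r"C:\Program Files\POS\reports",  # older installs
    r"D:\POS\reports",                # stores with a second drive
]

def find_report_dir(candidates=KNOWN_REPORT_DIRS):
    """Return the first report directory that actually exists on this machine.

    Rather than assuming one fixed path, check the known variations and fail
    with an error message the operator can act on.
    """
    for path in candidates:
        if os.path.isdir(path):
            return path
    raise FileNotFoundError(
        "No POS report directory found. Checked: " + ", ".join(candidates)
        + ". Verify the POS software is installed or add its path to KNOWN_REPORT_DIRS."
    )
```

The point isn't the specific paths; it's that every fixed value becomes something the script can discover, and every dead end produces a message that tells someone what to fix.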
No one likes failure, but you have to balance the time it takes to build a fully automated, fault-tolerant system against the time it takes to get an adequately fault-tolerant system running. Adequate is good enough to start with; you can always improve it later. In the planning phase, identify the critical points of failure and address them. Ideally, you handle failure gracefully and automatically. If a server could come up with a conflicting IP address, solve that as part of the automation. If a hypervisor could become resource-starved, figure out how to find one that isn't. Those are the likely problems. A corrupted VM image, on the other hand, is unlikely, and there may not be an easy way to recover it automatically, although you could probably restore from backup. In your first pass, focus on the likely problems; you can get fancier in subsequent improvements.
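Handling the resource-starved-hypervisor case, for instance, can be as simple as checking capacity before placing the VM. A rough sketch, assuming you can already pull a host inventory from your virtualization platform's API (the field names below are made up for illustration):

```python
def pick_hypervisor(hosts, vm_mem_gb, vm_cpus):
    """Pick a host that can still fit the new VM.

    `hosts` is a list of dicts like
    {"name": ..., "free_mem_gb": ..., "free_cpus": ...},
    which in a real system would come from your hypervisor inventory.
    """
    candidates = [
        h for h in hosts
        if h["free_mem_gb"] >= vm_mem_gb and h["free_cpus"] >= vm_cpus
    ]
    if not candidates:
        raise RuntimeError(
            f"No hypervisor has {vm_mem_gb} GB RAM and {vm_cpus} CPUs free; "
            "add capacity or retry later."
        )
    # Prefer the host with the most free memory so load is spread out.
    return max(candidates, key=lambda h: h["free_mem_gb"])
```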
For the failures that you can't address, make sure you create meaningful error messages for the user and for IT so they can easily figure out what happened and fix it. Obviously, you want to clean up anything that needs undoing. For example, if you provision a network port and then find you can't use it, you probably want to deprovision that port. Better yet, don't provision anything until you have assembled all the components and verified they are ready to go.
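To illustrate the cleanup idea, here is a self-contained sketch of the pattern: record an undo action for each step as you go, and if a later step fails, run the undo actions in reverse. The provisioning helpers here are stand-ins, not a real IPAM or hypervisor API:

```python
def reserve_ip(request):
    """Placeholder for your IP address management call."""
    return "10.0.0.42"

def release_ip(ip):
    print(f"released {ip}")

def provision_network_port(request):
    """Placeholder for your switch/port provisioning call."""
    return "port-7"

def deprovision_network_port(port):
    print(f"deprovisioned {port}")

def create_vm(request, ip, port):
    """Placeholder for the hypervisor call; raises here to show the rollback."""
    raise RuntimeError("hypervisor out of capacity")

def provision_vm(request):
    """Provision a VM, undoing any partial work if a later step fails."""
    undo_steps = []
    try:
        ip = reserve_ip(request)
        undo_steps.append(lambda: release_ip(ip))

        port = provision_network_port(request)
        undo_steps.append(lambda: deprovision_network_port(port))

        return create_vm(request, ip, port)
    except Exception as err:
        # Roll back in reverse order so nothing is left half-provisioned,
        # then surface an error the operator can actually act on.
        for undo in reversed(undo_steps):
            undo()
        raise RuntimeError(f"Provisioning failed and was rolled back: {err}") from err
```

Calling `provision_vm({})` with these stubs releases the port and the IP, then raises a single error explaining what happened, which is exactly the behavior you want when a real step fails halfway through.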
After a few weeks, I had a pretty good working system. It was stable and reliable, and after running it for a while I was down to tweaking little bits of code to make it better. When the system went into production, the company was thrilled. The admin running it went from managing the mess to doing very little beyond dealing with the few exceptions I couldn't reasonably handle in code. They actually got to work on more interesting stuff.
Then I got a call from the manager telling me that IT was going to bring up a new point-of-sale system based on Windows 3.1 (it was that long ago). The program wasn't going to change, but the directory locations on the POS machines were going to be radically different, and he had forgotten to let me know. The new POS was being rolled out, and could I make sure the reporting system still worked? He'd pay me lots of cash. I asked if he had tried it out yet. He hadn't, because his IT people were convinced that, with the directory changes, the reporting system would break.