In the odd world that is Twitter, #fail is a tag you put on your tweet
when something goes wrong in your life, at your job, or when flying your
least favorite airline. What do you do to avoid #fail in your
storage infrastructure?

The most important thing you can do when dealing with storage failure is
to make sure you are prepared for something to go wrong before it ever
happens.
You're in IT. It is not a matter of if something will fail; it is a
matter of when. The number one thing you can do to make sure you are
prepared for a failure is to know what you have in that infrastructure.
Whether you try to fix the problem yourself or bring in an expert, the
first thing anyone is going to ask for is an inventory of what you have
so diagnosis can begin.
An inventory is not the latest copy of the data center diagram you have
spent hours on. While that is a good start, it does not give the detail
someone is going to need to begin diagnosing the problem. What is needed
is the detailed configuration of every HBA, switch port, and
inter-switch link (ISL), along with how the storage ports, and of course
the storage itself, are configured.
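To make that concrete, here is a minimal sketch, in Python, of what one point-in-time record in such an inventory might look like. The device names, field names, and values are hypothetical, chosen only for illustration; a real collector would walk every HBA, switch port, ISL, and storage port in the fabric through vendor APIs or SNMP rather than hand entry.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class PortConfig:
    device: str          # switch or array the port lives on (hypothetical names)
    port: str            # e.g. "slot1/port8"
    speed_gbps: int
    role: str            # "hba", "isl", or "storage"
    zoning: list[str] = field(default_factory=list)

@dataclass
class InventorySnapshot:
    taken_at: str        # timestamp so snapshots can later be compared
    ports: list[PortConfig]

# Illustrative values only; a real tool would enumerate the whole fabric.
snapshot = InventorySnapshot(
    taken_at=datetime.now(timezone.utc).isoformat(),
    ports=[
        PortConfig("switch-a", "slot1/port8", 16, "isl", ["zone_prod_db"]),
        PortConfig("array-1", "ctrl0/fc2", 16, "storage", ["zone_prod_db"]),
    ],
)

# Persist each snapshot with its timestamp so later diffs are possible.
with open("inventory-snapshot.json", "w") as f:
    json.dump(asdict(snapshot), f, indent=2)
```

The point of the timestamp is that an inventory is only useful relative to when it was taken, which is exactly where spreadsheets fall down.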
It is also best if this information is captured frequently, preferably
in real time by some sort of analysis tool (in other words, not in a
spreadsheet). Spreadsheets are not IT diagnostic tools. We've seen
troubleshooting projects where the inventory spreadsheet was more than
six months old, not updated since before the server virtualization
project was started. Things had changed. Candidly, if your inventory is
more than a few weeks old, especially in a virtualized environment, you
probably shouldn't bother having one. A re-inventory is going to have to
be performed, so you are better off just budgeting for that every time a
problem arises in the environment.

The value of real-time capture is that it shows what was changing in the
environment in the time leading up to the failure event, and those
changes often point to what went wrong. These tools can also capture the
physical errors being logged by the system, which provides further
insight. Most importantly, real-time capture can help you prevent a
#fail before it ever happens.
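As an illustration of how change capture aids diagnosis, here is a minimal sketch that compares two of the JSON snapshots produced by the earlier example and reports what changed between them. The file names are hypothetical, and commercial tools do this continuously rather than on demand, but the principle is the same.

```python
import json

def load_ports(path: str) -> dict:
    """Index a snapshot's ports by (device, port) for easy comparison."""
    with open(path) as f:
        snap = json.load(f)
    return {(p["device"], p["port"]): p for p in snap["ports"]}

def diff_snapshots(before_path: str, after_path: str) -> None:
    before, after = load_ports(before_path), load_ports(after_path)
    for key in after.keys() - before.keys():
        print(f"ADDED   {key}")
    for key in before.keys() - after.keys():
        print(f"REMOVED {key}")
    for key in before.keys() & after.keys():
        if before[key] != after[key]:
            print(f"CHANGED {key}: {before[key]} -> {after[key]}")

# Hypothetical file names: compare the last snapshot taken before the
# failure with the most recent one to see what was changing.
diff_snapshots("inventory-before.json", "inventory-after.json")
```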
The problem with most infrastructure hardware, storage hardware, and
their software components is not that they provide too little diagnostic
information, but that they provide too much, and as a result the
important information is lost in the shuffle. What these tools can do is
highlight when a message really needs your attention, or when a
combination of slightly related messages is indicative of a failure.
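To show the idea, here is a minimal sketch of that kind of correlation, assuming hypothetical event formats, port names, and thresholds: a few CRC errors plus a link reset on the same port inside a short window get surfaced as a single alert instead of being lost among thousands of routine messages.

```python
from collections import defaultdict

# Hypothetical parsed log events: (timestamp_sec, port, message_type).
# Individually each event looks minor; together they suggest a failing link.
events = [
    (100, "switch-a:port8", "crc_error"),
    (160, "switch-a:port8", "crc_error"),
    (220, "switch-a:port8", "link_reset"),
    (300, "switch-b:port2", "crc_error"),
]

WINDOW_SEC = 300   # correlate events that occur within a five-minute window
THRESHOLD = 3      # how many related warnings on one port trigger an alert

def correlate(events):
    by_port = defaultdict(list)
    for ts, port, kind in sorted(events):
        by_port[port].append((ts, kind))
    for port, hits in by_port.items():
        # Count events on this port that fall inside one sliding window.
        for i, (start, _) in enumerate(hits):
            in_window = [k for t, k in hits[i:] if t - start <= WINDOW_SEC]
            if len(in_window) >= THRESHOLD:
                print(f"ALERT {port}: {len(in_window)} related events "
                      f"({', '.join(sorted(set(in_window)))}) within "
                      f"{WINDOW_SEC}s")
                break

correlate(events)
```

Run against the sample data, only switch-a:port8 crosses the threshold; the lone error on switch-b:port2 stays quiet, which is the filtering these tools are supposed to provide.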
There is plenty more to do beyond developing an accurate inventory to
help get through a storage failure, but knowing what you have is a
critical first step.