Similarities between debugging a complex network and a complex program

Filed By: Robert Moir

It has long been a belief of mine that designing and building a complex network isn't very different from designing and coding a complex program. Sometimes principles from one discipline can be used to illustrate the hard-to-see areas of the other.

Assertions & Error trapping

When you write a computer program you attempt (well, you should) to debug your code to remove as many errors as possible, and to trap any remaining errors so that they are handled in a controlled manner and the end user never sees them.

An assertion is simply a test to see if something is true before proceeding to the next stage in the execution of a program. If it is true, all is well; if it is false, we have an error situation. In effect you are forcing an error condition to be detected and dealt with by the programmer rather than leaving the error to "fester" and cause unpredictable results elsewhere in a system.
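As a minimal sketch of that idea in Python: the function name and the checks are invented for illustration. The point is only that the bad input is caught where it enters, not somewhere downstream.

```python
def average_latency_ms(samples):
    """Return the mean of a list of latency samples in milliseconds."""
    # Assertions: fail here, loudly, rather than let a bad input surface
    # later as a confusing ZeroDivisionError or a nonsense result.
    assert len(samples) > 0, "no latency samples were collected"
    assert all(s >= 0 for s in samples), "latency cannot be negative"
    return sum(samples) / len(samples)


print(average_latency_ms([12.0, 15.0, 18.0]))  # prints 15.0
```

Calling it with an empty list stops at the first assertion with a message that names the real problem.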

Do we do the same in networking? Sure. Well kind of... how many layers of abstraction would you like?

But consider this: What about a login script or some other mechanism that checks to see if your virus scanner is installed and up to date, allows logins if it is, and either screams for help somehow (bad) or automagically updates the scanner (better) before returning control to the user? The code for that login script won't look like a programmer's assertion but the logic is pretty much there.
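The logic of such a login check might be sketched like this. Everything here is an assumption made for illustration: the seven-day limit, the function name and the three outcomes are not taken from any real login script or scanner product.

```python
from datetime import date, timedelta

MAX_SIGNATURE_AGE = timedelta(days=7)  # assumed policy, purely illustrative


def login_decision(scanner_installed, signatures_dated, today):
    """Decide what a hypothetical login script should do for this machine."""
    if not scanner_installed:
        return "block"   # nothing to update, so all we can do is call for help
    if today - signatures_dated > MAX_SIGNATURE_AGE:
        return "update"  # refresh the scanner first, then allow the login
    return "allow"       # the "assertion" holds, carry on


print(login_decision(True, date(2024, 1, 1), date(2024, 1, 3)))   # prints allow
print(login_decision(True, date(2024, 1, 1), date(2024, 1, 20)))  # prints update
```

It does not look like an assert statement, but it plays the same role: a condition is checked before control passes to the next stage.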

The same things are important for networking engineers trapping networking issues and programmers debugging a program.

  • You should never change a state you are testing simply by the act of testing it. For example, for 'forensic' examination of a computer which has a suspected security breach you should use tools that operate "out of band" and do not change the state of the machine at all. This means that you cannot "boot" the machine and run tests on it from itself.
  • You cannot assume that the absence of an error flag means there is no error; it is possible for an error to occur in a way that subtly avoids the specific objects you are testing. Filling a network or a program up with assertions and error traps is not a substitute for testing.
  • You have to be very careful about testing the 'environment' itself in this way, or better still just avoid doing it. A test that changes the platform it is protecting can also change its own behaviour if it runs on that same platform.

Most bugs are at the junctions.

A junction in this case is any point where a system or a subsystem passes either data or control of a process to another. Junctions are the places where bugs often occur, and regardless of where bugs originate, junctions are certainly where many of them show up.

This applies to programming, where passing data between procedures gives you an unexpected result because of various problems with different functions not doing quite what you expected, or because your data is stored as an integer but you need it as a float or whatever.
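Here is a small sketch of a type mismatch at a junction between two functions. Both functions are hypothetical; the first stands in for anything that hands data over in whatever form it happened to arrive.

```python
def read_timeout_setting(raw_value):
    """Hypothetical config reader: returns whatever type the file held."""
    return raw_value


def wait_budget_seconds(timeout, retries):
    """Total time we are prepared to wait, in seconds."""
    # Check the hand-off at the junction instead of trusting the caller.
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)):
        raise TypeError(f"timeout must be a number, got {type(timeout).__name__}")
    return float(timeout) * retries


print(wait_budget_seconds(read_timeout_setting(2), 3))  # prints 6.0
```

Without the check, a timeout that arrived as the string "2" would not fail at all: in Python, "2" * 3 quietly evaluates to "222", and the bug would show up much later and somewhere else.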

With networks, the bugs that occur at the junctions between systems are usually authentication problems and things like that.

These problems are important because they are potential security problems, which all too often end up getting "fixed" by a system administrator in a hurry who is forced into opening up access to a system more widely than they should. This is obviously a case of a short term gain that can come back and haunt you with serious long term results.

Less obvious might be delays and bottlenecks in the junctions between systems, such as a web application running on a web server that needs to ask an overworked domain controller to perform authentication. This won't show up as a problem in testing because it will produce the expected results, but under heavy use you might see random problems where the odd authentication request times out, or hear users complaining about slow-downs.
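One way to make that kind of delay visible is to time the call across the junction as well as checking its result. The sketch below is only an illustration: fake_authenticate is a stand-in for a real authentication request, and the thresholds are arbitrary assumptions.

```python
import time


def timed_call(func, *args, slow_after_s=0.5):
    """Run func and return (result, elapsed_seconds, was_slow)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed > slow_after_s


def fake_authenticate(user):
    """Stand-in for a request to a domain controller (hypothetical)."""
    time.sleep(0.05)  # simulated round trip
    return user == "alice"


ok, elapsed, slow = timed_call(fake_authenticate, "alice", slow_after_s=0.01)
print(ok, slow)  # prints True True
```

The answer is correct, so a test that only checks the result would pass; it is the second flag that reveals the junction is slower than we are prepared to accept.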

Fixing a bugged system is cheaper the earlier you do it.

Always fix bugs in the current part of your project before starting a new part. Every time you build part of a network system or a computer program on modules you've created earlier, you are making assumptions about the robustness, behaviour and characteristics of that previous bit of work.

This is obvious, but what might not be so obvious at first is that when you are doing this, you are also building on any bugs in the older work. To re-use an example I suggested earlier: if I have an authentication issue I cannot solve between my SQL server and my IIS server, and I decide to fix that by making the user that IIS uses to talk to SQL a SQL admin, then anything I build on top of that is going to assume this is the case.

If I stop doing any further work as soon as I find out about my SQL server authentication problem and stay with it until it's solved, then the only bit of my system that has to be changed to reflect the fix is the broken module itself.

If I don't stop, and instead apply a 'quick and dirty' fix and move on to other things, then when I do get around to solving the problem I am probably going to find that most of the things I did after encountering that bug and applying that fix are also suspect. They'll certainly need re-testing and may require a substantial bit of extra work in order to fix them up.

Beware of "free features".

  • There is no such thing as a free lunch; features always cost more than you think.
  • Always take the time to understand unexpected phenomena, as your 'free feature' may rely on a bug elsewhere.

Simplicity is your friend.

I've always believed a well engineered system is one that is as simple as it can possibly be, but no simpler.

A system that is designed and built to be "simple" will run well under all circumstances, will generally be easier to understand and hence easier for others to expand or fix later as requirements change.

Always test the smallest changes.

The programming world is rife with stories of how a small change to a tiny bit of code managed to break a totally unconnected feature 'miles' away. I've already mentioned in this article how a small change to a domain controller that delays its response to authentication requests could cause a web app to fail elsewhere, simply because the account it's running under can no longer be authenticated in a timely manner. NEVER assume.

If something looks "untidy" and the person who made it that way doesn't normally work in that manner, then assume it might be "untidy" for a reason - so if you tidy things up you must test again to see if the untidy setup was actually a bugfix of some kind. Of course if you work with people whose work is always untidy then you already know that touching anything they've done is likely to bring a house of cards crashing down around you!

Things happen for a reason

This should be obvious, yet I find myself needing to go back to it. Computers are logical. Computers don't do things "just because". They don't "have bad days".

If something happens that seems really, really unlikely, then either you have misunderstood the problem, or your test case isn't catching all of the variables. I won't labour the point. I'll just ask you to read the really really unlikely story that I linked to above. Firstly, it's a good story, and secondly, reading the story and its FAQ will give you far more insight than anything else I could say here.
