System integration, whatever the approach (modular, big bang, regressive, controlled or ad hoc), is often one of the most critical phases of a project. This is amplified when the final product consists of many software modules brought together through a number of non-trivial interfaces. Today we have many different interfaces, ranging from VME, PCI and PC104 buses to Ethernet, fibre, ATM, serial, wireless, Bluetooth and proprietary buses that give the final product the technological edge to make it “tomorrow’s must-have item”.
In such a dynamic, changing world, the fundamental concepts we gleaned from our forefathers, who punched holes in cards to write programs, have remained true. If you don’t know what you’re sending and receiving over an interface, then short of pure luck, or an extended integration phase staffed by ‘tiger teams’ with an infinite budget, don’t expect the problem to be solved quickly. In today’s market, the once simple embedded processor with a few interrupts now has every interface known to man compressed into a single microchip no bigger than your thumbnail. Without a strategy to attack this problem, the harsh reality is that your project is doomed to fail. So let’s take a look at how we can solve this problem.
We need to define rules:
- You need to capture data at interfaces – if you can’t see it, you can’t debug it.
- You need to understand your data in real time – if you need a long-winded process to archive and restore, you may miss the core problem.
- You need to be able to quickly create scenarios in your debug environment to see the response of your system to particular events.
So, in summary: Observe, Understand and Simulate.
If we are observing the system, we want to comprehend it passively, without intruding on its behaviour. After all, to go back to first principles and Heisenberg’s Uncertainty Principle, the act of observing a system can change it.
What do we need to observe in our system?
- System Errors – User Errors, and Kernel Errors
- Events – interrupts, semaphores, process context switch
- Data – messages, messages, MESSAGES!!!
The last of these is, of course, the lifeblood of a well-designed system.
If your system is doing any real-time I/O processing, then the data being processed (which these days can run to gigabits per second) needs a threshold mechanism. When the system is in a standby state, no one is concerned with its behaviour; more often than not, it is when we enter a particular operational state that one or two interfaces ramp up in usage and Murphy’s Law takes over. 80% of our problems exist in 20% of our system. A good start, then, is to be able to remove the 80% we don’t care about. The filtering mechanism in our observation tool will save us time and space and, of course, limit our overhead on the system.
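As a minimal sketch of such a filter, the Python below keeps only the traffic seen on the interfaces of interest while the system is in the states of interest. The `Capture` record, the interface labels and the state names are all assumptions made for illustration; a real tool would apply the same test as close to the capture point as possible so that unwanted data is never stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """One observed message (hypothetical record layout)."""
    interface: str   # e.g. "ethernet" or "serial" (illustrative labels)
    state: str       # system state at the time the message was seen
    payload: bytes


class CaptureFilter:
    """Accept only traffic on chosen interfaces while in chosen states."""

    def __init__(self, interfaces, states):
        self.interfaces = set(interfaces)
        self.states = set(states)

    def accepts(self, capture):
        return capture.interface in self.interfaces and capture.state in self.states


def filter_captures(captures, capture_filter):
    """Return the captures that pass the filter, in their original order."""
    return [c for c in captures if capture_filter.accepts(c)]


traffic = [
    Capture("serial", "standby", b"\x00"),
    Capture("ethernet", "operational", b"\x01\x02"),
    Capture("ethernet", "standby", b"\x03"),
]
kept = filter_captures(traffic, CaptureFilter({"ethernet"}, {"operational"}))
```

Here only the Ethernet message captured in the operational state survives; the standby traffic is discarded before it costs any storage.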
Now that we have our data, we need to be able to comprehend the system. There’s no point letting the system execute for many hours only to realise the problem occurred in the first few minutes of running. Using standard operating system and programming concepts, one can build up knowledge of the system in advance and use it to study the contents of the data. The power of being able to check on a message being transmitted over an interface, and see it decoded into human-readable form, can put any engineer into a state of euphoria. It is almost as if your system were speaking to you, leading you down the path of its issues. At this point, the debugging effort is almost on a par with solving the simplest of crossword puzzles. Just check your IDDs (Interface Description Documents) and walk your way through the problem. Even better, once you have it working, why not archive the trace data and use it the next time you make a change and plan to regression test your system? Suddenly our complex system doesn’t look so complex.
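Decoding against an IDD can be sketched in a few lines of Python using the standard `struct` module. The message layout here (a one-byte ID followed by a fixed-size, big-endian body), the message names and the field names are invented for illustration; a real table would be generated from your own IDDs.

```python
import struct

# Hypothetical IDD table: message ID -> (name, struct format, field names).
# ">" selects big-endian with no padding; I = uint32, H = uint16, B = uint8.
IDD = {
    0x01: ("HEARTBEAT", ">IB", ("uptime_s", "status")),
    0x02: ("SET_MODE", ">H", ("mode",)),
}


def decode(raw: bytes) -> str:
    """Turn one raw message into a human-readable line."""
    if not raw:
        return "<empty message>"
    msg_id, body = raw[0], raw[1:]
    entry = IDD.get(msg_id)
    if entry is None:
        return f"<unknown id 0x{msg_id:02X}, {len(body)} body bytes>"
    name, fmt, fields = entry
    if len(body) != struct.calcsize(fmt):
        return f"<{name}: bad length {len(body)}>"
    values = struct.unpack(fmt, body)
    pairs = ", ".join(f"{key}={value}" for key, value in zip(fields, values))
    return f"{name}({pairs})"


print(decode(bytes([0x01]) + struct.pack(">IB", 3600, 0)))
# prints: HEARTBEAT(uptime_s=3600, status=0)
```

Unknown IDs and wrong lengths are reported rather than raised, since malformed traffic is exactly what an integration engineer wants to see.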
While observing the behaviour of our system is a big step forward in our debugging process, there is still room for improvement when the problem is intermittent or, even worse, occurs at some obscure point in time (perhaps after several hours of execution). How do we address this situation? Perhaps we can have someone stand by the system to ensure that the data is logged at the very instant of the failure? How about some intelligence instead? Introduce the concept of states into your system. Have your debug tool understand what state your system is in, and at each state transition have it store only the required data.
Even better, what if we have a system error and no one is there to debug it? We surely can’t have a source code debugger hooked in from the word go. There are too many places to look and, let’s not forget, we would lose any semblance of real time in the system. Yes, our old friend Heisenberg again. Well, how about if our tool could freeze the system at the point when the error occurred? This would be a system-level breakpoint spanning multiple architectures and processes. The system could still service low-level needs, but at the process level it could be halted, waiting for an engineer to step it through the error with a source code debugger. Debugging our system in this way ensures that we only interfere with it during the erroneous moments.
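One way to picture the state-aware part of such a tool is a recorder that keeps a short rolling history and takes a snapshot only when a watched state transition occurs. The class, state names and message strings below are assumptions for illustration, not a description of any particular product.

```python
from collections import deque


class StateTriggeredRecorder:
    """Keep a rolling message history; snapshot it on watched transitions."""

    def __init__(self, watched_transitions, history=4):
        self.watched = set(watched_transitions)  # e.g. {("operational", "fault")}
        self.recent = deque(maxlen=history)      # rolling pre-trigger window
        self.state = None                        # no state known until first report
        self.snapshots = []                      # list of (transition, messages)

    def on_message(self, message):
        # The oldest entries fall off automatically once the window is full.
        self.recent.append(message)

    def on_state(self, new_state):
        transition = (self.state, new_state)
        if transition in self.watched:
            self.snapshots.append((transition, list(self.recent)))
        self.state = new_state


recorder = StateTriggeredRecorder({("operational", "fault")}, history=3)
recorder.on_state("standby")
recorder.on_state("operational")
for i in range(10):
    recorder.on_message(f"msg-{i}")
recorder.on_state("fault")
```

After this run the recorder holds a single snapshot containing the last three messages seen before the fault. In a fuller tool, the same trigger point is where the system-level breakpoint described above could halt the affected processes.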
Now let’s take the situation where we have an engineer in the lab. We think we know where the problem is, but what can we do? Why not have a smart scripting interface that allows us to quickly mimic an interface and inject scenarios into the system? Imagine interfacing two software sub-systems, one developed internally, the other from an external supplier. Why not use the observation tool as the stub too, or even as a workaround to overcome an error in the integration between the two sub-systems? Simply feed the required data sequence into the modules, and then allow the system to progress and communicate as normal. Need to verify a system thread? Why not just inject a single message, which can be quickly edited, and observe the behaviour of the thread?
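A rough sketch of the observation tool doubling as a stub is shown below. The `InterfaceTap` class, the message contents and the trivial `module_under_test` function are all hypothetical; the point is only that the same component that records traffic can also stand in for a missing or faulty peer.

```python
import queue


class InterfaceTap:
    """Sits on one interface: records traffic and can inject messages."""

    def __init__(self):
        self.to_module = queue.Queue()  # messages delivered to the module under test
        self.log = []                   # (source, message) observation record

    def forward(self, message):
        """Pass through a message that came from the real peer."""
        self.log.append(("peer", message))
        self.to_module.put(message)

    def inject(self, message):
        """Stand in for the peer by delivering a scripted message."""
        self.log.append(("injected", message))
        self.to_module.put(message)


def module_under_test(tap):
    """Hypothetical module: acknowledges every message it receives."""
    replies = []
    while not tap.to_module.empty():
        replies.append(b"ACK:" + tap.to_module.get())
    return replies


tap = InterfaceTap()
for step in (b"INIT", b"CONFIG", b"START"):
    tap.inject(step)   # feed the required data sequence
replies = module_under_test(tap)
```

Because injected and forwarded messages share one log, the record shows exactly which parts of a run were real and which were simulated.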
Developing a test procedure? Why not script it? Need an event injected 10 minutes 56 seconds after another event? Script it, then archive the system transactions as golden data for use in system confirmation or regression testing. Python has become a commonly used scripting and prototyping language. The Python interpreter is also available as a library that can be linked into (embedded in) an existing application. Through it, the application can expose its internal calls to Python. Using this library, the application can run any Python script and allow the script to make direct calls into the application’s internal functions. Integrating the Python library into our debugging tool would allow us to quickly script our tools and develop repeatable regression test procedures.
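To make the idea concrete, here is a small sketch of a scripted test procedure with timed event injection and a golden-data comparison. It runs against a simulated clock so that a delay of 10 minutes 56 seconds costs nothing to replay; `ScriptedRun`, `fake_system` and the event names are assumptions for illustration, and in a real tool the handler would call into the system under test.

```python
import heapq


class ScriptedRun:
    """Replay a test script on a simulated clock and record the transactions."""

    def __init__(self, handler):
        self.handler = handler  # callable(event) -> response; stands in for the system
        self.pending = []       # heap of (time_s, sequence, event)
        self.sequence = 0       # tie-breaker so equal times keep script order
        self.trace = []         # recorded (time_s, event, response) tuples

    def at(self, time_s, event):
        """Schedule an event at an absolute time in seconds."""
        heapq.heappush(self.pending, (time_s, self.sequence, event))
        self.sequence += 1

    def after(self, earlier_time_s, delay_s, event):
        """Schedule an event a fixed delay after an earlier event's time."""
        self.at(earlier_time_s + delay_s, event)

    def run(self):
        """Deliver events in time order and return the recorded trace."""
        while self.pending:
            time_s, _, event = heapq.heappop(self.pending)
            self.trace.append((time_s, event, self.handler(event)))
        return self.trace


def fake_system(event):
    """Hypothetical system response, used only for this example."""
    return event.lower() + "_ok"


def record():
    run = ScriptedRun(fake_system)
    run.at(5, "POWER_ON")
    run.after(5, 10 * 60 + 56, "SELF_TEST")  # 10 min 56 s after POWER_ON
    return run.run()


golden = record()          # archive this trace as golden data
assert record() == golden  # a later run must reproduce it exactly
```

When a change to the system alters a response or a timing, the comparison against the archived golden trace fails and points at the first differing transaction.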
We have discussed the three rules that are critical to system integration: Observe, Understand and Simulate. Most systems today already have the pieces required to build a tool capable of helping us achieve this. The pieces just haven’t been put together, or thought of early enough in the process. If you are planning the integration of a complex system, take the time to identify where potential system errors can come from, what types of events in the system can trigger specific behaviours, and what interfaces are required for communicating messages. Using the kind of tool discussed in this article will ensure that when a fault occurs, you’ll be ready to debug it quickly and thoroughly.