4th USENIX Windows Systems Symposium Paper 2000   
Ryan S. Wallach Avaya Communication
Abstract
Introduction

Lucent Technologies’ DEFINITY® Enterprise Communications System is a highly reliable (99.999% uptime) large communications server. The software base contains several million lines of code. Like any large software system, each release of DEFINITY contains software bugs that are not discovered until the system has been installed at a customer’s premises. Many bugs are easily reproducible in the development lab from a customer’s description. It is impossible, however, to precisely reproduce the conditions of an installed system, and therefore some problems cannot be reproduced in the lab. When this happens, it is necessary to debug the system at the customer’s premises without disturbing its operation.

DEFINITY is implemented as a collection of processes in a multitasking proprietary operating system. It contains a proprietary client/server debugger, Gemini, which lets support engineers at a Lucent site securely connect to a customer’s system and non-intrusively debug the software. Gemini consists of a small DEFINITY server process (known as the agent) which controls the target processes, and a UNIX client process (known as the host) that accepts commands and sends messages to the agent to execute them. The host process runs on the support engineer’s workstation. It has access to symbolic information for the DEFINITY processes, so it translates symbol names into addresses for the agent.

Gemini supports the traditional model of debugging, that is, the user can set a breakpoint, wait for the process to halt, then single step it and examine data to find the cause of the bug. These features are often used during product development under controlled conditions. However, DEFINITY processes are interrelated and time-dependent. If a process is halted for more than a few milliseconds, the system could (in the course of error recovery) reinitialize itself, and this could disrupt the customer’s business.
When debugging a live system, then, special debugging capabilities are needed. Gemini provides four key features:

The Search for a Debugger on Windows NT

In order to understand the rationale behind Gemini Lite’s design, it is important to understand the Win32 Debugging API and how this affects off-the-shelf debuggers for Windows NT.

The Win32 Debugging API

Windows NT provides an API for developers to create their own debugger [1]. Typically, a debugger attaches to a running process by calling DebugActiveProcess() with the target process id as an argument. This registers the debugger with the operating system. The debugger then calls WaitForDebugEvent(), which makes the calling thread of the debugger block until a debugging event is sent to it by the operating system. Windows initially sends events to the debugger to give it handles to each thread in the process being debugged. The debugger then receives events when threads in the target process hit a breakpoint, generate an unhandled exception, etc. [2]

The Win32 debugging API is similar to the ptrace() system call interface used by some UNIX debuggers such as GDB [3] (some versions of UNIX do debugging through the /proc filesystem instead of through ptrace(), and there is nothing similar in Windows NT). To initiate a debug session, a UNIX debugger can call ptrace() with a PTRACE_ATTACH request, which allows it to control the target process as if it were its parent. This is analogous to NT’s DebugActiveProcess() call. After the debugger has attached to the process, it receives SIGCHLDs when something happens to the target, or it can do a wait() or waitpid() to receive notification of events from the target. The wait() or waitpid() calls are analogous to NT’s WaitForDebugEvent() calls.

The substantial difference between the Windows NT and
UNIX APIs is that a UNIX debugger can call ptrace() with a PTRACE_DETACH
request to disconnect the debugger from the target. The target continues
to run after the debugger is disconnected, and the parent-child relationship
between the debugger and the target is destroyed. Windows NT does not provide
a clean way to detach a target, i.e., there is no call to undo a DebugActiveProcess()
request. Furthermore, once the process that has initiated a debugging session
exits, Windows NT kills the processes that it was debugging [4]. Microsoft
plans to address this issue in a later release of Windows NT (in NT 6.0
or later) [5], but for now there is no workaround.

Debuggers Investigated

We investigated several Microsoft debuggers for Windows NT (WinDbg, Visual C++ IDE, and ntsd) [6][7][8] to determine if they could be used for DEFINITY ONE. Each of these appears to use the Win32 API and kills the debugged process when it exits. Since the Microsoft-provided debuggers could not satisfy our requirement that they be able to cleanly detach, we investigated commercially available debuggers such as GDB, NuMega’s SoftICE [9], and Oasys MULTI [10]. These exhibited the same problem.

Many other debuggers have been built for multithreaded applications on different operating systems [11][12][13], and multi-process, non-intrusive debuggers have also been built [14]. These debuggers have all been built either by using the native debugger API provided by the operating system or extending it to meet the needs of the debugger. Building a debugger (or adopting one of these debuggers) on Windows NT using the debugging API is not acceptable for the reasons discussed above, and since Windows NT is not an open source operating system, it would not be possible to extend it to support one of these debuggers.

GDB is perhaps the most common open-source debugger available, and we considered adapting it. Besides the fact that GDB uses the Win32 debugging API, there were other reasons that we chose not to use it. First, GDB can only debug a single process at a time and is intrusive [15]. This behavior stems from the core of GDB; modifying this would be, says the Cygnus White Paper on GDB, "a daunting task because of its complexities…". Furthermore, our project used the Microsoft Visual C++ 5.0 compiler, and the version of GDB available during our development cycle only supported COFF format symbolic information in the executables. The Microsoft compiler only emits CodeView symbolic information in executable files (and DLLs).

Because no suitable off-the-shelf debugger could be found, we developed Gemini Lite.
Gemini Lite is a general-purpose Windows NT debugger. It can be used to debug any NT process (not just DEFINITY) assuming the process is properly linked. Gemini Lite is non-intrusive and does not use the Win32 debugging API. It has the basic features of other debuggers, but its architecture permits it to have unattended action lists and to debug processes without killing them once it exits.

Design of Gemini Lite

Overview

The Win32 API provides the mechanisms needed to implement basic debugging features in Gemini Lite. Table 1 shows which Win32 functions can be used to implement the core features of the debugger [16].
Table 1. Win32 support for debugger features
When DebugActiveProcess() is used to implement a debugger, Windows NT sends the debugger the handles to the desired threads. Without using this API, the only way for the debugger to have access to the thread handles is for it (or the part of it that actually controls other processes) to be integrated into the application code. Due to the size of the DEFINITY code base, it was not feasible to change the application code to accommodate the debugger.

We used a client/server approach to separate the portion of Gemini Lite that interacts with the user from the portion that controls processes, which must be somehow linked into the application. The first part of Gemini Lite, the debugger process, is what the user runs to access the debugger. It acts like DEFINITY’s Gemini host, accepting input from the user and sending the input to the server to be parsed and executed. The server part of the architecture, which is the core of Gemini Lite, is a DLL that is linked with the applications that can be debugged (for the rest of this paper, "the DLL" refers to this). The DLL takes the place of the Gemini agent and is responsible for parsing and executing the commands sent by the debugger process.

The DLL must be linked with both the debugger process and all the target processes. To force the application processes to link with the DLL without changing their code, they must be linked (with the Visual C++ linker) using the -include <symbol> directive and the appropriate export library for the DLL. The -include directive places a reference to the specified symbol (which is some globally exported symbol in the DLL) into the executable, which forces the DLL to be loaded when the process is run [18]. This limits the utility of the debugger somewhat, as it can only debug processes that are linked with the DLL, but for DEFINITY ONE this was an acceptable constraint.

A general overview of Gemini Lite’s architecture appears
in Figure 1. The figure illustrates
that all processes linked with the DLL share some common memory. The contents
of the shared memory are defined by the DLL; it contains exported functions
as well as shared data. The DLL defines another set of exported variables;
this set of variables has unique copies in each of the debugged processes.
The DLL also has non-exported code and data that are not visible to them.
The debugger itself is just another user process that
can call routines inside the DLL that implement its functions. These routines
are exported to all processes, not just the debugger, and this architecture
makes it possible for processes to call debugger routines when they hit
breakpoints, which will be discussed in detail below.

Shared Memory in the DLL

The shared memory area of the DLL contains data that must be shared between the debugger and the user processes. The data structures must be statically allocated at compile time because any memory dynamically allocated by one process would not be in the address space of other processes, including the debugger. Furthermore, there is no guarantee that the DLL will be mapped to the same address space in each user process, so any traditional data structure that uses pointers (e.g., linked lists or hash tables) is not suitable for the DLL. Array-based hash tables, lists, and queue template classes were defined to hold the shared data.

The data structures contained in the shared memory area include:
The shared memory area also contains variables that track
whether the debugger is running. Since the debugger is a user process,
it also calls DllMain() when it starts. DllMain() checks the name of each
calling process, and when it finds the name of the debugger process (a
predefined name), it notes that the debugger is running. Another mechanism
could also be used in DllMain() to determine which user process is the
debugger. The DLL needs to know whether the debugger is attached because
certain status messages (e.g., breakpoints being hit) are written to a
queue in the shared memory area by functions in the DLL. The debugger has
a thread that looks at this queue and displays the messages to the user.
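
The pointer-free layout and the status message queue described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Gemini Lite's code: a Python bytearray stands in for the DLL's shared memory area, a threading.Lock stands in for the mutex that guards the queue, and the names SharedMessageQueue, SLOT_SIZE, and SLOT_COUNT are hypothetical. The point it shows is that the queue records positions as slot indices rather than addresses, so the data stays valid wherever the buffer happens to be mapped.

```python
import struct
import threading

SLOT_SIZE = 64                   # hypothetical fixed size of one message slot, in bytes
SLOT_COUNT = 4                   # hypothetical queue capacity
HEADER = struct.Struct("<III")   # head index, tail index, message count


class SharedMessageQueue:
    """Fixed-capacity FIFO stored entirely inside one flat buffer.

    Positions are slot indices, never addresses, so the layout does not
    depend on where each process maps the buffer.
    """

    def __init__(self, buffer=None, lock=None):
        size = HEADER.size + SLOT_SIZE * SLOT_COUNT
        self.buf = buffer if buffer is not None else bytearray(size)
        # Stand-in for the mutex that protects the queue in shared memory.
        self.lock = lock if lock is not None else threading.Lock()

    def _slot_offset(self, index):
        return HEADER.size + index * SLOT_SIZE

    def put(self, text):
        """Append a message; return False (dropping it) when the queue is full."""
        data = text.encode("ascii", "replace")[: SLOT_SIZE - 1]
        with self.lock:
            head, tail, count = HEADER.unpack_from(self.buf, 0)
            if count == SLOT_COUNT:
                return False
            off = self._slot_offset(tail)
            self.buf[off] = len(data)                        # one length byte
            self.buf[off + 1 : off + 1 + len(data)] = data   # message payload
            HEADER.pack_into(self.buf, 0, head, (tail + 1) % SLOT_COUNT, count + 1)
            return True

    def drain(self):
        """Remove and return all queued messages in arrival order."""
        messages = []
        with self.lock:
            head, tail, count = HEADER.unpack_from(self.buf, 0)
            while count:
                off = self._slot_offset(head)
                length = self.buf[off]
                messages.append(bytes(self.buf[off + 1 : off + 1 + length]).decode("ascii"))
                head = (head + 1) % SLOT_COUNT
                count -= 1
            HEADER.pack_into(self.buf, 0, head, tail, 0)
        return messages
```

In this sketch a target thread would call put() from inside the DLL, while the debugger's display thread would call drain() in a loop on a second SharedMessageQueue object built over the same buffer and lock.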

Process and Thread Registration in the DLL

As mentioned before, Windows NT forces each process and thread linked with the DLL to call DllMain() when they are created. NT passes a parameter to DllMain() that indicates the reason for the call. When new processes start up, they call DllMain() one or more times. A process’s first call to DllMain() has this parameter set to DLL_PROCESS_ATTACH. This notifies the DLL that the process (and its primary thread) has attached to the DLL. Subsequent calls by threads in the process to DllMain() set the parameter to DLL_THREAD_ATTACH and inform the DLL that additional threads in the process have been created.

When the Gemini Lite DLL’s DllMain() is called with a DLL_PROCESS_ATTACH message, the DLL determines whether the process is the debugger or an application process. As mentioned above, if the process is the debugger, the DLL stores its pid in shared memory and sets a status variable in the DLL to reflect that the debugger is active. For the primary thread (and other threads) of user processes, the DLL creates an object to represent the thread in its shared memory area. The thread id and handle are stored in the object. DllMain() then sets the thread’s unhandled exception filter to point to a routine inside the DLL.

Processes and threads also notify the DLL when they exit normally. When a thread exits, NT forces it to call DllMain() with a reason of DLL_THREAD_DETACH. Similarly, when a process exits, NT forces it to call DllMain() with a reason of DLL_PROCESS_DETACH. Note that the call with DLL_THREAD_DETACH is not made for all threads that are running when the process exits; only the DLL_PROCESS_DETACH call is made. When the DLL gets these calls, it frees the object and associated data structures that were allocated for the thread (or threads, if the process detached) including its breakpoints.
In some circumstances (e.g., a call to TerminateProcess() or TerminateThread()), it is possible that processes and threads can be terminated without calling DllMain(). When this happens, the DLL does not know that the process or thread is gone, so it cannot free the related data structures. Because the tables holding the data are statically allocated and were sized to accommodate the number of threads and processes running in DEFINITY ONE, it is possible that they may fill up with information about processes and threads that no longer exist. If the tables are full when the DLL attempts to register a process, the DLL checks all registered threads to make sure that they are still valid, and it frees up entries that are no longer valid. Because the debugger is just another user process, the
DLL can detect when it exits through its call to DllMain(). In order to
prevent any application processes that have breakpoints set from stopping,
the DLL disables all breakpoints that may have been set in other processes
and it resumes execution of any threads that may have been stopped when
it detects that the debugger exited.
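
The registration bookkeeping described in this section can be sketched as a small simulation. It is an illustration under stated assumptions rather than the DLL's actual code: the reason codes carry the values defined in the Win32 headers, but the class ThreadRegistry, the is_alive probe, and the debugger name "debugger.exe" are hypothetical, and a Python list stands in for the statically sized table in shared memory. The sketch shows dispatch on the DllMain() reason code and the pruning of stale entries when the table is full.

```python
# Reason codes Windows passes to DllMain(); values as in the Win32 headers.
DLL_PROCESS_DETACH = 0
DLL_PROCESS_ATTACH = 1
DLL_THREAD_ATTACH = 2
DLL_THREAD_DETACH = 3


class ThreadRegistry:
    """Fixed-capacity table of (pid, tid) entries, like a statically sized array."""

    def __init__(self, capacity, is_alive, debugger_name="debugger.exe"):
        self.slots = [None] * capacity      # None marks a free entry
        self.is_alive = is_alive            # hypothetical probe: (pid, tid) -> bool
        self.debugger_name = debugger_name  # hypothetical predefined debugger name
        self.debugger_pid = None            # set while the debugger is attached

    def dll_main(self, reason, pid, tid, process_name=""):
        """Dispatch on the reason code; return False if registration failed."""
        if reason == DLL_PROCESS_ATTACH and process_name == self.debugger_name:
            self.debugger_pid = pid
            return True
        if pid == self.debugger_pid:
            if reason == DLL_PROCESS_DETACH:
                self.debugger_pid = None    # the debugger has exited
            return True                     # debugger threads are not tracked
        if reason in (DLL_PROCESS_ATTACH, DLL_THREAD_ATTACH):
            return self._register(pid, tid)
        if reason == DLL_THREAD_DETACH:
            self._free(lambda entry: entry == (pid, tid))
        elif reason == DLL_PROCESS_DETACH:
            self._free(lambda entry: entry[0] == pid)   # drop every thread of the process
        return True

    def _register(self, pid, tid):
        if None not in self.slots:
            # Table full: some entries may belong to threads that were
            # terminated without a detach call, so validate them first.
            self._free(lambda entry: not self.is_alive(*entry))
        if None not in self.slots:
            return False
        self.slots[self.slots.index(None)] = (pid, tid)
        return True

    def _free(self, should_free):
        for i, entry in enumerate(self.slots):
            if entry is not None and should_free(entry):
                self.slots[i] = None

    def registered(self):
        return {entry for entry in self.slots if entry is not None}
```

The sketch omits the cleanup the paper describes for a debugger exit (disabling breakpoints and resuming stopped threads); in this model that work would hang off the branch that clears debugger_pid.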

Debugger Process

The Gemini Lite debugger process has two threads. The main thread runs in a loop which prompts the user for commands, reads the command line, and calls the appropriate functions in the DLL to parse and execute the command. The second thread repeatedly locks the mutex protecting the message queue in the DLL shared memory, removes and displays any messages found in the queue, then releases the mutex. As a result, the user is immediately informed of events such as breakpoints being hit regardless of what he or she may be doing in the debugger (typing commands or viewing output).

Implementation of Debugging Features

Symbolic Debugging

Debuggers like GDB typically read symbolic information for the process they are debugging from the executable file and build internal symbol tables for use by the debugger. Gemini Lite does not directly read the symbolic information for the processes and threads that it debugs. Instead, it relies on the Win32 symbol handling routines contained in IMAGEHLP.DLL [20]. These routines provide the capability to obtain an address in a running process from the name of a global symbol and vice-versa. The first time the user issues a command for a thread in a process that takes an address as a parameter, Gemini Lite calls SymInitialize() and passes it the handle to the process to initialize the symbol handler. It then loads the symbols for the process by enumerating all its modules and calling SymLoadModule() for each of them. Once the symbols have been loaded, Gemini Lite uses SymGetSymFromName() to translate global symbol names into an address or SymGetSymFromAddr() to translate an address into a global name.

In Windows NT 4.0, IMAGEHLP.DLL does not provide facilities for translating a file name and line number into an address and vice versa. Microsoft has added the SymGetLineFromAddr() and SymGetLineFromName() functions to the Win32 API in Windows 2000 to accomplish this.
In order to perform this function in Windows NT 4.0, a program would have to directly examine the CodeView debugging information in the executables (or in separate .DBG files). Time constraints only permitted us to display file and line number information in Gemini Lite’s disassembly routines. Other Gemini Lite commands (such as for setting breakpoints) cannot accept a file and line number in place of a text address.

Use of the IMAGEHLP.DLL symbol handling functions requires that the DLLs and EXEs that make up the processes being debugged are compiled with debugging information. The debugging information must be compiled into the objects, not placed in a program database (PDB) file. However, it is usually undesirable to ship production code without stripping debugging information. To avoid this, we used the rebase tool shipped with Visual C++ to strip the debugging information from the compiled objects and place it in separate .DBG files. When the application needs to be debugged, the .DBG files are copied to the target machine, and then the _NT_SYMBOL_PATH environment variable is set before running Gemini Lite. This environment variable tells the IMAGEHLP.DLL symbol handling routines where to find the symbols.

We ship the symbol files (in an encrypted form) with the DEFINITY ONE system. When support engineers need to debug, they use Windows RAS or a TCP/IP network to establish a connection to the system. They then decrypt the symbol files and run the debugger in a window directly on the target system.

Stopping and Restarting Execution

The debugger can force threads to halt execution by calling SuspendThread() with the handle to the thread. The debugger obtains the handle from the shared memory area in the DLL. Before using the handle, the debugger must call DuplicateHandle() to obtain a handle in its context; the handle stored in the DLL is a handle in the context of the process that registered with the DLL.
To resume execution of a thread, the debugger calls ResumeThread(), passing it the handle to the thread.

Reading and Writing Memory

The debugger commands that need to read or write memory do so by calling ReadProcessMemory() and WriteProcessMemory(). The debugger must have permission to read the memory of the processes it’s debugging. When the debugger is debugging processes started by the same user, this is not a problem. For DEFINITY ONE, we require that the debugger can be started only by a privileged user. Some processes we need to debug are started by a system service. Ordinarily, a user process does not have permission to access a system service. We modified the default discretionary access control lists (DACLs) of our system level processes to give the account that can run the debugger full access to them.

Reading and Writing Registers

Registers in a thread can only be read by reading the thread context. Debugger commands that need to read registers first use SuspendThread() to stop the thread unless it is already halted. They then call GetThreadContext() to retrieve the context of the thread, which includes the contents of the registers. After the context is obtained, ResumeThread() is called if the thread needs to continue execution.

To write to a register, the context image returned by GetThreadContext() is modified to contain the updated register value, then SetThreadContext() is used to write the modified context back to the thread.

Breakpoints

Like most debuggers for software running on x86 processors, Gemini Lite sets breakpoints in a process by replacing the first byte of the instruction at the breakpoint address with 0xCC (INT3). The original instruction byte is saved in the record for the breakpoint in the shared memory area of the DLL so that it can be restored later.
Because all threads in a process have the same address space, a breakpoint set in a process will affect all the threads in the process.

Figure 2 illustrates the sequence of events that occurs when a thread hits a breakpoint. First, the thread raises a breakpoint exception when it executes the instruction at the breakpoint address. The system stores the thread’s context in a context record (the value of the EIP register in the record is set to the address where the thread encountered the exception) and forces the thread to call the appropriate exception filter. If no other exception filter handles breakpoint exceptions (which is a requirement for processes linked with the DLL), then the exception filter in the DLL (which was set as the unhandled exception filter for the process when it attached to the DLL) will be called. The exception filter receives a pointer to the context record as well as a pointer to an exception record, which contains the exception code, the address at which the exception occurred, and other information.
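
The byte patching behind a breakpoint, and the lookup by exception address, can be sketched with a simulation. This is a minimal illustration under stated assumptions, not Gemini Lite's implementation: a bytearray stands in for the target's code (which the real debugger would modify through WriteProcessMemory()), a Python dict stands in for the breakpoint records in shared memory, and the class name BreakpointTable is hypothetical.

```python
INT3 = 0xCC   # x86 breakpoint opcode


class BreakpointTable:
    """Tracks breakpoints set in a block of simulated code bytes."""

    def __init__(self, memory):
        self.memory = memory   # bytearray standing in for the target's code
        self.saved = {}        # breakpoint address -> original first byte

    def set(self, address):
        """Save the first instruction byte, then overwrite it with INT3."""
        if address in self.saved:
            return             # already set; keep the byte saved the first time
        self.saved[address] = self.memory[address]
        self.memory[address] = INT3

    def clear(self, address):
        """Restore the original byte and forget the breakpoint."""
        self.memory[address] = self.saved.pop(address)

    def lookup(self, exception_address):
        """Return the saved byte if a breakpoint is set there, else None."""
        return self.saved.get(exception_address)


# push ebp; mov ebp, esp; sub esp, 8; ret
code = bytearray(b"\x55\x8b\xec\x83\xec\x08\xc3")
table = BreakpointTable(code)
table.set(3)   # breakpoint on the first byte of "sub esp, 8"
```

Setting the same address twice is ignored in this sketch so that the saved byte is never overwritten with 0xCC, which would make the original instruction unrecoverable.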
If the exception type is EXCEPTION_BREAKPOINT, then the thread checks the list of breakpoints in shared memory of the DLL to determine if a breakpoint was set at the exception address. If no breakpoint is found, there is no way for the thread to continue, so the filter will return EXCEPTION_CONTINUE_SEARCH. If a breakpoint is found, the thread determines if it should halt. Breakpoints may have a threshold stored in the object representing them that specifies the number of times the breakpoint is to be hit before a thread will stop. Also, the thread will only halt if the debugger is running, as indicated by the variable that the debugger sets when it registers with the DLL. If the thread determines that it must halt, it creates a message notifying the user that the breakpoint has been hit, and it puts it in the message queue for the debugger. It sets a variable in its record in the DLL’s shared memory indicating that it is suspended due to the breakpoint, and then it suspends itself by calling SuspendThread() with its thread handle as a parameter. The debugger process that the user is running, meanwhile, contains two threads. One reads and executes commands from the user, and the other checks the message queue from the DLL. After the target thread puts the message into the queue indicating that it hit the breakpoint, this thread of the debugger displays it to the user. When the user decides to resume execution of the thread, he or she gives the appropriate command to the debugger, which calls a function in the DLL. This function examines the record for the thread in shared memory. If the state of the thread indicates that it has been halted at a breakpoint, then the function gets the handle to the thread that is stored in shared memory and passes it to ResumeThread(). This wakes up the thread that hit the breakpoint. When the thread wakes up it is still in the exception handler, and the breakpoint instruction is still at the breakpoint address. 
The thread replaces the breakpoint instruction with the original instruction, sets status variables in its record in shared memory to indicate that it has been resumed after a breakpoint, and sets the Trap Flag in the image of the EFLAGS register stored in the saved thread context that was passed to the exception handler. It then returns EXCEPTION_CONTINUE_EXECUTION. This forces NT to restore its context from the saved image (with the modified EFLAGS register) and continue executing where the exception occurred. The thread executes the real instruction at the breakpoint address, and then, because the Trap Flag in the EFLAGS register is set, it generates a single step exception. Again, Windows NT forces the thread to call the unhandled exception filter. In the exception filter, the thread sees that the exception code is EXCEPTION_SINGLE_STEP. It checks the status variables in shared memory and figures out that it has single stepped after the previously hit breakpoint. The thread then saves the instruction at the breakpoint address, reinserts the breakpoint, and again the exception filter returns EXCEPTION_CONTINUE_EXECUTION, which lets the thread continue executing at the instruction after the breakpoint. The thread then continues executing until some other event occurs. Gemini Lite’s handling of breakpoints differs from the traditional implementation. In a debugger written using the Win32 API, for example, when a thread hit a breakpoint, the system would suspend it. A debugger doing a WaitForDebugEvent() would be woken up, and it would decide whether to keep the process halted or restart it with a call to ContinueDebugEvent() (after replacing the breakpoint instruction with the original instruction and setting EFLAGS appropriately). In Gemini Lite, the thread decides itself whether it should be suspended, and it suspends itself. In both cases, the debugger causes the thread to resume execution. 
In the Win32 case, the thread resumes where it was stopped by the system, in the application code. In Gemini Lite, the thread resumes execution in the exception handler. When it returns from the handler, the system causes it to resume executing where the exception was raised.

Action Lists

An action list is a list of debugger commands to be executed when a breakpoint is hit. When the Gemini Lite user sets a breakpoint in a thread, he or she may also supply the action list. The action list is stored with the breakpoint information in the shared memory area in the DLL. In the exception filter, if the thread determines that a breakpoint has an action list, instead of calling SuspendThread(), it reads the list of action list commands from shared memory and passes them to the same function in the DLL that the debugger executable runs to execute commands that are input by the user. Since all the functions of Gemini Lite are also in the DLL, they can be executed just as if the user were giving them on the command line. The output from the commands is directed to a large circular buffer, also in shared memory. The output will stay in the buffer until the Gemini Lite user clears it.

Note that if a breakpoint has an action list, Gemini Lite does not have to be running in order for the action list commands to execute, because the thread automatically resumes execution after the action list commands are run. This makes action lists very useful for unattended debugging. The user can set up the breakpoints with action list commands to dump data of interest, exit Gemini Lite, and come back later to examine the data.

The user can also set a flag in the DLL’s shared memory area that the thread will check after it executes the action list commands. If the flag is set and the debugger is running, the thread will generate a message for the debugger that tells the user that the breakpoint was hit.
This feature can be used in combination with an empty action list to let the user know that the thread executed code at a particular instruction without having to halt the thread.

Single-Stepping

When a thread is stopped, either after being halted by the user or by hitting a breakpoint, the user may wish to step through the execution of the program being debugged. Because Gemini Lite relies only on the IMAGEHLP.DLL symbol handler to read debugging information, it does not have access to the information that links an address to the program source file and line number. Consequently, Gemini Lite can only step through a program a given number of assembly-language instructions at a time.

The implementation of single stepping was seen above in the discussion of breakpoints. When the target thread is halted, the user’s single step command sets the Trap Flag in the EFLAGS register and resumes execution of the thread by calling ResumeThread(). If the thread is not halted after a breakpoint, the flag is set by getting the thread context, modifying it, and writing it back; if it is, the flag is set by modifying the saved context in the exception record. The user can specify the number of instructions to single-step; this number is stored in shared memory in the DLL. After resuming execution, the target thread executes one instruction and generates an exception, sending it into the exception filter in the DLL. If the thread has stepped the desired number of instructions, it puts a message in the queue for the debugger to inform the user that it halted, then it calls SuspendThread() on itself. Otherwise, the thread decrements the step count, returns from the exception filter, and continues stepping. After the thread is finally halted, the user can resume execution of the thread or single-step it again.

As with breakpoints, when the thread is in the exception filter it checks to see if Gemini Lite is running before calling SuspendThread(). If the debugger is not present, the thread will not stop.
This avoids the situation where a user requests a single step of a large number of instructions, but then exits the debugger before the stepping is completed.

Related Tools

Since the IMAGEHLP.DLL routines only locate global symbol names, we needed a set of tools to use with Gemini Lite which could show us the layout of structures in memory, addresses of individual array elements, and addresses of global functions and variables. In the UNIX environment, these functions are provided by tools like objdump (from GNU) and nm. On Windows NT, the Microsoft-provided tools to do these things (such as dumpbin) are part of Visual C++ and cannot be run without it. We developed a standalone set of tools to do these things. The development was difficult, in part, because Microsoft’s compilers emit symbolic information in a proprietary format (CodeView), and Microsoft does not provide any libraries for manipulating this information. We generated our own set of routines from Microsoft’s symbolic debugging information specification [21].

Conclusion

Gemini Lite was used during the development of DEFINITY ONE to solve some difficult problems. In one case, an uninitialized variable was causing incorrect information to be displayed on DEFINITY’s administration terminal. We set breakpoints with empty action lists both where we knew the code had executed and where we thought it should be executing. When these breakpoints are hit, Gemini Lite puts a message into its output buffer. By looking at the buffer, we were able to see where the code failed to branch as we thought it should. At that point, we used an action list to display a variable that determined where the code branched. After seeing that the value in this variable could not have been set by the code that had executed, a close examination of the code showed that the variable had not been initialized.

Our experience with Gemini Lite suggests some enhancements.
First, Gemini Lite could be enhanced to read CodeView information from the processes it’s debugging and maintain its own symbol table. With this information, Gemini Lite would have a knowledge of variable type information, mapping of source files and line numbers to addresses, locations and names of local variables in functions, and more information that would enable it to be a source level debugger instead of an assembly level debugger. A networked client-server approach to Gemini Lite has also been proposed which would eliminate the need to keep the symbol files (.DBG files) on the system being debugged.

Acknowledgements

I would like to thank Bhavesh Davda, Bill Lyford, and David Walters for their assistance during the development of Gemini Lite.

References
This paper was originally published in the
Proceedings of the 4th USENIX Windows Systems Symposium,
August 3-4, 2000, Seattle, Washington, USA
Last changed: 29 Jan. 2002