Current crypto implementations rely on software running under general-purpose
operating systems alongside a horde of untrusted applications, ActiveX
controls, web browser plugins, mailers handling messages with embedded active
content, and numerous other threats to security, with only the OS’s (often
almost nonexistant) security to keep the two apart. This paper presents a
general-purpose open-source crypto coprocessor capable of securely performing
crypto operations such as key management, certificate creation and handling,
and email encryption, decryption, and signing, at a cost one to two orders of
magnitude below that of commercial equivalents while providing generally
equivalent performance and a higher level of functionality. The paper examines
various issues involved in designing the coprocessor, and explores options for
hardware acceleration of crypto operations for extended performance above and
beyond that offered by the basic coprocessor’s COTS hardware.
1. Problems with Crypto on End-user Systems
The majority of current crypto implementations run under general-purpose
operating systems with a relatively low level of security, alongside which
exist a limited number of smart-card assisted implementations which store a
private key in, and perform private-key operations with, a smart card.
Complementing these are an even smaller number of implementations which perform
further operations in dedicated (and generally very expensive) hardware.
The advantage of software-only implementations is that they are inexpensive and
easy to deploy. The disadvantage of these implementations is that they provide
a very low level of protection for cryptovariables, and that this low level of
security is unlikely to change in the future. For example Windows NT provides
a function ReadProcessMemory which allows a
process to read the memory of (almost) any other process in the system (this
was originally intended to allow debuggers to establish breakpoints and
maintain instance data for other processes [1]), allowing both
passive attacks such as scanning memory for high-entropy areas which constitute
keys [2] and active attacks in which a target processes’ code
or data is modified (in combination with
VirtualProtectEx, which changes the protection on another processes’
memory pages) to provide supplemental functionality of benefit to a hostile
process. By subclassing an application such as the Windows shell, the hostile
process can receive notification of any application (a.k.a. "
target") starting up or shutting down, after which it can apply the
mechanisms mentioned previously. A very convenient way to do this is to
subclass a child window of the system tray window, yielding a system-wide hook
for intercepting shell messages [3]. Another way to obtain
access to other processes’ data is to patch the user-to- kernel- mode jump
table in a processes’ Thread Environment Block (TEB), which is shared by all
processes in the system rather than being local to each one, so that changing
it in one process affects every other running process [4].
Although the use of functions like ReadProcessMemory
requires Administrator privileges, most users tend to either run their
system as Administrator or give themselves equivalent privileges since it’s
extremely difficult to make use of the machine without these privileges. In
the unusual case where the user isn’t running with these privileges, it’s
possible to use a variety of tricks to bypass any OS security measures which
might be present in order to perform the desired operations. For example by
installing a Windows message hook it’s possible to capture messages intended
for another process and have them dispatched to your own message handler.
Windows then loads the hook handler into the address space of the process which
owns the thread which the message was intended for, in effect yanking your code
across into the address space of the victim [5]. Even simpler
are mechanisms such as using the
HKEY_LOCAL_MACHINE\Software\Microsoft\Windows
NT\CurrentVersion\Windows\AppInit_DLLs key, which specifies a list of
DLLs which are automatically loaded and called whenever an application uses the
USER32 system library (which is automatically used by
all GUI applications and many command-line ones). Every DLL specified in this
registry key is loaded into the processes’ address space by
USER32, which then calls the DLL’s DllMain
function to initialise the DLL (and, by extension, trigger whatever
other actions the DLL is designed for).
A more sophisticated attack involves persuading the system to run your code in
ring 0 (the most privileged security level usually reserved for the OS kernel)
or, alternatively, convincing the OS to allow you to load a selector which
provides access to all physical memory (under Windows NT, selectors 8 and 10
provide this capability). Running user code in ring 0 is possible due to the
peculiar way in which the NT kernel loads. The kernel is accessed via the int
2Eh call gate, which initially provides about 200 functions via NTOSKRNL.EXE but is then extended to provide more and more
functions as successive parts of the OS are loaded. Instead of merely adding
new functions to the existing table, each new portion of the OS which is loaded
takes a copy of the existing table, adds its own functions to it, and then
replaces the old one with the new one. To add supplemental functionality at
the kernel level, all that’s necessary is to do the same thing [6
]. Once your code is running at ring 0, an NT system starts looking a lot
like a machine running DOS.
Although the problems mentioned so far have concentrated on Windows NT, many
Unix systems aren’t much better. For example the use of ptrace with the PTRACE_ATTACH option followed by the use
of other ptrace capabilities provides similar
headaches to those arising from ReadProcessMemory
. The reason why these issues are more problematic under NT is that
users are practically forced to run with system Administrator privileges in
order to perform any useful work on the system, since a standard NT system has
no equivalent to Unix’s su functionality and, to
complicate things further, frequently assumes that the user always has
Administrator privileges (that is, it assumes it’s a single- user system with
the user being Administrator). While it’s possible to provide some measure of
protection on a Unix system by running crypto code as a daemon in its own
memory space, the fact that the Administrator can dynamically load NT services
(which can use ReadProcessMemory to interfere
with any other running service) means that even implementing the crypto code as
an NT service provides no escape.
1.1. The Root of the Problem
The reason why problems like those described above persist, and why we’re
unlikely to ever see a really secure consumer OS is because it’s not something
which most consumers care about. One recent survey of Fortune 1000 security
managers showed that although 92% of them were concerned about the security of
Java and ActiveX, nearly three quarters allowed them onto their internal
networks, and more than half didn’t even bother scanning for them [7]. Users are used to programs malfunctioning and computers crashing
(every Windows NT user can tell you what the abbreviation BSOD means even
though it’s never actually mentioned in the documentation), and see it as
normal for software to contain bugs. Since program correctness is difficult
and expensive to achieve, and as long as flashiness and features are the major
selling point for products, buggy and insecure systems will be the normal state
of affairs [8]. Unlike other Major Problems like Y2K (which
contain their own built-in deadline), security generally isn’t regarded as a
pressing issue unless the user has just been successfully attacked or the
corporate auditors are about to pay a visit, which means that it’s much easier
to defer addressing it to some other time [9]. Even in cases
where the system designers originally intended to implement a rigorous security
system employing a trusted computing base (TCB), the requirement to add
features to the system inevitably results in all manner of additions being
crammed into the TCB, with the result that it is neither small, nor verified,
nor secure.
An NSA study [10] lists a number of features which are
regarded as "crucial to information security" but which are absent
from all mainstream operating systems. Features such as mandatory access
controls which are mentioned in the study correspond to Orange Book B-level
security features which can’t be bolted onto an existing design but generally
need to be designed in from the start, necessitating a complete overhaul of an
existing system in order to provide the required functionality. This is often
prohibitively resource-intensive, for example the task of reengineering the
Multics kernel (which contained a "mere" 54,000 lines of code) to
provide a minimised TCB was estimated to cost $40M (in 1977 dollars) and was
never completed [11]. The work involved in performing the
same kernel upgrade or redesign from scratch with an operating system
containing millions or tens of millions of lines of code would make it beyond
prohibitive.
At the moment security and ease of use are at opposite ends of the scale, and
most users will opt for ease of use over security. JavaScript, ActiveX, and
embedded active content may be a security nightmare, but they do make life a
lot easier for most users, leading to comments from security analysts like
"You want to write up a report with the latest version of Microsoft Word
on your insecure computer or on some piece of junk with a secure
computer?" [12], "Which sells more products: really
secure software or really easy-to-use software?" [13],
"It’s possible to make money from a lousy product [...] Corporate cultures
are focused on money, not product" [14], and "The
marketplace doesn’t reward real security. Real security is harder,
slower and more expensive, both to design and to implement. Since the buying
public has no way to differentiate real security from bad security, the way to
win in this marketplace is to design software that is as insecure as you can
possibly get away with [...] users prefer cool features to security" [15]. In many cases users don’t even have a choice, if they
can’t process data from Word, Excel, PowerPoint, and Outlook and view web pages
loaded with JavaScript and ActiveX, their business doesn’t run, and some
companies go so far as to publish explicit instructions telling users how to
disable security measures in order to maximise their web-browsing experience [
16]. Going beyond basic OS security, most current security
products still don’t effectively address the problems posed by hostile code
such as trojan horses (which the Orange Book’s Bell-LaPadula security model was
designed to combat), and the systems the code runs on increase both the power
of the code to do harm and the ease of distributing the code to other systems.
This presents rather a gloomy outlook for someone wanting to provide secure
crypto services to a user of these systems. In order to solve this problem, we
adopt a reversed form of the Mohammed-and-the-mountain approach: Instead of
trying to move the insecurity away from the crypto through various operating
system security measures, we instead move the crypto away from the insecurity.
In other words although the user may be running a system crawling with rogue
ActiveX controls, macro viruses, trojan horses, and other security nightmares,
none of these can come near the crypto.
1.2. Solving the Problem
The FIPS 140 standard provides us with a number of guidelines for the
development of cryptographic security modules. NIST originally allowed only
hardware implementations of cryptographic algorithms (for example the original
NIST DES document allowed for hardware implementation only [17
][18]), however this requirement was relaxed somewhat in
the mid-1990’s to allow software implementations as well [19]
[20]. FIPS 140 defines four security levels ranging from
level 1 (the cryptographic algorithms are implemented correctly) through to
level 4 (the module or device has a high degree of tamper-resistance including
an active tamper response mechanism which causes it to zeroise itself when
tampering is detected). To date only one general-purpose product family has
been certified at level 4 [21].
Since FIPS 140 also allows for software implementations, an attempt has been
made to provide an equivalent measure of security for the software platform on
which the cryptographic module is to run. This is done by requiring the
underlying operating system to be evaluated at progressively higher Orange Book
levels for each FIPS 140 level, so that security level 2 would require the
software module to be implemented on a C2-rated operating system. Unfortunately
this provides something if an impedance mismatch between the actual security of
hardware and software implementations, since it implies that products such as a
Fortezza card [22] or Dallas iButton (a relatively high-
security device) [23] provide the same level of security as a
program running under Windows NT. It’s possible that the OS security levels
were set so low out of concern that setting them any higher would make it
impossible to implement the higher FIPS 140 levels in software due to a lack of
systems evaluated at that level.
Even with sights set this low, it doesn’t appear to be possible to implement
secure software-only crypto on a general-purpose PC. Trying to protect
cryptovariables (or more generically security-relevant data items, SRDI’s in
FIPS 140-speak) on a system which provides functions like ReadProcessMemory seems pointless, even if the system does
claim a C2/E2 evaluation. On the other hand trying to source a B2 or more
realistically B3 system to provide an adequate level of security for the crypto
software is almost impossible (the practicality of employing an OS in this
class, whose members include Trusted Xenix, XTS 300, and Multos, speaks for
itself). A simpler solution would be to implement a crypto coprocessor using a
dedicated machine running at system high, and indeed FIPS 140 explicitly
recognises this by stating that the OS security requirements only apply in
cases where the system is running programs other than the crypto module (to
compensate for this, FIPS 140 imposes its own software evaluation requirements
which in some cases are even more arduous than the Orange Book ones).
An alternative to a pure-hardware approach might be to try to provide some form
of software-only protection which attempts to compensate for the lack of
protection present in the OS. Some work has been done in this area involving
the obfuscation of the code to be protected, either mechanically
[ 24] or manually [25]. The use of mechanical
obfuscation (for example reodering of code and insertion of dummy instructions)
is also present in a number of polymorphic viruses, and can be quite
effectively countered [26][27]. Manual
obfuscation techniques are somewhat more difficult to counter automatically,
however computer game vendors have trained several generations of crackers in
the art of bypassing the most sophisticated software protection and security
features they could come up with [28][29][
30], indicating that this type of protection won’t provide
any relief either, and this doesn’t even go into the portability and
maintenance nightmare which this type of code presents (it is for these reasons
that the obfuscation provisions were removed from a later version of the CDSA
specification where they were first proposed [31]). There
also exists a small amount of experimental work involving trying to create a
form of software self- defence mechanism which tries to detect and compensate
for program or data corruption [32][33]
[34][35], however this type of self-defence
technology will probably stay restricted to Core Wars Redcode programs for some
time to come.
1.3. Coprocessor Design Issues
The main consideration when designing a coprocessor to manage crypto operations
is how much functionality we should move from the host into the coprocessor
unit. The baseline, which we’ll call a tier 0 coprocessor, has all the
functionality in the host, which is what we’re trying to avoid. The levels
above tier 0 provide varying levels of protection for cryptovariables and
coprocessor operations, as shown in Figure 1. The minimal level of coprocessor
functionality, a tier 1 coprocessor, moves the private key and private-key
operations out of the host. This type of functionality is found in smart cards,
and is only a small step above having no protection at all, since although the
key itself is held in the card, all operations performed by the card are
controlled by the host, leaving the card at the mercy of any malicious software
on the host system. In addition to these shortcomings, smart cards are very
slow, offer no protection for cryptovariables other than the private key, and
often can’t even protect the private key fully (for example a card with an RSA
private key intended for signing can be misused to decrypt a key or message
since RSA signing and decryption are equivalent).
Figure 1: Levels of protection offered by crypto hardware
The next level of functionality, tier 2, moves both public/private-key
operations and conventional encryption operations along with hybrid mechanisms
such as public-key wrapping of content-encryption keys into the coprocessor.
This type of functionality is found in devices such as Fortezza cards and a
number of devices sold as crypto accelerators, and provides rather more
protection than that found in smart cards since no cryptovariables are ever
exposed on the host. Like smart cards however, all control over the devices
operation resides in the host, so that even if a malicious application can’t
get at the keys directly, it can still apply them in a manner other than the
intended one.
The next level of functionality, tier 3, moves all crypto-related processing
(for example certificate generation and message signing and encryption) into
the coprocessor. The only control the host has over processing is at the level
of "sign this message" or "encrypt this message", all other
operations (message formatting, the addition of additional information such as
the signing time and signers identity, and so on) is performed by the
coprocessor. In contrast if the coprocessor has tier 1 functionality the host
software can format the message any way it wants, set the date to an arbitrary
time (in fact it can never really know the true time since it’s coming from the
system clock which another process could have altered), and generally do
whatever it wants with other message parameters. Even with a tier 2
coprocessor such as a Fortezza card which has a built-in real-time clock (RTC),
the host is free to ignore the RTC and give a signed message any timestamp it
wants. Similarly, even though protocols like CSP which is used with Fortezza
incorporate complex mechanisms to handle authorisation and access control
issues [36], the enforcement of these mechanisms is left to
the untrusted host system rather than the card(!). Other potential problem
areas involve handling of intermediate results and composite call sequences
which shouldn’t be interrupted, for example loading a key and then using it in
a cryptographic operation [37]. In contrast, with a tier 3
coprocessor which performs all crypto-related processing independent of the
host the coprocessor controls the message formatting and the addition of
additional inforation such as a timestamp taken from its own internal clock,
moving them out of reach of any software running on the host. The various
levels of protection when the coprocessor is used for message decryption are
shown in Figure 2.
Figure 2: Protection levels for the decrypt operation
Going beyond tier 3, a tier 4 coprocessor provides facilities such as command
verification which prevent the coprocessor from acting on commands sent from
the host system without the approval of the user. The features of this level
of functionality are explained in more detail in the section on extended
security functionality.
Can we move the functionality to an even higher level, tier 5, giving the
coprocessor even more control over message handling? Although it’s possible to
do this, it isn’t a good idea since at this level the coprocessor will
potentially need to run message viewers (to display messages), editors (to
create/modify messages), mail software (to send and receive them), and a whole
host of other applications, and of course these programs will need to be able
to handle MIME attachments, HTML, JavaScript, ActiveX, and so on in order to
function as required. In addition the coprocessor will now require its own
input mechanism (a keyboard), output mechanism (a monitor), mass storage, and
other extras. At this point the coprocessor has evolved into a second computer
attached to the original one, and since it’s running a range of untrusted and
potentially dangerous code we need to think about moving the crypto
functionality into a coprocessor for safety. Lather, rinse, repeat.
The best level of functionality therefore is to move all crypto and security-
related processing into the coprocessor, but to leave everything else on the
host.
2. The Coprocessor
The traditional way to build a crypto coprocessor has been to create a complete
custom implementation, originally with ASIC’s and more recently with a mixture
of ASIC’s and general-purpose CPU’s, all controlled by custom software. This
approach leads to long design cycles, difficulties in making changes at a later
point, high costs (with an accompanying strong incentive to keep all design
details proprietary due to the investment involved), and reliance on a single
vendor for the product. In contrast an open-source coprocessor by definition
doesn’t need to be proprietary, so it can use existing COTS hardware and
software as part of its design, which greatly reduces the cost (the coprocessor
described here is one to two orders of magnitude cheaper than proprietary
designs while offering generally equivalent performance and superior
functionality), and can be sourced from multiple vendors and easily migrated to
newer hardware as the current hardware base becomes obsolete.
The coprocessor requires three layers, the processor hardware, the firmware
which manages the hardware (for example initialisation, communications with the
host, persistent storage, and so on) and the software which handles the crypto
functionality. The following sections describe the coprocessor hardware and
resource management firmware on which the crypto control software runs.
2.1. Coprocessor Hardware
Embedded systems have traditionally been based on the VME bus, a 32-bit
data/32-bit address bus incorporated onto cards in the 3U (10 x 16cm) and 6U
(23 x 16cm) Eurocard form factor [38]. The VME bus is CPU-
independent and supports all popular microprocessors including Sparc, Alpha,
68K, and x86. An x86-specific bus called PC/104, based on the 104-pin ISA bus,
has become popular in recent years due to the ready availability of low-cost
components from the PC industry. PC/104 cards are much more compact at 9 x
9.5cm than VME cards, and unlike a VME passive backplane-based system can
provide a complete system on a single card [39]. PC/104-
Plus, an extension to PC/104, adds a 120- pin PCI connector alongside the
existing ISA one, but is otherwise mostly identical to PC/104 [40
]
In addition to PC/104 there are a number of functionally identical systems with
slightly different form factors, of which the most common is the biscuit PC, a
card the same size as a 3½" or occasionally 5¼" drive, with a
somewhat less common one being the credit card or SIMM PC roughly the size of a
credit card. A biscuit PC provides most of the functionality and I/O
connectors of a standard PC motherboard, as the form factor shrinks the I/O
connectors do as well so that a SIMM PC typically uses a single enormous edge
connector for all its I/O. In addition to these form factors there also exist
card PC’s (sometimes called slot PC’s), which are biscuit PC’s built as ISA or
(more rarely) PCI-like cards. A typical configuration for a low-end system is
a 5x86/133 CPU (roughly equivalent in performance to a 133 MHz Pentium), 8-16MB
of DRAM, 2-8MB of flash memory emulating a disk drive, and every imaginable
kind of I/O (serial ports, parallel ports, floppy disk, IDE hard drive, IR and
USB ports, keyboard and mouse, and others). High-end embedded systems built
from components designed for laptop use provide about the same level of
performance as a current laptop PC, although their price makes them rather
impractical for use as crypto hardware. To compare this with other well-known
types of crypto hardware, a typical smart card has a 5MHz 8-bit CPU, a few
hundred bytes of RAM, and a few kB of EEPROM, and a Fortezza card has a 10 or
20MHz ARM CPU, 64kB of RAM and 128kB of flash memory/EEPROM.
All of the embedded systems described above represent COTS components available
from a large range of vendors in many different countries, with a corresponding
range of performance and price figures. Alongside the x86-based systems there
also exist systems based on other CPU’s, typically ARM, Dragonball (embedded
Motorola 68K), and to a lesser extent PowerPC, however these are available from
a limited number of vendors and can be quite expensive. Besides the obvious
factor of system performance affecting the overall price, the smaller form
factors and use of exotic hardware such as non-generic-PC components can also
drive up the price. In general the best price/performance balance is obtained
with a very generic PC/104 or biscuit PC system.
2.2. Coprocessor Firmware
Once the hardware has been selected the next step is to determine what software
to run on it to control it. The coprocessor is in this case acting as a
special-purpose computer system running only the crypto control software, so
that what would normally be thought of as the operating system is acting as the
system firmware, and the real operating system for the device is the crypto
control software. The control software therefore represents an application-
specific operating system, with crypto objects such as encryption contexts,
certificates, and envelopes replacing the user applications which are managed
by conventional OS’s. The differences between a conventional system and the
crypto coprocessor running one typical type of firmware-equivalent OS are shown
in Figure 3.
Figure 3: Conventional system vs. coprocessor system layers
Since the hardware is in effect a general-purpose PC, there’s no need to use a
specialised, expensive embedded or real-time kernel or OS since a general-
purpose OS will function just as well. The OS choice is then something simple
like one of the free or nearly-free embeddable forms of MSDOS [41
][42][43] or an open source operating
system like one of the x86 BSD’s or Linux which can be adapted for use in
embedded hardware. Although embedded DOS is the simplest to get going and has
the smallest resource requirements, it’s really only a bootstrap loader for
real-mode applications and provides very little access to most of the resources
provided by the hardware. For this reason it’s not worth considering except on
extremely low-end, resource-starved hardware (it’s still possible to find
PC/104 cards with 386/40’s on them, although having to drive them with DOS is
probably its own punishment).
A better choice than DOS is a proper operating system which can fully utilise
the capabilities of the hardware. The only functionality which is absolutely
required of the OS is a memory manager and some form of communication with the
outside world. Also useful (although not absolutely essential) is the ability
to store data such as private keys in some form of persistent storage.
Finally, the ability to handle multiple threads may be useful where the device
is expected to perform multiple crypto tasks at once. Apart from the
multithreading, the OS is just acting as a basic resource manager, which is why
DOS could be pressed into use if necessary.
Both FreeBSD and Linux have been stripped down in various ways for use with
embedded hardware [44][45]. There’s not
really a lot to say about the two, both meet the requirements given above, both
are open source systems, and both can use a standard full-scale system as the
development environment — whichever one is the most convenient can be used. At
the moment Linux is a better choice because its popularity means there’s better
support for devices such as flash memory mass storage (relatively speaking, as
the Linux drivers for the most widely-used flash disk are for an old kernel
while the FreeBSD ones are mostly undocumented and rather minimal), so the
coprocessor described here uses Linux as its resource management firmware. A
convenient feature which gives the free Unixen an extra advantage over
alternatives like embedded DOS is that they’ll automatically switch to using
the serial port for their consoles if no video drivers and/or hardware are
present, which enables them to be used with cheaper embedded hardware which
doesn’t require additional video circuitry just for the one-off setup process.
A particular advantage of Linux is that it’ll halt the CPU when nothing is
going on (which is most of the time), greatly reducing coprocessor power
consumption and heat problems.
2.3. Firmware Setup
Setting up the coprocessor firmware involves creating a stripped-down Linux
setup capable of running on the coprocessor hardware. The services required of
the firmware are:
Memory management
Persistent storage services
Communication with the host
Process and thread management (optional)
All newer embedded systems support the M-Systems DiskOnChip (DOC) flash disk,
which emulates a standard IDE hard drive by identifying itself as a BIOS
extension during the system initialisation phase (allowing it to install a DOC
filesystem driver to provide BIOS support for the drive) and later switching to
a native driver for OS’s which don’t use the BIOS for hardware access [46]. The first step in installing the firmware involves formatting
the DOC as a standard hard drive and partitioning it prior to installing Linux.
The DOC is configured to contain two partitions, one mounted read-only which
contains the firmware and crypto control software, and one mounted read/write
with additional safety precautions like noexec
and nosuid, for storage of configuration
information and encrypted keys.
The firmware consists of a basic Linux kernel with every unnecessary service
and option stripped out. This means removing support for video devices, mass
storage (apart from the DOC and floppy drive), multimedia devices, and other
unnecessary bagatelles. Apart from the TCP/IP stack needed by the crypto
control software to communicate with the host, there are no networking
components running (or even present) on the system, and even the TCP/IP stack
may be absent if alternative means of communicating with the host (explained in
more detail further on) are employed. All configuration tasks are performed
through console access via the serial port, and software is installed by
connecting a floppy drive and copying across pre-built binaries. This both
minimises the size of the code base which needs to be installed on the
coprocessor, and eliminates any unnecessary processes and services which might
constitute a security risk. Although it would be easier if we provided a means
of FTP’ing binaries across, the fact that a user must explicitly connect a
floppy drive and mount it in order to change the firmware or control software
makes it much harder to accidentally (or maliciously) move problematic code
across to the coprocessor, provides a workaround for the fact that FTP over
alternative coprocessor communications channels such as a parallel port is
tricky without resorting to the use of even more potential problem software,
and makes it easier to comply with the FIPS 140 requirements that (where a non-
Orange Book OS is used) it not be possible for extraneous software to be loaded
and run on the system. Direct console access is also used for other operations
such as setting the onboard real-time clock, which is used to add timestamps to
signatures. Finally, all paging is disabled, both because it isn’t needed or
safe to perform with the limited-write-cycle flash disk, and because it avoids
any risk of sensitive data being written to backing store, eliminating a major
headache which occurs with all virtual-memory operating systems [
47].
At this point we have a basic system consisting of the underlying hardware and
enough firmware to control it and provide the services we require. Running on
top of this will be a daemon which implements the crypto control software which
does the actual work.
3. Crypto Functionality Implementation
Once the hardware and functionality level of the coprocessor have been
established, we need to design an appropriate programming interface for it. An
interface which employs complex data structures, pointers to memory locations,
callback functions, and other such elements won’t work with the coprocessor
unless a complex RPC mechanism is employed. Once we get to this level of
complexity we run into problems both with lowered performance due to data
marshalling and copying requirements and potential security problems arising
from inevitable implementation bugs.
Figure 4: cryptlib architecture
A better type of interface is the one used in the cryptlib security
architecture [48] which is depicted in Figure 4. cryptlib
implements an object- based design which assigns unique handles to crypto-
related objects but hides all further object details inside the architecture.
Objects are controlled through messages sent to them under the control of a
central security kernel, an interface which is ideally suited for use in a
coprocessor since only the object handle (a small integer value) and one or two
arguments (either an integer value or a byte string and string length) are
needed to perform most operations. This use of only basic parameter types
leads to a very simple and lightweight interface, with only the integer values
needing any canonicalisation (to network byte order) before being passed to the
coprocessor. A coprocessor call of this type, illustrated in Figure 5,
requires only a few lines of code more than what is required for a direct call
to the same code on the host system. In practice the interface is further
simplified by using a pre-encoded template containing all fixed parameters (for
example the type of function call being performed and a parameter count),
copying in any variable parameters (for example the object handle) with
appropriate canonicalistion, and dispatching the result to the coprocessor. The
coprocessor returns results in the same manner.
Figure 5: Communicating with the coprocessor
3.1. Communicating with the Coprocessor
The next step after designing the programming interface is to determine which
type of communications channel is best suited to controlling the coprocessor.
Since the embedded controller hardware is intended for interfacing to almost
anything, there are a wide range of I/O capabilities available for
communicating with the host. Many embedded controllers provide an ethernet
interface either standard or as an option, so the most universal interface uses
TCP/IP for communications. For card PC’s which plug into the hosts backplane
we should be able to use the system bus for communications, and if that isn’t
possible we can take advantage of the fact that the parallel ports on all
recent PC’s provide sophisticated (for what was intended as a printer port)
bidirectional I/O capabilities and run a link from the parallel port on the
host motherboard to the parallel port on the coprocessor. Finally, we can use
more exotic I/O capabilities such as USB to communicate with the coprocessor.
The most universal coprocessor consists of a biscuit PC which communicates with
the host over ethernet (or, less universally, a parallel port). One advantage
which an external, removable coprocessor of this type has over one which plugs
directly into the host PC is that it’s very easy to unplug the entire crypto
subsystem and store it separately from the host, moving it out of reach of any
covert access by outsiders while the owner of the system is away. In addition
to the card itself, this type of standalone setup requires a case and a power
supply, either internal to the case or an external wall-wart type (these are
available for about $10 with a universal input voltage range which allows them
to work in any country). The same arrangement is used in a number of
commercially-available products, and has the advantage that it interfaces to
virtually any type of system, with the commensurate disadvantage that it
requires a dedicated ethernet connection to the host (which typically means
adding an extra network card), as well as adding to the clutter surrounding the
machine.
The alternative option for an external coprocessor is to use the parallel port,
which doesn’t require a network card but does tie up a port which may be
required for one of a range of other devices such as external disk drives, CD
writers, and scanners which have been kludged onto this interface alongside the
more obvious printers. Apart from its more obvious use, the printer port can
be used either as an Enhanced Parallel Port (EPP) or as an Extended Capability
Port (ECP) [49]. Both modes provide about 1-2 MB/s data
throughput (depending on which vendors claims are to be believed) which
compares favourably with a parallel port’s standard software-intensive maximum
rate of around 150 kB/s and even with the throughput of a 10Mbps ethernet
interface. EPP was designed for general-purpose bidirectional communication
with peripherals and handles intermixed read and write operations and block
transfers without too much trouble, whereas ECP (which requires a DMA channel
which can complicate the host system’s configuration process) requires complex
data direction negotiation and handling of DMA transfers in progress, adding a
fair amount of overhead when used with peripherals which employ mixed reading
and writing of small data quantities. Another disadvantage of DMA is that its
use paralyses the CPU by seizing control of the bus, halting all threads which
may be executing while data is being transferred. Because of this the optimal
interface mechanism is EPP. From a programming point of view, this
communications mechanism looks like a permanent virtual circuit which is
functionally equivalent to the dumb wire which we’re using the ethernet link
as, so the two can be interchanged with a minimum of coding effort.
To the user, the most transparent coprocessor would consist of some form of
card PC which plugs directly into their system’s backplane. Currently
virtually all card PC’s have ISA bus interfaces (the few which support PCI use
a PCI/ISA hybrid which won’t fit a standard PCI slot [50])
which unfortunately doesn’t provide much flexibility in terms of communications
capabilities since the only viable means of moving data to and from the
coprocessor is via DMA, which requires a custom kernel-mode driver on both
sides. The alternative, using the parallel port, is much simpler since most
operating systems already support EPP and/or ECP data transfers, but comes at
the expense of a reduced data transfer rate and the loss of use of the parallel
port on the host. Currently the use of either of these options is rendered moot
since the ISA card PC’s assume they have full control over a passive-backplane-
bus system, which means they can’t be plugged into a standard PC which contains
its own CPU which is also assuming that it solely controls the bus. It’s
possible that in the future card PC’s which function as PCI bus devices will
appear, but until they do it’s not possible to implement the coprocessor as a
plug-in card without using a custom extender card containing an ISA or PCI
connector for the host side, a PC104 connector for a PC104-based CPU card, and
buffer circuitry in between to isolate the two buses. This destroys the COTS
nature of the hardware, limiting availability and raising costs.
The final communications option uses more exotic I/O capabilities such as USB
which are present on newer embedded systems, these are much like ethernet but
have the disadvantage that they are currently rather poorly supported by most
operating systems.
Since we’re using Linux as the resource manager for the coprocessor hardware,
we can use a multithreaded implementation of the coprocessor software to handle
multiple simultaneous requests from the host. After initialising the various
cryptlib subsystems, the control software creates a pool of threads which wait
on a mutex for commands from the host. When a command arrives, one of the
threads is woken up, processes the command, and returns the result to the host.
In this manner the coprocessor can have multiple requests outstanding at once,
and a process running on the host won’t block whenever another process has an
outstanding request present on the coprocessor.
3.2. Open vs Closed-source Coprocessors
There are a number of vendors who sell various forms of tier 2 coprocessor, all
of which run proprietary control software and generally go to some lengths to
ensure that no outsiders can ever examine it. The usual way in which vendors
of proprietary implementations try to build the same user confidence in their
product as would be provided by having the source code and design information
available for public scrutiny is to have it evaluated by independent labs and
testing facilities, typically to the FIPS 140 standard when the product
constitutes crypto hardware (the security implications of open source vs
proprietary implementations have been covered exhaustively in various fora and
won’t be repeated here). Unfortunately this process leads to prohibitively
expensive products (thousands to tens of thousands of dollars per unit) and
still requires users to trust the vendor not to insert a backdoor, or
accidentally void the security via a later code update or enhancement added
after the evaluation is complete (strictly speaking such post-evaluation
changes would void the evaluation, but vendors sometimes forget to mention this
in their marketing literature). There have been numerous allegations of the
former occurring [51][52][53
], and occasional reports of the latter.
In contrast, an open source implementation of the crypto control software can
be seen to be secure by the end user with no degree of blind trust required.
The user can (if they feel so inclined) obtain the raw coprocessor hardware
from the vendor of their choice in the country of their choice, compile the
firmware and control software from the openly-available source code, and
install it knowing that no supplemental functionality known only to a few
insiders exists. For this reason the entire suite of coprocessor control
software is available in source code form for anyone to examine, build, and
install as they see fit.
A second, far less theoretical advantage of an open-source coprocessor is that
until the crypto control code is loaded into it, it isn’t a controlled
cryptographic item as crypto source code and software aren’t controlled in most
of the world. This means that it’s possible to ship the hardware and software
separately to almost any destination (or source it locally) without any
restrictions and then combine the two to create a controlled item once they
arrive at their destination (like a two-component glue, things don’t get sticky
until you mix the parts).
4. Extended Security Functionality
The basic coprocessor design presented so far serves to move all security-
related processing and cryptovariables out of reach of hostile software, but by
taking advantage of the capabilities of the hardware and firmware used to
implement it, it’s possible to do much more. One of the features of the
cryptlib architecture is that all operations are controlled and monitored by a
central security kernel which enforces a single, consistent security policy
across the entire architecture. By tying the control of some of these
operations to features of the coprocessor, it’s possible to obtain an extended
level of control over its operation as well as avoiding some of the problems
which have traditionally plagued this type of security device.
4.1. Controlling Coprocessor Actions
The most important type of extra functionality which can be added to the
coprocessor is extended failsafe control over any actions it performs. This
means that instead of blindly performing any action requested by the host
(purportedly on behalf of the user), it first seeks confirmation from the user
that they have indeed requested that the action be taken. The most obvious
application of this mechanism is for signing documents where the owner has to
indicate their consent through a trusted I/O path rather than allowing a rogue
application to request arbitrary numbers of signatures on arbitrary documents.
This contrasts with other tier 1 and 2 processors which are typically enabled
through user entry of a PIN or password, after which they are at the mercy of
any commands coming from the host. Apart from the security concerns, the
ability to individually control signing actions and require conscious consent
from the user means that the coprocessor provides a mechanism required by a
number of new digital signature laws which recognise the dangers inherent in
systems which provide an automated (that is, with little control from the user)
signing capability.
Figure 6: Normal message processing
The means of providing this service is to hook into the cryptlib kernel’s sign
action and decrypt action processing mechanisms. In normal processing the
kernel receives the incoming message, applies various security-policy-related
checks to it (for example it checks to ensure that the object’s ACL allows this
type of access), and then forwards the message to the intended target, as shown
in Figure 6. In order to obtain additional confirmation that the action is to
be taken, the coprocessor can indicate the requested action to the user and
request additional confirmation before passing the message on. If the user
chooses to deny the request or doesn’t respond within a certain time, the
request is blocked by the kernel in the same manner as if the objects ACL
didn’t allow it, as shown in Figure 7. This mechanism is similar to the
command confirmation mechanism in the VAX A1 security kernel, which takes a
command from the untrusted VMS or Ultrix-32 OS’s running on top of it, requests
that the user press the (non-overridable) secure attention key to communicate
directly with the kernel and confirm the operation ("Something claiming to
be you has requested X. Is this OK?"), and then returns the user
back to the OS after performing the operation [54].
Figure 7: Processing with user confirmation
The simplest form of user interface involves two LED’s and two pushbutton
switches connected to a suitable port on the coprocessor (for example the
parallel port or serial port status lines). An LED is activated to indicate
that confirmation of a signing or decryption action is required by the
coprocessor. If the user pushes the confirmation button, the request is
allowed through, if they push the cancel button or don’t respond within a
certain time, the request is denied.
4.2. Trusted I/O Path
The basic user confirmation mechanism presented above can be generalised by
taking advantage of the potential for a trusted I/O path which is provided by
the coprocessor. The main use for a trusted I/O path is to allow for secure
entry of a password or PIN used to enable access to keys stored in the
coprocessor. Unlike typical tier 1 devices which assume the entire device is
secure and use a short PIN in combination with a retry counter to protect
cryptovariables, the coprocessor makes no assumptions about its security and
instead relies on a user-supplied password to encrypt all cryptovariables held
in persistent storage (the only time keys exist in plaintext form is when
they’re decrypted to volatile memory prior to use). Because of this, a simple
numeric keypad used to enter a PIN isn’t sufficient (unless the user enjoys
memorising long strings of digits for use as passwords). Instead, the
coprocessor can optionally make use of devices such as PalmPilots for password
entry, perhaps in combination with novel password entry techniques such as
graphical passwords [55]. Note though that, unlike a tier 0
crypto implementation, obtaining the user password via a keyboard sniffer on
the host doesn’t give access to private keys since they’re held on the
coprocessor and can never leave it, so that even if the password is compromised
by software on the host, it won’t provide access to the keys.
In a slightly more extreme form, the ability to access the coprocessor via
multiple I/O channels allows us to enforce strict red/black separation, with
plaintext being accessed through one I/O channel, ciphertext through another,
and keys through a third. Although cryptlib doesn’t normally load plaintext
keys (they’re generated and managed internally and can never pass outside the
security perimeter), when the ability to load external keys is required FIPS
140 mandates that they be loaded via a separate channel rather than over the
one used for general data, which can be provided for by loading them over a
separate channel such as a serial port (a number of commercial crypto
coprocessors come with a serial port for this reason).
4.3. Physically Isolated Crypto
It has been said that the only truly tamperproof computer hardware is Voyager
2, since it has a considerable air gap (strictly speaking a non-air gap) which
makes access to the hardware somewhat challenging (space aliens
notwithstanding). We can take advantage of air-gap security in combination
with cryptlib’s remote-execution capability by siting the hardware performing
the crypto in a safe location well away from any possible tampering. For
example by running the crypto on a server in a physically secure location and
tunneling data and control information to it via its built-in ssh or SSL
capabilities, we obtain the benefits of physical security for the crypto
without the awkwardness of having to use it from a secure location or the
expense of having to use a physically secure crypto module (the implications of
remote execution of crypto from a country like China with keys and crypto held
in Europe or the US are left as an exercise for the reader).
Physical isolation at the macroscopic level is also possible due to the fact
that cryptlib employs a separation kernel for its security [56
][57], which allows different object types (and, at the
most extreme level, individual objects) to be implemented in physically
separate hardware. For those requiring an extreme level of isolation and
security, it should be possible to implement the different object types in
their own hardware, for example keyset objects (which don’t require any real
security since certificates contain their own tamper protection) could be
implemented on the host PC, the kernel (which requires a minimum of resources)
could be implemented on a cheap ARM-based plug-in card, envelope objects (which
can require a fair bit of memory but very little processing power) could be
implemented on a 486 card with a good quantity of memory, and encryption
contexts (which can require a fair amount of CPU power but little else) could
be implemented using a faster Pentium-class CPU. In practice though it’s
unlikely that anyone would consider this level of isolation worth the expense
and effort.
5. Crypto Hardware Acceleration
So far the discussion of the coprocessor has focused on the security and
functionality enhancements it provides, avoiding any mention of performance
concerns. The reason for this is that for the majority of users the
performance is good enough, meaning that for typical applications such as email
encryption, web browsing with SSL, and remote access via ssh, the presence of
the coprocessor is barely noticeable since the limiting factors on performance
are set by network bandwidth, disk access times, modem speed, bloatware running
on the host system, and so on. Although never intended for use as a special-
purpose crypto accelerator of the type capable of performing hundreds of RSA
operations per second on behalf of a heavily-loaded web server, it is possible
to add extra functionality to the coprocessor through its built-in PC104 bus to
extend its performance. By adding a PC104 daughterboard to the device, it’s
possible to enhance its functionality or add new functionality in a variety of
ways, as explained below (although the prices quoted for devices will change
over time, the price ratios should remain relatively constant).
5.1. Conventional Encryption/Hashing
Implementing an algorithm like DES which was originally targeted at hardware
implementation, in a field-programmable gate array (FPGA) is relatively
straightforward, and hash algorithms like MD5 and SHA-1 can also be implemented
fairly easily in hardware by implementing a single round of the algorithm and
cycling the data through it the appropriate number of times. Using a low-cost
FPGA, it should be possible to build a daughterboard which performs DES and
MD5/SHA-1 acceleration for around $50. Unfortunately, a number of hardware and
software issues conspire to make this non-viable economically. The main
problem is that although DES is faster to implement in hardware than in
software, most newer algorithms are much more efficient in software (ones with
large, key-dependent S-boxes are particularly difficult to implement in FPGA’s
because they require huge numbers of logic cells, requiring very expensive
high-density FPGA’s). A related problem is the fact that in many cases the CPU
on the coprocessor is already capable of saturating the I/O channel
(ethernet/ECP/EPP/PC104) using a pure software implementation, so there’s
nothing to be gained by adding expensive external hardware (all of the
software-optimised algorithms run at several MB/s whereas the I/O channel is
only capable of handling around 1MB/s). The imbalance becomes even worse when
any CPU faster than the entry-level 5x86/133 configuration is used, since at
this point any common algorithm (even the rather slow triple DES) can be
executed more quickly in software than the I/O channel can handle. Because of
this it doesn’t seem profitable to try to augment software-based conventional
encryption or hashing capabilities with extra hardware.
5.2. Public-key Encryption
Public-key algorithms are less amenable to implementation in general-purpose
CPU’s than conventional encryption and hashing algorithms, so there’s more
scope for hardware acceleration in this area. We have two options for
accelerating public-key operations, either using an ASIC from a vendor or
implementing our own version with an FPGA. Bignum ASIC’s are somewhat thin on
the ground since the vendors who produce them usually use them in their own
crypto products and don’t make them available for sale to the public, however
there is one company who specialise in ASIC’s rather than crypto products who
can supply a bignum ASIC (it’s also possible to license bignum cores and
implement the device yourself, this option is covered peripherally in the next
section). Using this device, the PCC201 [58], it’s possible
to build a bignum acceleration daughterboard for around $100.
Unfortunately, the device has a number of limitations. Although impressive
when it was first introduced, the maximum key size of 1024 bits and maximum
throughput of 21 operations/s for 1024-bit keys and 74 operations/s for 512-bit
keys compares rather poorly with software implementations on newer Pentium-
class CPU’s, which can achieve the same performance with a CPU speed of around
200MHz. This means that although one of these devices would serve to
accelerate performance on a coprocessor based on the entry-level 5x86/133
hardware, a better way to utilise the extra expense of the daughterboard would
be to buy the next level up in coprocessor hardware, giving somewhat better
bignum performance and accelerating all other operations as well as a free
side-effect (the entry level for Pentium-class cards is one containing a 266MHz
Cyrix MediaGX, although it may be possible to put together an even cheaper one
using a bare card and populating it with an AMD K6/266, currently selling for
around $30). A second disadvantage of the PCC201 is that it’s made available
under peculiar export control terms which can make it cumbersome (or even
impossible) to obtain for anyone who isn’t a large company.
An alternative to using an ASIC is to implement our own bignum accelerator with
an FPGA, with the advantage that we can make it as fast as required (within the
limits of the available hardware). Again, there is the problem that much of
the published work in the area of bignum accelerator design is by crypto
hardware vendors who don’t make the details available, however there is one
reasonably fast implementation which achieves 83 operations/s for 1024-bit keys
and 340 operations/s for 512-bit keys using a total of 6,700 FPGA basic cells
(configurable logic blocks or CLB’s) [59]. The use of such a
large number of CLB’s requires the use of very high-density FPGA’s, of which
the most widely- used representative is the Xilinx XC4000 family
[60]. The cheapest available FPGA capable of implementing
this design, the XC40200, comes with a pre-printed mortgage application form
and a $2000-$2500 price tag (depending on speed grade and quantity), providing
a clue as to why the design has to date only been implemented on a simulator.
Again, it’s possible to buy an awful lot of CPU power for the same amount of
money (an equivalent level of performance to the FPGA design is obtainable
using about $200 worth of AMD Athlon CPU [61]).
This illustrates a problem faced by all hardware crypto accelerator vendors,
which may be stated as a derivation of Moore’s law: Intel can make it faster
cheaper than you can. In other words, putting a lot of effort into designing
an ASIC for a crypto accelerator is a risky investment because, aside from the
usual flexibility problems caused by the use of an ASIC, it’ll be rendered
obsolete by general-purpose CPU’s within a few years. This problem is
demonstrated by several products currently sold as crypto hardware accelerators
which in fact act as crypto handbrakes since, when plugged in or enabled,
performance slows down.
For pure acceleration purposes, the optimal price/performance tradeoff appears
to be to populate a daughterboard with a collection of cheap CPU’s attached to
a small amount of memory and just enough glue logic to support the CPU (this
approach is used by nCipher, who use a cluster of ARM CPU’s in their SSL
accelerators [62]). The mode of operation of this CPU farm
would be for the crypto coprocessor to halt the CPU’s, load the control
firmware (a basic protected-mode kernel and appropriate code to implement the
required bignum operation(s)) into the memory, and restart the CPU running as a
special-purpose bignum engine. For x86 CPU’s, there are a number of very
minimal open-source protected-mode kernels which were originally designed as
DOS extenders for games programming available, these ignore virtual memory,
page protection, and other issues and run the CPU as if it were very fast a 32-
bit real-mode 8086. By using a processor like a K6-2 3D/333 (currently selling
for around $35) which contains 32+32K of onboard cache, the control code can be
loaded initially from slow, cheap external memory but will execute from cache
at the full CPU speed from then on. Each of these dedicated bignum units
should be capable of ~200 512-bit RSA operations per second at a cost of around
$100 each.
Unfortunately the use of commodity x86 CPU’s of this kind has several
disadvantages. The first is that they are designed for use in systems with a
certain fixed configuration (for example SDRAM, PCI and AGP busses, a 64-bit
bus interface, and other high-performance options) which means that using them
with a single cheap 8-bit memory chip requires a fair amount of glue logic to
fake out the control signals from the external circuitry which is expected to
be present. The second problem is that these CPU’s consume significant amounts
of power and dissipate a large amount of heat, with current drains of 10-15A
and dissipations of 20-40W being common for the range of low-end processors
which might be used as cheap accelerator engines. Adding more CPU’s to improve
performance only serves to exacerbate this problem, since the power supplies
and enclosures designed for embedded controllers are completely overwhelmed by
the requirements of a cluster of these CPU’s. Although the low-cost processing
power offered by general-purpose CPU’s appears to make them ideal for this
situation, the practical problems they present rules them out as a solution.
A final alternative is offered by digital signal processors (DSP’s), which
require virtually no external circuitry since most newer ones contain enough
onboard memory to hold all data and control code, and don’t expect to find
sophisticated external control logic present. The fact that DSP’s are
optimised for embedded signal-processing tasks makes them ideal for use as
bignum accelerators, since a typical configuration contains two 32-bit single-
cycle multiply-accumulate (MAC) units which provide in one instruction the most
common basic operation used in bignum calculations. The best DSP choice
appears to be the ADSP-21160, which consumes only 2 watts and contains built-in
multiprocessor support allowing up to 6 DSP’s to be combined into one cluster
[63]. The aggregate 3,600 MFLOPS processing power provided
by one of these clusters should prove sufficient (in its integer equivalent) to
accelerate bignum calculations. The feasibility of using DSP’s as low-cost
accelerators is currently under consideration and may be the subject of a
future paper.
5.3. Other Functionality
In addition to pure acceleration purposes, it’s possible to use a PC104 add-on
card to handle a number of other functions. The most important of these is a
hardware random number generator (RNG), since the effectiveness of the standard
entropy-polling RNG using by cryptlib [64] is somewhat
impaired by its use in an embedded environment. A typical RNG would take
advantage of several physical randomness sources (typically thermal noise in
semiconductor junctions) fed into a Schmitt trigger with the output mixed into
the standard cryptlib RNG. The use of multiple independent sources ensures that
even if one fails the others will still provide entropy, and feeding the RNG
output into the cryptlib PRNG ensures that any possible bias is removed from
the RNG output bits.
A second function which can be performed by the add-on card is to act as a more
general I/O channel than the basic LED-and-pushbutton interface described
earlier, providing the user with more information (perhaps via an LCD display)
on what it is they’re authorising.
6. Conclusion
This paper has presented a design for an inexpensive, general-purpose crypto
coprocessor capable of keeping crypto keys and crypto processing operations
safe even in the presence of malicious software on the host which it is
controlled from. Extended security functionality is provided by taking
advantage of the presence of trusted I/O channels to the coprocessor. Although
sufficient for most purposes, the coprocessors processing power may be
augmented through the addition of additional modules based on DSP’s which
should bring the performance into line with considerably more expensive
commercial equivalents. Finally, the open-source nature of the design and use
of COTS components means that anyone can easily reassure themselves of the
security of the implementation and can obtain a coprocessor in any required
location by refraining from combining the hardware and software components
until they’re at their final destination.
Acknowledgements
The author would like to thank Paul Karger, Sean Smith, Brian Oblivion, Jon
Tidswell, Steve Weingart, Chris Zimman, and the referees for their feedback and
comments on this paper.
This paper was originally published in the
Proceedings of the 9th USENIX Security Symposium,
August 14-17, 2000, Denver, Colorado, USA
Last changed: 29 Jan. 2002 ml