on reliability

Security and Reliability


by John Sellens
<jsellens@gnac.com>

John Sellens is part of the GNAC Canada team in Toronto, and proud to be husband to one and father to two.




I will attempt to provide an overview of how "security" contributes to the reliability of your systems (and, I hope, show how a lack of security decreases reliability). Some of the topics covered relate to discussions in the previous articles in this series; this is an attempt to gather everything related to security and reliability in one place, which should (in theory) make it easier to see the big picture.

Security is a wide-ranging and sometimes poorly defined topic. "Computer people" often (incorrectly) think of security as being related only to things that you can do with a keyboard, a computer, random bits and bytes, and someone else's password. Accordingly, I will attempt to summarize what security is, or at least what it is in the context of this article. The relevant security-related elements I will cover are:

Access control — passwords and other mechanisms that attempt to require some level of authentication and authorization for access to your networks and systems (i.e., how to know when to open the barn door).

Physical security — protection against physical attacks and "acts of God."

Intrusion detection — how to detect when someone unexpected has entered through an open or insufficiently closed barn door.

Correction — fixing things when they break, which includes "remotivating" individuals when they act inconsistently with what is expected.

Change management — to ensure that changes are appropriate and have been subjected to the appropriate review and approval.

Security is not just "prevention" — it's prevention, detection, control, and correction. And when you have all those, you have (of course) a more reliable system. Let's review each of the five elements in turn.

Access Control — Authentication and Authorization

Please allow me to be ridiculous for a moment: If you have no access control, anyone can do anything to your systems, and so they are almost by definition unreliable. And you're similarly exposed even if you have good access control but have no authorization mechanism that limits what different users can do.

I'm going to discuss access control in two parts, authentication and authorization. I'll further subdivide the discussion of authorization into logical access restrictions, physical access restrictions, and activity restrictions.

Authentication

Authentication is what we are all (I hope) familiar with — some form of userid and password pair that "proves" that the user is who he or she claims to be. In theory, both the userid and password could change with time, but the most common implementations involve a publicly known userid and a static password. (I'll define a static password as one that stays the same until it is explicitly changed, typically by the user.)

Static passwords are most commonly stored on the destination machine (or network of machines), typically in an encrypted or hashed form to prevent the casual browsing of passwords. The most commonly used "more secure" static-password mechanism is Kerberos, in which the passwords are stored on a "secure" server, and the protocol protects the passwords as various clients and servers authenticate users, so they need not cross the network in cleartext. One problem with this is the weakest-link problem — you need to have Kerberos locally available on every device, or you risk sending your password over some connection in cleartext, which limits the effectiveness of Kerberos in those environments. (You can sometimes secure otherwise cleartext links with ssh encryption, but then you need to have ssh on the local box, which is the same problem but with a different piece of software.)
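The "encrypted or hashed form" mentioned above is a one-way transformation: the system stores a salted digest of the password rather than the password itself, and verifies a login attempt by recomputing the digest. A minimal sketch in Python — the salting scheme and "salt$digest" storage format here are illustrative assumptions, not the UNIX crypt(3) format:

```python
import hashlib
import os

def hash_password(password, salt=None):
    """Store 'salthex$digest' so the cleartext password is never kept on disk."""
    salt = salt if salt is not None else os.urandom(8)
    digest = hashlib.sha256(salt + password.encode()).hexdigest()
    return salt.hex() + "$" + digest

def check_password(password, stored):
    """Recompute the digest with the stored salt and compare."""
    salt_hex, digest = stored.split("$")
    candidate = hashlib.sha256(bytes.fromhex(salt_hex) + password.encode()).hexdigest()
    return candidate == digest

stored = hash_password("s3cret")
print(check_password("s3cret", stored))   # True
print(check_password("guessed", stored))  # False
```

The salt ensures that two users with the same password get different stored values, which defeats simple precomputed-dictionary browsing of the password file.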

More advanced (i.e., obfuscated or annoying) systems use some form of one-time-password (OTP) system to guard against password eavesdropping (over the shoulder or over the network) and password sharing. Some OTP systems are software-only (such as S/KEY[1]), but the more common approach requires the use of some form of token, which computes or reports the next password in the series; the most commonly used token is probably SecurID from Security Dynamics. These systems have some mechanism to guard against password reuse and typically also require a secret PIN in addition to the token in order to authenticate.
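The idea behind a software OTP scheme like S/KEY can be sketched with a hash chain: the server stores only the Nth hash of a secret, and each successful login reveals the previous hash in the chain, which is useless to an eavesdropper because it cannot be replayed. This is an illustrative sketch of the technique only (using SHA-256, not the hashes or encodings the real S/KEY uses):

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def make_chain(secret, n):
    """Apply the hash n times; this is what the client computes per login."""
    value = secret
    for _ in range(n):
        value = h(value)
    return value

class OTPServer:
    def __init__(self, secret, n):
        # The server stores only the end of the chain, never the secret.
        self.current = make_chain(secret, n)

    def authenticate(self, otp):
        if h(otp) == self.current:
            self.current = otp   # each password is valid exactly once
            return True
        return False

server = OTPServer(b"correct horse", 100)
otp99 = make_chain(b"correct horse", 99)   # client's next one-time password
print(server.authenticate(otp99))          # True
print(server.authenticate(otp99))          # False: replay is rejected
```

Note that verification needs only the hash function, so a stolen server database reveals nothing about the next valid password.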

The reason for an authentication mechanism is to identify the user, and the more reliable the authentication mechanism, the more reliable your overall system is going to be, because you have a better "front-door" defense to protect you from the unreliable among us. I'm a big fan of one-time passwords — only the most rudimentary of systems would not benefit from the use of OTPs, even if used only for authenticating the more privileged users or for granting root (or equivalent) access.

Authorization

The next element in an access-control system is what I will refer to as "logical access restrictions." Logical access restrictions are those that are based on such things as the originating network address of a connection, time of day, or current usage rules. The most common way to implement originating network restrictions (in the UNIX world at least) is through the use of the "TCP wrapper"[2] package, which makes it easy to "wrap" certain services (such as telnet) with an access-control program that can restrict on the basis of network address, etc. The other logical restrictions are more commonly implemented with certain operating-system configurations, custom shells, or commercial software. Restrictions that you might want or need to implement include:

  • No multiple logins — You may wish to limit concurrent access to particular applications or systems for reasons of system load, security, or (business) process control.

  • No logins from multiple (apparent) locations — You may wish to prevent users from being in two places at once, primarily for security reasons. If a user is in your office, working away, it might be safe to conclude that a login attempt from halfway around the world was not actually the same person who is authorized to use your system. Of course, a connection from two different locations doesn't automatically mean that it's two different people, since the person could have connected to the remote machine over the network before connecting back, but it would mean your traffic might be taking a long, possibly exposed route, which probably isn't a good thing either.

  • No off-hours connections — It's probably not reasonable (or expected) for an ordinary accounts-payable clerk to be doing a check-printing run at 2:00 am Sunday morning; you might want to use "off-hours" restrictions to prevent that before it happens.

  • No connections during maintenance periods — Sometimes you need the machine to be up and running but don't want to allow (ordinary) users to sign on. Allowing users on during system maintenance can sometimes just make things more complicated. One classic example is doing full backups to a network backup server in preparation for an OS upgrade. On UNIX, this can sometimes be accomplished with the /etc/nologin file (but often not).

  • Peak-load restrictions — You might wish to refuse additional logins if the system load (however that is calculated) is above some threshold. This can help avoid making an already bad situation even worse.

As a more concrete example, I'll mention that I once implemented an access-control system that made use of almost all those restrictions. The system was a job database for university students, serving several thousand students each term, and the usage was very "peaky" — there were a few periods of very high demand and long periods of limited use. We wanted to make sure that each student was using only one session at a time and that we didn't let too many simultaneous users on (because otherwise the system would crawl to a halt); and we wanted to be able to prevent user signons during the daily update window and during periods of "emergency" maintenance. We did all that with a simple shell script, which we used as each student's login shell, proving (once again) that complicated solutions are not always required.
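The kinds of checks that login shell performed can be sketched in a few lines. The session limit, update window, and nologin-flag path below are illustrative assumptions, not the original script:

```python
import os
from datetime import datetime, time

MAX_SESSIONS = 200                        # assumed capacity limit
UPDATE_WINDOW = (time(2, 0), time(3, 0))  # assumed daily update window
NOLOGIN_FLAG = "/var/run/nologin.jobdb"   # stand-in for /etc/nologin

def may_log_in(user, active_sessions, now):
    """Return (allowed, reason) for a login attempt."""
    if os.path.exists(NOLOGIN_FLAG):
        return False, "emergency maintenance in progress"
    if UPDATE_WINDOW[0] <= now.time() < UPDATE_WINDOW[1]:
        return False, "daily update window"
    if user in active_sessions:
        return False, "already signed on elsewhere"
    if len(active_sessions) >= MAX_SESSIONS:
        return False, "system at capacity; please try again later"
    return True, "ok"

afternoon = datetime(1999, 11, 16, 14, 30)
print(may_log_in("student1", {"student2"}, afternoon))  # (True, 'ok')
print(may_log_in("student2", {"student2"}, afternoon))  # refused: duplicate session
```

A login shell built around checks like these can simply print the reason and exit before the application proper ever starts.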

It's probably worth mentioning that mechanisms and policies like this have a long history in the mainframe world.

Complementary to logical access restrictions are (of course) physical access restrictions. Sometimes you wish to allow access only to a particular system, application, or function if the user is (thought to be) in a "secure" location. For UNIX systems the most common example of this kind of restriction is to allow direct root logins only from the system console. Other examples include allowing connections only from within your building, enforced either through the use of hardwired connections (almost unheard of in these days of networked workstations), subnets and firewalls, or simply not allowing any connections (network or dialup) to and from the outside.

The final component of access control that I am going to cover is what I'll call "activity restrictions" — restrictions or limits on the commands and functions that a user can invoke. These come into effect once your authentication system has identified a particular user, and the user (or the user's connection) has passed any logical or physical access restrictions that have been implemented.

One of the most common (UNIX) examples of activity restrictions is the requirement that a user be a member of a certain group (often "wheel" or group 0) in order to "su" to root. Lots of other examples of group- or ACL-based restrictions exist. Other restrictions can be implemented by applications, using compiled-in information (bad) or files or database entries with restriction or permission information.

I suggest dividing activity restrictions into three types:

  • Static — yes/no restrictions, independent of other considerations, such as date or time, other users, etc. Some examples are: group membership requirements, as mentioned above; the prevention of access to a general-purpose environment by such mechanisms as menuing systems; restricted shells.

  • Variable — restrictions that are based on straightforward but varying information, such as date or time, connection origin, etc. Some examples are: no recreational Web sites during business hours; not allowing check printing outside regular hours; no root access if you're not on the local network.

  • Complex — restrictions that are controlled by other events, situations, status, etc. Some examples are: permission granted or denied on the basis of file or database contents; task allowed only at certain steps in a business process; there must be an operator on duty before the operation is allowed.

You will have noticed that the line between activity restrictions and logical and physical access restrictions sometimes gets a little blurry.
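Each of the three types can be illustrated with a one-line predicate; the group names, hours, and conditions below are hypothetical examples:

```python
from datetime import datetime

# Static: a yes/no check, like the wheel-group requirement for "su".
WHEEL = {"root", "alice"}
def may_su(user):
    return user in WHEEL

# Variable: depends on simple but varying facts, such as date and time.
def may_print_checks(now):
    return now.weekday() < 5 and 9 <= now.hour < 17  # weekday business hours only

# Complex: depends on external state and on steps in a business process.
def may_run_payroll(approvals_recorded, operator_on_duty):
    return approvals_recorded and operator_on_duty

print(may_su("alice"), may_su("bob"))                  # True False
print(may_print_checks(datetime(1999, 11, 14, 2, 0)))  # False: 2:00 am on a Sunday
print(may_run_payroll(True, False))                    # False: no operator on duty
```

In practice the three predicates differ mostly in where their inputs come from: a static table, the environment of the request, or the state of some other system.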

I don't claim to have covered all situations here. One obvious situation that's not covered is multiple authentication, where two or more people must agree and authenticate before a task is executed. (Recall those action movies featuring nuclear missile silos where two people have to turn two different keys on opposite sides of the room at the same time, and they're both carrying guns.) And I haven't mentioned the use of biometrics for authentication.

Some of these access-control mechanisms can be quite inconvenient and/or obtrusive. As in most other discussions of reliability, there's a tradeoff between reliability and control on the one hand and cost and inconvenience on the other, and each organization must strike the most appropriate balance for its needs. And I haven't mentioned the need for proper logging, which is a necessity for tracking, troubleshooting, and change control.

And to tie this discussion back to reliability, good access control means that you limit, control, or track who did (or could do) what, when, and under what circumstances. This means that when you determine that certain controls or limits are required to help your systems, networks, and business processes function reliably, you've got (at least part of) the mechanism to help you implement them.

Physical Security

The preceding discussion has focused primarily on electronic access to systems and networks, which is the traditional area of concern for computer-oriented people. But it's just as important to consider the physical security aspects, and again balance the costs (monetary and otherwise) against the expected risks and/or advantages. Note that I'm not talking here about disaster-recovery planning, or high-availability hardware — I'm talking about preventing people (or things) from getting physical access to your premises or equipment.

Why is physical security important? In most cases, physical access to a machine is tantamount to administrator access. In the most extreme cases, a machine (or parts of it) is stolen and attacked at the thief's leisure, whim, or screwdriver. Physical security can also help to guard against so-called acts of God — a more secure building is likely to be stronger and more appropriately located.

What kinds of things should physical security guard against, and how do they contribute to reliability?

  • Equipment theft — If your computers, disks, or network hubs are missing, it's hard to offer a reliable service. It's also becoming fairly common for thieves to steal internal components, such as processors, memory, or disks, which are easier to carry and whose absence is often not immediately obvious.

  • Media theft — Removable media (disks, tapes, cartridges, etc.) can contain very useful information (consider the backups of your customer database), are often easy to carry, and are not likely to be missed very quickly. The business risks here are obvious, but the impact on reliability is less so: stolen confidential information can be used to attack the organization at a later date, and missing backups could make recovery a very painful process.

  • Console or network access — Unauthorized access to console ports or network connections or devices can open you up to all sorts of attacks (such as password sniffing and bug exploitation) that can cause your systems to start behaving unreliably, unintentionally or otherwise.

  • Physical destruction — A fire ax can render even the best systems and equipment somewhat unreliable.

What sort of controls should be considered?

  • Locks — of varying and appropriate levels of sophistication. It's not unusual to start making the locks more difficult and complicated the closer one gets to the important stuff. Consider the use of ordinary keys that can be copied at the local all-night convenience store, high-security keys that require special blanks and machines to duplicate them, magnetic or other types of access-control cards, combination locks, biometric scans, etc. And remember the hardware that the lock cylinders are attached to — a high-security key cylinder on a $2 latch might not be the most prudent way of securing your machine room.

  • Access controls — Electronic logs of who went where and when are handy after the fact and can act as a convenient deterrent; time-of-day restrictions can be used to prevent (or limit) physical access in the middle of the night when no one else is around; and requiring the cooperation of two people to open a particular door all have their place in an access control plan.

  • Structural — Check your walls, ceilings, and floors for ease of access and/or destruction. There's no point in putting a high-security lock on a low-security door and frame.

  • Monitoring — Consider the use of fire and burglar alarms, video monitoring, etc., but make sure that it's hard for an intruder to get at the video tapes and destroy them and the evidence that they contain.

  • More extreme — Some organizations will find it worthwhile to go to greater lengths to secure their premises: armed guards, "man traps" (small rooms with two independently locking doors that you must pass through when entering or leaving), guard dogs, and so on.

Intrusion Detection

The best security system in the world is reduced in its effectiveness if it's not properly monitored. You must have some mechanisms and processes that are designed to detect any intrusions that do take place and, optimally, any attempted intrusions that were blocked by the systems. Proper intrusion-detection systems will alert you when you're under attack and will give you time to increase your awareness or monitoring to fend off any further attacks. For example, if you can detect when a copy of your encrypted passwords has been stolen, you have a better chance of changing all the passwords before they get cracked and exploited, and of blocking the access used to intrude. Quite simply, if you can't detect when something has gone awry, you've got much less chance of protecting yourself and your systems. And if you can't protect the systems, it's going to be harder to keep them working reliably.

Techniques and mechanisms for intrusion detection include:

  • Log review — Review your log files, either manually or using some (well-protected) automatic tools, so that you'll have a better chance of noticing the unexpected when it happens.

  • Realtime alarm monitoring — Page yourself or the security company when alarms get triggered, or have someone on duty onsite. A quick response to an alarm or alert can limit the amount of damage that happens and can also serve as a deterrent if the attacker is simply looking for a fertile playground, not targeting you specifically.

  • Periodic analysis — Tools such as Tripwire[3] can notice when files change unexpectedly.

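The periodic-analysis idea is simple to sketch: record a checksum for every file you care about, keep that baseline somewhere an intruder can't rewrite it, and periodically compare. (This is an illustrative sketch of the technique, not Tripwire itself.)

```python
import hashlib
import os
import tempfile

def snapshot(paths):
    """Map each path to a SHA-256 digest of its contents."""
    db = {}
    for p in paths:
        with open(p, "rb") as f:
            db[p] = hashlib.sha256(f.read()).hexdigest()
    return db

def changed_files(baseline, paths):
    """Return the paths whose current digest differs from the baseline."""
    current = snapshot(paths)
    return [p for p in paths if current.get(p) != baseline.get(p)]

# Demonstration with a throwaway file standing in for a system binary.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "login")
    with open(target, "w") as f:
        f.write("original contents\n")
    baseline = snapshot([target])
    with open(target, "a") as f:
        f.write("trojan horse\n")            # the unexpected change
    print(changed_files(baseline, [target]))  # reports the modified file
```

The weak point, of course, is the baseline database itself: if an intruder can update it to match the modified files, the comparison tells you nothing, which is why such databases belong on read-only or offline media.
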
Correction

Once you've detected an intrusion or attack (or attempted attack), you need a mechanism and process by which you can put things right again, and, optimally, a way to prevent it from happening again. Keep good backups, know where your distribution media is, have documented procedures and mechanisms to get in touch with the necessary people. Keep up to date on vendor updates, notices, and security alerts in the community at large. Be ready to disconnect machines or networks that are under attack or need repair while you investigate and undertake repairs.

The impact on reliability should be clear — a modified machine or system is at risk, and the sooner you can get things back together, the sooner normal, reliable operation can resume.

The other side of correction is "remotivating" individuals who are acting contrary to policy and reasonable standards of behavior. A user or system administrator who behaves incorrectly (let's say by choosing trivial passwords and writing them on notepaper stuck to their monitor) can be putting other users, systems, and information at risk. If you're expecting people to act appropriately, you had better define and publish what the standards of behavior are and be prepared to enforce or explain them.

Change Management

The best-laid plans can be all for naught if there are no controls around them, and one of the most important controls is change management. The primary components of a proper change-management system are:

  • Review — Proper review and testing of any proposed changes will greatly increase the likelihood of a successful, nondisruptive change, and will help prevent intentionally or unintentionally malicious changes from being undertaken. Note that a proper review also ensures that there is proper documentation.

  • Authorization — Ensures that changes (to, say, the payroll system) have been properly authorized according to the policies of the organization. Some systems might require only a very low level of authorization to change, but you might want to make sure that any changes to the CEO's laptop are made by a limited set of people.

  • Proper implementation — A standard and documented implementation process will help avoid mistakes, will keep downtime to a minimum, and will make your maintenance windows much more bearable.

  • Ability to back out — This is often overlooked, but a proper change-management system is prepared to deal with changes that fail (in whatever way) once they are put into production, and ensures that there is a way to get back to the pre-change, working, reliable system.

In Summary

Security is a wide-ranging topic and has an impact on many areas of an organization's activities. Proper security systems, mechanisms, policies, and practices sometimes augment reliability, but in many ways their primary reliability benefit is in preventing the intentional or unintentional reduction of current reliability levels.

This is the ninth article in the On Reliability series published in ;login: over the past two years and concludes the list of topics that I had planned to cover. Thanks very much for reading, and I hope you've found this series useful.

References

[1] The S/KEY One-Time Password System from Bellcore. <ftp://ftp.bellcore.com/pub/nmh/>

[2] TCP Wrapper by Wietse Venema (<wietse@porcupine.org>). <ftp://ftp.porcupine.org/pub/security/index.html>

[3] Tripwire, originally by Gene Kim and Gene Spafford, is available from <ftp://info.cert.org/pub/tools/tripwire/> and provides tools to track when system files change unexpectedly.

 

 

Last changed: 16 Nov. 1999 mc