Over the past few years, computer viruses and worms have evolved from nuisances into some of the most serious threats to Internet-connected computing assets. Global infections such as Code Red and Code Red II [21,40], Nimda [30], Slammer [20], MSBlaster [18], and MyDoom [17] are among an ever-growing number of self-replicating malicious code attacks plaguing the Internet. These attacks have caused major disruptions, affecting hundreds of thousands of computers worldwide.
Recognition and diagnosis of these threats play an important role in defending computing assets. Until recently, however, network defense has been viewed as the responsibility of individual sites. Firewalls, intrusion detection systems, and antivirus tools are, for the most part, deployed in the mode of independent site protection. Although these tools successfully defend against low or moderate levels of attack, no known technology can completely prevent large-scale concerted attacks.
There is an emerging interest in the development of Internet-scale threat analysis centers. Conceptually, these centers are data repositories to which pools of volunteer networks contribute security alerts, such as firewall logs, antivirus reports, and intrusion detection alerts (we use the terms analysis center and alert repository interchangeably). By collecting continually updated alerts across a wide and diverse contributor pool, one hopes to gain a perspective on Internet-wide trends, dominant intrusion patterns, and inflections in alert content that may indicate new, fast-spreading threats. The size and diversity of the contributor pool are thus of great importance, as they determine the speed and fidelity with which threat diagnoses can be formulated.
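To make the discussion concrete, the sketch below shows what a contributed alert record might contain. The field names and types are our illustrative assumptions, not the schema of any existing repository; the point is that several fields, notably contributor-side addresses, are sensitive.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    """Hypothetical alert record; all fields are illustrative assumptions."""
    timestamp: float   # when the sensor observed the event
    src_ip: str        # external (attacker-side) address
    dst_ip: str        # contributor-side address; sensitive, reveals topology
    dst_port: int      # targeted service
    sensor: str        # e.g. "firewall", "ids", "antivirus"
    signature: str     # the sensor's label for the suspected attack

# Example record, using documentation/private address ranges:
example = Alert(1_700_000_000.0, "198.51.100.9", "10.1.2.3", 445,
                "firewall", "blocked inbound SMB connection")
```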
We are interested in protecting sensitive data contained in security alerts against malicious users of alert repositories and against corrupt repositories themselves. The risk of leaking sensitive information may shrink the size and diversity of the contributor pool, create legal liabilities for center managers, and limit access to raw alert content. We consider a three-way tradeoff between privacy, utility, and performance: the privacy of alert contributors; the utility of the analyses that can be performed on the sanitized data; and the performance cost that must be borne by alert contributors and analysts. Our objective is a solution that is reasonably efficient, privacy-preserving, and practically useful.
We investigate several types of attacks, including dictionary attacks that defeat simple-minded data protection schemes based on hashing IP addresses (illustrated below). In particular, we focus on attackers who may use the analysis center as a means to probe the security posture of a specific contributor and to infer sensitive data, such as internal network topology, by analyzing (artificially stimulated) alerts. We present a set of techniques for sanitizing alert data. These techniques ensure the secrecy of sensitive information contained in the alerts while enabling a large class of legitimate analyses over the sanitized alert pool. We then explain how the trust required between alert contributors and analysis centers can be further reduced by deploying an overlay protocol for randomized alert routing, and we give a quantitative estimate of the anonymity this technique provides. We conclude by discussing performance issues.
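To see why unkeyed hashing fails, recall that the IPv4 address space contains only 2^32 values, so an adversary can hash every candidate address (or every address in a suspected subnet) and invert the "protected" values offline. The toy demonstration below is a sketch under assumed details (SHA-256 over the dotted-quad string, a documentation subnet as the search space); it is not the specific scheme analyzed later in the paper.

```python
import hashlib
from ipaddress import ip_network

def sanitize(ip: str) -> str:
    """The flawed scheme: 'protect' an address with an unkeyed hash."""
    return hashlib.sha256(ip.encode()).hexdigest()

def dictionary_attack(target_digest: str, candidate_net: str) -> str | None:
    """Recover the address by hashing every candidate in a suspected subnet."""
    for addr in ip_network(candidate_net).hosts():
        if sanitize(str(addr)) == target_digest:
            return str(addr)
    return None

digest = sanitize("192.0.2.77")                    # what the contributor publishes
print(dictionary_attack(digest, "192.0.2.0/24"))   # attacker recovers 192.0.2.77
```

Keying the hash with a secret (e.g., an HMAC) is one standard way to deny the attacker this precomputation, since the dictionary can no longer be built offline.

The randomized alert routing mentioned above can be sketched in a similar toy fashion. In the model below, each relay forwards the alert to a randomly chosen peer with a fixed probability and otherwise delivers it, so the submitting node is rarely the true originator. The forwarding probability and the uniform peer choice are illustrative assumptions, not the parameters of the actual overlay protocol.

```python
import random

FORWARD_PROB = 0.75  # illustrative parameter, not a value from the paper

def route_alert(peers: list[str]) -> list[str]:
    """Return the random relay path an alert takes before submission."""
    path = [random.choice(peers)]            # originator hands off to a relay
    while random.random() < FORWARD_PROB:    # each relay forwards with prob p
        path.append(random.choice(peers))
    return path                              # the last relay submits the alert

print(route_alert([f"node{i}" for i in range(20)]))
```

With forwarding probability p, the expected path length is 1/(1-p) relays, so from the repository's viewpoint any single submission provides only weak evidence about the alert's true source.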