First Workshop on Real, Large Distributed Systems — Abstract

Beyond Availability: Towards a Deeper Understanding of Machine Failure Characteristics in Large Distributed Systems

Praveen Yalagandula, The University of Texas at Austin; Suman Nath, Carnegie Mellon University; Haifeng Yu, Intel Research Pittsburgh and Carnegie Mellon University; Phillip B. Gibbons, Intel Research Pittsburgh; Srinivasan Seshan, Carnegie Mellon University

Abstract

Although many previous research efforts have investigated machine failure characteristics in distributed systems, availability research has reached a point where properties beyond these initial findings become important. In this paper, we analyze traces from three large distributed systems to answer several subtle questions regarding machine failure characteristics. Based on our findings, we derive a set of fundamental principles for designing highly available distributed systems. Using several case studies, we further show that our design principles can significantly influence the availability design choices in existing systems.

Last changed: 1 Dec. 2004 ch