NSDI '08 – Abstract
Pp. 423–437 of the Proceedings
D3S: Debugging Deployed Distributed Systems
Xuezheng Liu and Zhenyu Guo, Microsoft Research Asia; Xi Wang, Tsinghua University; Feibo Chen, Fudan University; Xiaochen Lian, Shanghai Jiaotong University; Jian Tang and Ming Wu, Microsoft Research Asia; M. Frans Kaashoek, MIT CSAIL; Zheng Zhang, Microsoft Research Asia
Abstract
Testing large-scale distributed systems is a challenge, because some
errors manifest themselves only after a distributed sequence of
events that involves machine and network failures. D3S is a
checker that allows developers to specify predicates on distributed
properties of a deployed system, and that checks these predicates
while the system is running. When D3S finds a problem, it produces
the sequence of state changes that led to the problem,
allowing developers to quickly find the root cause.
Developers write predicates in a simple and sequential programming
style, while D3S checks these predicates in a distributed and
parallel manner, so that checking scales to large systems and
tolerates faults. By using binary instrumentation, D3S works
transparently with legacy systems and can change predicates to be
checked at runtime. An evaluation with five deployed systems shows that
D3S can detect non-trivial correctness and performance bugs at
runtime with low performance overhead (less than 8%).
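To illustrate what a predicate written in a "simple and sequential" style might look like, here is a minimal sketch in Python. It is not D3S's actual predicate language or API: the function name `check_lock_exclusivity`, the tuple layout `(client_id, lock_id, mode)`, and the mode strings are all hypothetical, chosen only to show a property over state gathered from many machines being checked by ordinary sequential code.

```python
from collections import defaultdict


def check_lock_exclusivity(snapshot):
    """Return locks whose exclusive holder conflicts with another holder.

    `snapshot` is an iterable of (client_id, lock_id, mode) tuples, where
    mode is "EXCLUSIVE" or "SHARED". This layout is an assumption made for
    illustration; it is not taken from the paper.
    """
    exclusive = defaultdict(set)  # lock_id -> clients holding it exclusively
    shared = defaultdict(set)     # lock_id -> clients holding it in shared mode

    # Group the collected per-client states by lock.
    for client, lock, mode in snapshot:
        if mode == "EXCLUSIVE":
            exclusive[lock].add(client)
        else:
            shared[lock].add(client)

    # An exclusively held lock must have exactly one holder in total.
    violations = {}
    for lock, holders in exclusive.items():
        all_holders = holders | shared.get(lock, set())
        if len(all_holders) > 1:
            violations[lock] = sorted(all_holders)
    return violations


if __name__ == "__main__":
    states = [
        ("c1", "L1", "EXCLUSIVE"),
        ("c2", "L1", "SHARED"),
        ("c3", "L2", "SHARED"),
        ("c4", "L2", "SHARED"),
    ]
    print(check_lock_exclusivity(states))
```

The predicate itself contains no networking or parallelism; in a system like the one described, gathering the states and partitioning the check across machines would be the checker's job rather than the developer's.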
The Proceedings are published as a collective work, © 2008 by the USENIX Association. All Rights Reserved. Rights to individual papers remain with the author or the author's employer. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. USENIX acknowledges all trademarks within this paper.