NFS Tracing By Passive Network Monitoring (Extended Abstract)
Matt Blaze
Department of Computer Science
Princeton University
mab@cs.princeton.edu
Traces of filesystem activity have proven to be useful for a wide
variety of purposes, ranging from quantitative analysis of system
behavior to trace-driven simulation of filesystem algorithms. Such
traces can be difficult to obtain, however, usually entailing
modification of the filesystems to be monitored and runtime overhead
for the period of the trace. Largely because of these difficulties, a
surprisingly small number of filesystem traces have been conducted,
and few sample workloads are available to filesystem researchers.
This paper describes a portable toolkit for deriving approximate
traces of NFS [1] activity by non-intrusively monitoring the
Ethernet traffic to and from the file server. The toolkit uses a
promiscuous Ethernet listener interface (such as the Packetfilter[2])
to read and reconstruct NFS-related RPC packets intended
for the server. It produces traces of the NFS activity as well as a
plausible set of corresponding client system calls. The tool is
currently in use at Princeton and other sites, and is available via
anonymous ftp.
Motivation
Traces of real workloads form an important part of virtually all
analysis of computer system behavior, whether it is program hot spots,
memory access patterns, or filesystem activity that is being studied.
In the case of filesystem activity, obtaining useful traces is
particularly challenging. Filesystem behavior can span long time
periods, often making it necessary to collect huge traces over weeks
or even months. Modification of the filesystem to collect trace data
is often difficult, and may result in unacceptable runtime overhead.
Distributed filesystems exacerbate these difficulties, especially when
the network is composed of a large number of heterogeneous machines.
As a result of these difficulties, only a relatively small number of
traces of Unix filesystem workloads have been conducted, primarily in
computing research environments. [3], [4] and [5] are examples of
such traces.
Since distributed filesystems work by transmitting their activity over
a network, one can collect traces of such systems by placing a "tap"
on the network and observing the network traffic. Ethernet[6] based
networks lend themselves to this approach particularly well, since
traffic is broadcast to all machines connected to a given subnetwork.
General-purpose network monitoring tools (such as [7])
that "promiscuously" listen to the Ethernet are
useful for observing (and collecting statistics on) specific types of
packets, but the information they provide is at too low a level to be
useful for building filesystem traces. Filesystem operations may span
several packets, and may be meaningful only in the context of other,
previous operations.
Previous work on distributed filesystem network traffic has focused on
characterizing the network load itself (e.g., [8]). While useful for
understanding traffic patterns and developing a queueing model of NFS
loads, these previous studies do not use the network traffic to
analyze the file access
patterns of the system, focusing instead on developing a
statistical model of the individual packet sources, destinations, and
types. Higher-level studies of file access patterns have traditionally
involved installing a trace package directly on the client and/or
server machines.
This paper describes a toolkit for collecting traces of NFS file
access activity by monitoring Ethernet traffic. A "spy" machine with
a promiscuous Ethernet interface is connected to the same network as
the file server. Each NFS packet is analyzed and a trace is produced
at an appropriate level of detail.
We partition the problem of deriving NFS activity from raw network
traffic into two fairly distinct subproblems: that of decoding the
low-level NFS operations from the packets on the network, and that of
translating these low-level commands back into user-level system
calls. Hence, the toolkit consists of two basic parts, an "RPC
decoder" (rpcspy) and an "NFS analyzer" (nfstrace).
rpcspy communicates with a low-level network monitoring
facility ([2], [9]) to read and reconstruct the RPC transactions (call
and reply) that make up each NFS command. nfstrace takes the
output of rpcspy and reconstructs an approximation of the
system calls that triggered the activity.
NFS Protocol Overview
It is well beyond the scope of this paper to describe the protocols
used by NFS; for a detailed description of how NFS works, the reader
is referred to [10], [11], and [12]. This section will give a very brief
overview of how NFS activity translates into Ethernet packets and the
problems a monitor program might have reconstructing the activity. In
particular, we discuss the stateless nature of NFS (no open or close
calls) and the way files are represented by handles.
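As a rough illustration of how an NFS operation appears on the wire, the sketch below decodes the fixed-size header of an ONC RPC call message carried in a UDP payload. The field layout, the NFS program number (100003), and the version 2 procedure numbers follow the published RPC and NFS specifications; the function name decode_call_header and the sample packet are hypothetical, and this is not taken from the rpcspy source. Procedure 0 is the no-op null procedure.

```python
import struct

NFS_PROGRAM = 100003  # ONC RPC program number assigned to NFS

# NFS version 2 procedure numbers (procedure 0 is the no-op "null").
NFS_V2_PROCS = {
    0: "null", 1: "getattr", 2: "setattr", 3: "root", 4: "lookup",
    5: "readlink", 6: "read", 7: "writecache", 8: "write", 9: "create",
    10: "remove", 11: "rename", 12: "link", 13: "symlink", 14: "mkdir",
    15: "rmdir", 16: "readdir", 17: "statfs",
}


def decode_call_header(payload):
    """Return (xid, procedure_name) for an NFS v2 call, else None.

    The first six 32-bit big-endian words of an RPC call are:
    xid, message type (0 = call), RPC version (2), program,
    program version, and procedure number.
    """
    if len(payload) < 24:
        return None
    xid, msg_type, rpcvers, prog, vers, proc = struct.unpack(">6I", payload[:24])
    if msg_type != 0 or rpcvers != 2:
        return None  # a reply, or not an RPC version 2 message
    if prog != NFS_PROGRAM or vers != 2:
        return None  # some other RPC service
    return xid, NFS_V2_PROCS.get(proc, "proc%d" % proc)


# Hypothetical call packet: xid 0x1234, NFS v2, procedure 6 (read).
sample = struct.pack(">6I", 0x1234, 0, 2, NFS_PROGRAM, 2, 6)
print(decode_call_header(sample))
```

Note that nothing in the header identifies an open or a close: the monitor sees only individual operations on file handles, which is why the higher-level activity has to be inferred.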
The rpcspy Program
rpcspy is the interface to the system-dependent Ethernet monitoring
facility; it produces a trace of the RPC calls issued between a given
set of clients and servers. This section describes the overall
function of rpcspy in detail.
For each RPC transaction monitored, rpcspy produces an ASCII record
containing a timestamp, the name of the server, the client, the length
of time the command took to execute, the name of the RPC command
executed, and the command-specific arguments and return data.
Currently, rpcspy understands and can decode the 17 NFS RPC commands,
and there are hooks to allow other RPC services (for example, NIS) to
be added reasonably easily. The output may be read directly or piped
into another program (such as nfstrace) for further analysis; the
format is designed to be reasonably friendly to both the human reader
and other programs (such as nfstrace or awk).
Since each RPC transaction consists of two messages, a call and a
reply, rpcspy waits until it receives both these components and
emits a single record for the entire transaction. The basic output
format is 7 vertical-bar-separated fields:
timestamp | execution-time | server |
client | command-name | arguments | reply-data
where
timestamp
is the time the reply message was received,
execution-time
is the time (in microseconds) that elapsed between the call and reply,
server
is the name (or IP address) of the server,
client
is the name (or IP address) of the client followed by the userid that
issued the command,
command-name
is the name of the particular program invoked
(read, write, getattr, etc.), and
arguments
and
reply-data
are the command dependent arguments and return values passed to and
from the RPC program, respectively.
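The pairing of a call with its reply can be sketched as follows. This is a minimal illustration, assuming that pending calls are keyed by the client and the RPC transaction id (xid) and that timestamps are kept as integer microseconds; the class TransactionMatcher and the function format_record are hypothetical names and are not taken from the rpcspy source.

```python
class TransactionMatcher:
    """Hold calls until the matching reply arrives, keyed by (client, xid)."""

    def __init__(self):
        self.pending = {}

    def on_call(self, client, xid, when_us, command, args):
        # Remember the call; a retransmitted call simply overwrites it.
        self.pending[(client, xid)] = (when_us, command, args)

    def on_reply(self, server, client, xid, when_us, reply_data):
        call = self.pending.pop((client, xid), None)
        if call is None:
            return None  # the call was never seen (e.g., a lost packet)
        called_us, command, args = call
        elapsed_us = when_us - called_us
        return (when_us, elapsed_us, server, client, command, args, reply_data)


def format_record(record):
    """Render one transaction as 7 vertical-bar-separated fields."""
    when_us, elapsed_us, server, client, command, args, reply = record
    secs, micros = divmod(when_us, 1000000)
    fields = ["%d.%06d" % (secs, micros), str(elapsed_us),
              server, client, command, args, reply]
    return " | ".join(fields)


matcher = TransactionMatcher()
matcher.on_call("merckx.321", 7, 690529992155423, "read",
                '{"7b1f00000000083c", 0, 8192}')
record = matcher.on_reply("paramount", "merckx.321", 7,
                          690529992167140, "ok, 1871")
print(format_record(record))
```

Keeping the timestamps as integers avoids floating-point rounding when the execution time is computed as a difference of two large values.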
The exact format of the argument and reply data is dependent on the
specific command issued and the level of detail the user wants logged.
For example, a typical NFS command is recorded as follows:
690529992.167140 | 11717 | paramount | merckx.321 | read |
{"7b1f00000000083c", 0, 8192} | ok, 1871
In this example, uid 321 at client "merckx" issued an NFS read
command to server "paramount". The reply was issued at (Unix
time) 690529992.167140 seconds; the call command occurred 11717
microseconds earlier. Three arguments are logged for the read call:
the file handle from which to read (represented as a hexadecimal
string), the offset from the beginning of the file, and the number of
bytes to read. In this example, 8192 bytes are requested starting at
the beginning (byte 0) of the file whose handle is
"7b1f00000000083c". The command completed successfully (status
"ok"), and 1871 bytes were returned. Of course, the reply
message also included the 1871 bytes of data from the file, but that
field of the reply is not logged by rpcspy.
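Because the record format is plain text, it is easy to consume from other programs. The sketch below parses one record into named fields; it assumes the record has been joined onto a single line and that no field contains a vertical bar. The names RpcRecord and parse_rpcspy_line are hypothetical and are not part of the toolkit.

```python
from typing import NamedTuple


class RpcRecord(NamedTuple):
    timestamp: float    # Unix time at which the reply was received
    execution_us: int   # microseconds between call and reply
    server: str
    client: str
    uid: int            # userid that issued the command
    command: str
    arguments: str      # left unparsed; the format depends on the command
    reply: str          # left unparsed; the format depends on the command


def parse_rpcspy_line(line):
    """Split one vertical-bar-separated record into its fields."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    timestamp, elapsed, server, client_uid, command, arguments, reply = fields
    # The userid follows the last dot, so dotted IP addresses also work.
    client, _, uid = client_uid.rpartition(".")
    return RpcRecord(float(timestamp), int(elapsed), server,
                     client, int(uid), command, arguments, reply)


line = ('690529992.167140 | 11717 | paramount | merckx.321 | read | '
        '{"7b1f00000000083c", 0, 8192} | ok, 1871')
record = parse_rpcspy_line(line)
print(record.client, record.uid, record.command, record.execution_us)
```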
Implementation Issues
This section describes the implementation of rpcspy and some
of the less obvious problems encountered in getting it to work
correctly. In particular, we discuss the representation of file handles
across various NFS implementations, caching of IP address/name
translations, memory management, and packet fragmentation.
nfstrace: The Filesystem Tracing Package
nfstrace is a filter for rpcspy that produces a log of a
plausible set of user level filesystem commands that could have
triggered the monitored activity. A record is produced each time a
file is opened, giving a summary of what occurred. This summary is
detailed enough for analysis or for use as input to a filesystem
simulator.
The output format of nfstrace consists of 7 fields:
timestamp | command-time | direction |
file-id | client | transferred | size
where
timestamp
is the time the open occurred,
command-time
is the length of time between open and close,
direction
is either read or write,
file-id
identifies the server and the file handle,
client
is the client and user that performed the open,
transferred
is the number of bytes of the file actually read or written, and
size
is the size of the file (in bytes).
An example record might be as follows:
690691919.593442 | 17734 | read | basso:7b1f00000000400f |
frejus.321 | 0 | 24576
Here, userid 321 at client frejus read file
7b1f00000000400f on server basso. The file is 24576 bytes
long and, because no bytes were transferred, could be read entirely
from the client cache. The command started at Unix time
690691919.593442 and took 17734 microseconds at the server to execute.
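A record of this form can be split into fields and tested for a cache hit with a few lines of code. The sketch below assumes the record has been joined onto a single line, and it treats a read with zero bytes transferred as having been satisfied from the client cache, as in the example above. The function names are hypothetical and are not part of the toolkit.

```python
def parse_nfstrace_line(line):
    """Split one nfstrace record into a dictionary of named fields."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    timestamp, duration, direction, file_id, client_uid, transferred, size = fields
    server, _, handle = file_id.partition(":")
    client, _, uid = client_uid.rpartition(".")
    return {
        "timestamp": float(timestamp),
        "command_time_us": int(duration),
        "direction": direction,
        "server": server,
        "handle": handle,
        "client": client,
        "uid": int(uid),
        "transferred": int(transferred),
        "size": int(size),
    }


def read_from_cache(record):
    """True when a read moved no bytes over the network."""
    return record["direction"] == "read" and record["transferred"] == 0


line = ("690691919.593442 | 17734 | read | basso:7b1f00000000400f | "
        "frejus.321 | 0 | 24576")
record = parse_nfstrace_line(line)
print(record["server"], record["handle"], read_from_cache(record))
```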
nfstrace produces an approximation of the underlying user activity.
This section will describe the heuristics used by nfstrace to approximate
the original system calls. We discuss the discovery of cache hits vs.
cache misses, file name translation and other such issues.
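This abstract does not spell out the heuristics, so the sketch below shows only one plausible approach and should not be read as the method nfstrace implements. It assumes that operations by the same client on the same file handle belong to one open as long as they are separated by less than an idle gap, and that an open during which only attribute checks (getattr) were seen, with no reads, was satisfied from the client cache. The function name, the simplified record tuples, and the two-second threshold are all hypothetical.

```python
IDLE_GAP = 2.0  # seconds; an arbitrary, hypothetical threshold


def group_into_opens(records, idle_gap=IDLE_GAP):
    """Group low-level operations into inferred opens.

    records: iterable of (time, client, handle, command, nbytes),
    sorted by time. Returns one dictionary per inferred open.
    """
    active = {}    # (client, handle) -> open currently in progress
    finished = []
    for when, client, handle, command, nbytes in records:
        key = (client, handle)
        session = active.get(key)
        if session is not None and when - session["last"] > idle_gap:
            finished.append(active.pop(key))  # idle too long: treat as closed
            session = None
        if session is None:
            session = {"start": when, "last": when, "client": client,
                       "handle": handle, "direction": "read", "transferred": 0}
            active[key] = session
        session["last"] = when
        if command == "write":
            session["direction"] = "write"
        if command in ("read", "write"):
            session["transferred"] += nbytes
    finished.extend(active.values())
    finished.sort(key=lambda s: s["start"])
    return finished


records = [
    (0.0, "frejus.321", "7b1f00000000400f", "getattr", 0),
    (10.0, "frejus.321", "7b1f00000000400f", "getattr", 0),
    (10.1, "frejus.321", "7b1f00000000400f", "read", 8192),
    (10.2, "frejus.321", "7b1f00000000400f", "read", 1871),
]
for inferred in group_into_opens(records):
    hit = inferred["direction"] == "read" and inferred["transferred"] == 0
    print(inferred["start"], inferred["transferred"], "cache hit" if hit else "miss")
```

A fixed idle gap is a crude stand-in for the open and close calls that NFS never transmits; any real threshold would need to be validated against a trace collected directly on the client.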
Using rpcspy and nfstrace for Filesystem Tracing
This section describes the applications of rpcspy and nfstrace.
Clearly, nfstrace is not suitable for producing highly accurate
traces; cache hits are only estimated, the timing information is
imprecise, and data from lost (and duplicated) network packets are not
accounted for. When such a highly accurate trace is required, we must
resort to more traditional tracing approaches.
We compare nfstrace with other approaches to file system tracing, and
describe the test suite that was used to validate the accuracy of the
trace results. We will also briefly discuss some of the social and
ethical issues arising out of research based on trace data collected
from real users.
A Trace of Filesystem Activity in the Princeton C.S. Department
In a previous paper[14] presented at USENIX, we analyzed a five-day-long
trace of filesystem activity conducted on 112 research
workstations at DEC-SRC. The paper identified a number of file access
properties that affect filesystem caching performance; it is
difficult, however, to know whether these properties were unique
artifacts of that particular environment or are more generally
applicable. This section describes how we used rpcspy and
nfstrace to conduct a week-long trace of filesystem activity in
the Princeton University Computer Science Department. Approximately
500,000 file opens were recorded.
We will compare the results of the Princeton nfstrace trace
with the DEC-SRC trace of the previous paper. We describe the
environment in which the traces were collected. Measurements of the
Princeton data were remarkably similar to those taken on the SRC data
in the previous paper.
In particular, we will examine observed hit rate, file write sharing and
file "entropy". Data will be described graphically and analytically.
Conclusions
Although not as accurate as direct, kernel-based tracing, a passive
network monitor such as the one described in this paper can permit
tracing of distributed systems relatively easily. The ability to
limit the data collected to a high-level log of only the data required
can make it practical to conduct traces over several months. Such a
long-term trace is presently being conducted at Princeton as part of
the author's research on filesystem caching. The non-intrusive nature
of the data collection makes traces possible at facilities where
kernel modification is impractical or unacceptable.
Availability
The toolkit is available for anonymous ftp over the Internet from
samadams.princeton.edu, in the compressed tar file
nfstrace/nfstrace.tar.Z.
References
[1] Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., & Lyon, B. "Design
and Implementation of the Sun Network File System,"
Proc. USENIX, Summer, 1985.
[2] Mogul, J., Rashid, R., & Accetta, M. "The Packet Filter: An Efficient
Mechanism for User-Level Network Code,"
Proc. 11th ACM Symp. on Operating Systems Principles, 1987.
[3] Ousterhout, J., et al. "A Trace-Driven Analysis of the Unix 4.2 BSD
File System,"
Proc. 10th ACM Symp. on Operating Systems Principles, 1985.
[4] Floyd, R. "Short-Term File Reference Patterns in a UNIX Environment,"
TR-177, Dept. Comp. Sci, U. of Rochester, 1986.
[5] Baker, M., et al. "Measurements of a Distributed File System,"
Proc. 13th ACM Symp. on Operating Systems Principles, 1991.
[6] Metcalfe, R. & Boggs, D. "Ethernet: Distributed Packet Switching for
Local Computer Networks,"
CACM, July, 1976.
[7] "Etherfind(8) Manual Page,"
SunOS Reference Manual, Sun Microsystems, 1988.
[8] Gusella, R. "Analysis of Diskless Workstation Traffic on an Ethernet,"
TR-UCB/CSD-87/379, University of California, Berkeley, 1987.
[9] "NIT(4) Manual Page,"
SunOS Reference Manual, Sun Microsystems, 1988.
[10] "XDR Protocol Specification,"
Networking on the Sun Workstation, Sun Microsystems, 1986.
[11] "RPC Protocol Specification,"
Networking on the Sun Workstation, Sun Microsystems, 1986.
[12] "NFS Protocol Specification,"
Networking on the Sun Workstation, Sun Microsystems, 1986.
[13] Postel, J. "User Datagram Protocol,"
RFC 768, Network Information Center, 1980.
[14] Blaze, M., & Alonso, R. "Long-Term Caching Strategies for Very Large
Distributed File Systems,"
Proc. Summer 1991 USENIX, 1991.