Trace-based experiments
While the single-file test can indicate a server's maximum performance
on a cached workload, it gives little indication of its performance on
real workloads. In the next experiment, the servers are subjected to
a more realistic load. We generate a client request stream by
replaying access logs from existing Web servers.
Figure 8: Performance on Rice Server Traces/Solaris
Figure 8 shows the throughput in Mb/sec
achieved with various Web servers on two different workloads. The
``CS trace'' was obtained from the logs of Rice University's Computer
Science departmental Web server. The ``Owlnet trace'' reflects traces
obtained from a Rice Web server that provides personal Web pages for
approximately 4500 students and staff members. The results were
obtained with the Web servers running on Solaris.
The results show that Flash with its AMPED architecture achieves the
highest throughput on both workloads. Apache achieves the lowest
performance. The comparison with Flash-MP shows that Apache's low
performance is only in part the result of its MP architecture, and is
mostly due to its lack of aggressive optimizations like those used in
Flash.
The Owlnet trace has a smaller dataset size than the CS trace, and it
therefore achieves better cache locality in the server. As a result,
Flash-SPED's relative performance is much better on this trace, while
Flash-MP performs well on the more disk-intensive CS trace. Even though
the Owlnet trace has high locality, its average transfer size is smaller
than that of the CS trace, resulting in roughly comparable bandwidth
numbers.
A second experiment evaluates server performance under realistic
workloads with a range of dataset sizes (and therefore working set
sizes). To generate an input stream with a given dataset size, we
use the access logs from Rice's ECE departmental Web server and
truncate them as appropriate to achieve a given dataset size. The
clients then replay this truncated log as a loop to generate requests.
In both experiments, two client machines with 32 clients each are
used to generate the workload.
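The log truncation and looped replay described above can be sketched as follows. This is only an illustration, not the tool used in the experiments: the function names, the (path, size) log representation, and the reading of "dataset size" as the total size of the distinct files referenced are all assumptions.

```python
from itertools import cycle, islice

def truncate_log(entries, target_bytes):
    """Return (prefix, dataset_bytes) for the longest log prefix whose
    distinct files total at most target_bytes.

    entries is a list of (path, size_in_bytes) pairs in log order
    (a hypothetical, simplified log format).
    """
    seen = set()
    total = 0
    prefix = []
    for path, size in entries:
        if path not in seen:
            # A new file would grow the dataset; stop if it no longer fits.
            if total + size > target_bytes:
                break
            seen.add(path)
            total += size
        prefix.append((path, size))
    return prefix, total

def replay_loop(prefix, n_requests):
    """Return n_requests paths obtained by replaying the prefix as a loop."""
    return [path for path, _ in islice(cycle(prefix), n_requests)]

if __name__ == "__main__":
    log = [("/a", 100), ("/b", 200), ("/a", 100), ("/c", 300), ("/b", 200)]
    prefix, dataset_bytes = truncate_log(log, 300)
    print(dataset_bytes, replay_loop(prefix, 5))
```

Repeated requests for a file already in the prefix do not grow the dataset, so the cut point depends on when new files first appear in the log rather than on the number of requests.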
Figure 9: FreeBSD Real Workload - The SPED architecture is ideally
suited for cached workloads, and when the working set fits in cache,
Flash mimics Flash-SPED. However, Flash-SPED's performance drops
drastically when operating on disk-bound workloads.
Figure 10: Solaris Real Workload - The Flash-MT server has comparable
performance to Flash for both in-core and disk-bound workloads. This
result was achieved by carefully minimizing lock contention, which
added complexity to the code. Without this effort, the disk-bound
results resembled those of Flash-SPED.
Figures 9 (FreeBSD) and 10
(Solaris) show the performance, measured as the total output
bandwidth, of the various servers under real workloads and various
dataset sizes. We report output bandwidth instead of requests/sec in
this experiment, because truncating the logs at different points to
vary the dataset size also changes the size distribution of requested
content. This causes fluctuations in the throughput in requests/sec,
but the output bandwidth is less sensitive to this effect.
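As a small illustration of the two metrics being compared, the sketch below computes both from the same set of completed responses. The function name is hypothetical, and it assumes "Mb" means 10^6 bits; it is not taken from the measurement code used in the experiments.

```python
def throughput(response_sizes_bytes, elapsed_sec):
    """Return (requests_per_sec, megabits_per_sec) for one measurement
    interval, given the size in bytes of every response completed in it.

    Assumes 1 Mb = 10**6 bits.
    """
    requests_per_sec = len(response_sizes_bytes) / elapsed_sec
    megabits_per_sec = sum(response_sizes_bytes) * 8 / 1e6 / elapsed_sec
    return requests_per_sec, megabits_per_sec

if __name__ == "__main__":
    # Two made-up size mixes carrying the same number of bytes in 2 seconds.
    many_small = [125_000] * 4
    few_large = [250_000] * 2
    print(throughput(many_small, 2.0))
    print(throughput(few_large, 2.0))
```

With the made-up inputs above, both mixes yield the same output bandwidth while the request rate differs by a factor of two, which is the kind of shift in the size distribution that truncating a log at different points can introduce.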
The performance of all the servers declines as the dataset size
increases, and there is a significant drop at the point when the
working set size (which is related to the dataset size) exceeds the
server's effective main memory cache size. Beyond this point, the
servers are essentially disk bound. Several observations can be made
based on these results:
- Flash is very competitive with Flash-SPED on cached workloads,
and at the same time exceeds or meets the performance of the MP
servers on disk-bound workloads. This confirms that Flash with its
AMPED architecture is able to combine the best of other architectures
across a wide range of workloads. This goal was central to the
design of the AMPED architecture.
- The slight performance difference between Flash and Flash-SPED
on the cached workloads reflects the overhead of checking for cache
residency of requested content in Flash.
Since the data is already in memory, this test causes unnecessary
overhead on cached workloads.
- The SPED architecture performs well for cached workloads but its
performance deteriorates quickly as disk activity increases. This
confirms our earlier reasoning about the performance tradeoffs
associated with this architecture. The same behavior can be seen in
the performance of the SPED-based Zeus, although its absolute
performance falls short of the various Flash-derived servers.
- The performance of the Flash-MP server falls significantly short of
that achieved with the other architectures on cached workloads. This
is likely the result of the smaller user-level caches used in Flash-MP
as compared to the other Flash versions.
- The choice of an operating system has a significant impact on
Web server performance. Performance results obtained on Solaris are up
to 50% lower than those obtained on FreeBSD. The operating system
also has some impact on the relative performance of the various Web
servers and architectures, but the trends are less clear.
- Flash achieves higher throughput on disk-bound workloads because
it can be more memory-efficient and causes less context switching than
MP servers. Flash only needs enough helper processes to keep the disk
busy, rather than needing a process per connection. Additionally, the
helper processes require little application-level memory. The
combination of fewer total processes and small helper processes
reduces memory consumption, leaving extra memory for the filesystem
cache.
- The performance of Zeus on FreeBSD appears to drop only after
the dataset exceeds 100 MB, while the other servers drop earlier. We
believe this phenomenon is related to Zeus's request handling, which
appears to give priority to requests for small documents. Under full
load, this tends to starve requests for large documents and thus
causes the server to process a somewhat smaller effective working
set. The overall lower performance under Solaris appears to mask this
effect on that OS.
- As explained above, Zeus uses a two-process configuration in
this experiment, as advised by the vendor. It should be noted that
this gives Zeus a slight advantage over the single-process Flash-SPED,
since one process can continue to serve requests while the other is
blocked on disk I/O.
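The memory argument made in the list above (fewer and smaller processes leave more memory for the filesystem cache) can be made concrete with back-of-the-envelope arithmetic. Every number below is hypothetical and chosen only for illustration; none of them is a measurement from these experiments, and the function name is an assumption as well.

```python
def memory_left_for_fs_cache(total_mb, process_sizes_mb):
    """Memory (in MB) not consumed by server processes, and therefore
    available to the filesystem cache in this simplified model."""
    return total_mb - sum(process_sizes_mb)

if __name__ == "__main__":
    TOTAL_MB = 128.0  # hypothetical amount of memory available to the server

    # Hypothetical MP server: one 1 MB process per concurrent connection.
    mp_processes = [1.0] * 64

    # Hypothetical AMPED server: one 4 MB main process plus 8 small helpers,
    # enough to keep the disk busy regardless of the number of connections.
    amped_processes = [4.0] + [0.25] * 8

    print("MP:   ", memory_left_for_fs_cache(TOTAL_MB, mp_processes))
    print("AMPED:", memory_left_for_fs_cache(TOTAL_MB, amped_processes))
```

The model ignores kernel memory, shared pages, and context-switch costs; it only shows how the number of processes and their size enter the comparison.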
Results for the Flash-MT servers could not be provided for FreeBSD
2.2.6, because that system lacks support for kernel threads.
Peter Druschel
1999-04-27