Review of Two Papers on Performance of Remote
Procedure Calls: `Performance of Firefly RPC' and
`Lightweight Remote Procedure Call'
Quentin Fennessy
qfennessy@gmail.com
Unpublished, originally written June 26, 1996
This review covers two papers that discuss performance problems and resolutions in remote
procedure calls (RPC). The first paper [1] is a detailed report on RPC performance on the
Firefly multiprocessor system and includes precise measurements of the latency in the
system RPC. The paper goes to great lengths to account for time spent in RPCs, breaking
them down into packet creation, transmission, and reception. The authors also estimate the
gains that certain proposed improvements would yield. The second paper [2] is a report on
a highly optimized pseudo-RPC called LRPC (Lightweight RPC) as implemented on Taos (an
operating system also running on the Firefly). Lightweight RPC optimizes performance for
RPC calls that do not cross machine boundaries and do not involve large or complicated
data structures.
These two papers describe two approaches to the same problem: RPC performance
optimization. Firefly RPC (in [1]) is traditional RPC with stub compilation in a
high level language. The RPCs in Firefly handle arbitrarily complex data structures and are
semantically consistent for both local and remote RPCs. Both Firefly RPC and LRPC are
optimized -- that is, the implementation is not straightforward but sacrifices benefits such as
security and portability in the quest for high performance. LRPC (in [2]) is a more exotic
implementation -- RPC so highly optimized for common cases that it barely deserves the
name. LRPC does not handle inter-machine communication and will only handle simple
data structures. The authors of [2] present a good case that most RPCs are actually local
calls with very simple data structure requirements.
The Firefly RPC paper [1] includes remarkably precise and detailed measurements of RPC
latency and throughput. It is very interesting to see both the fixed and variable delays
involved in process communication. The authors baselined their timing via null RPCs, and
compared those times to largest-packet-sized RPCs. Some timings were done by sending
10,000 packets and dividing elapsed time by 10,000, and some were done by counting
machine instructions involved and summing the times the instructions took (from a processor
reference manual). Not surprisingly, a large part of the cost was variable and depended on
the size of the RPC packet (Ethernet latency, UDP checksum calculation and system bus
latency).
Some of the interesting aspects of the Firefly optimization were: the awakening of threads to
handle received packets from within interrupt routines and the sharing of address space
between all processes using RPC and the Ethernet driver. The authors admit that these
performance improvements `collapse layers of abstraction' and also admit to the security
implications of shared buffer space.
The Lightweight RPC paper [2] (also about the Firefly system running the Taos operating
system) discusses a more radical RPC implementation. RPCs traditionally look like normal
function calls but are actually synchronous communication mechanisms between distinct
remote or local processes. The authors argue that simple and local RPCs deserve
optimization as they constitute the bulk of interprocess communication. Accordingly, they
have implemented LRPC (Lightweight RPC) with four new techniques. First, the control
transfer between client and server is simplified: the client directly executes the requested
procedure in the server's address space. Second, client and server share an argument stack.
Third, LRPC uses simple stubs that preclude sending complex data structures. Fourth and finally,
LRPC avoids shared data structure bottlenecks and can take advantage of free processors
in the Firefly multiprocessor system.
These four techniques present interesting tradeoffs. Similar to [1] there are security
implications in the optimizations that involve shared data space between client and server.
These are addressed in several ways. Client binding to servers is handled carefully: clients
cannot communicate without objects that identify them to servers, and client calls are
rigorously checked before being mapped and executed in server space. RPC stubs are
generated in Firefly machine language; in the homogeneous Firefly environment this is not
an issue. These stubs are up to four times faster than compiled Modula-2+ stubs. LRPC stubs
are invoked directly by the Firefly kernel thus avoiding data copying or message checking
in user space. LRPC is optimized for multiprocessor use by avoiding shared data structures.
Shared argument stacks are locked individually and queuing on these stacks takes less than
2% of call time.
The first paper discusses optimization of traditional RPCs. These optimizations are easily
described but increase execution risk and probably make Firefly's kernel code harder to
maintain. Firefly RPC performance is compared with other distributed systems such as
Sprite, Amoeba, V, Cedar and UNIX. Although Firefly is the only VAX-based system, the
absolute performance numbers are interesting to compare. Firefly RPC latency (at about
2.7ms/call) is within 0.2ms of the fastest RPC implementation (in V). Firefly RPC throughput
(at 4.6Mbit/sec) is above the median of the compared systems but not quite as fast as that
of Sprite (at 5.6Mbit/sec).
The second paper optimizes Firefly RPCs for the simple cases -- local calls and simple data
structures. LRPCs are demonstrably lightweight. A null LRPC adds only 48usec to the minimum
time for each operation (for a total of 157usec). LRPC at 157usec compares very favorably with
the Firefly null RPC at 464usec (a 3:1 difference!). Larger calls show almost the same ratio:
LRPC 200 byte calls at 227usec and Firefly RPC at 636usec. The multiprocessor optimizations
produce good linearity with respect to number of processors. Firefly RPC plateaus at two
processors while LRPC is linear at least to four processors.
These two papers agree in several areas on RPC optimization. Unfortunately, high-level
language implementations, nicely layered designs, clearly distinguished protection domains
and arbitrarily complex data structures are all sacrificed to the need for speed. RPC
optimization is critical to RPC acceptance, as programmers will otherwise work around the
system. Fortunately, the dirty details of these optimizations can to a large degree be hidden
from programmers and users, thus allowing higher-level software engineering techniques in
user code.
Works Cited
[1] M. D. Schroeder and M. Burrows, "Performance of Firefly RPC," ACM Transactions on
Computer Systems, vol. 8, no. 1, pp. 1-17, Feb. 1990.
[2] B. N. Bershad, T. E. Anderson, E. D. Lazowska and H. M. Levy, "Lightweight Remote
Procedure Call," ACM Transactions on Computer Systems, vol. 8, no. 1, pp. 37-55, Feb. 1990.