2. Outline
• Threads implementations
– the one-level model
– the variable-weight processes model
– the two-level model (one kernel thread)
– the two-level model (multiple kernel threads)
– the scheduler-activations model
– performance
CS 167 VI–2 Copyright © 2006 Thomas W. Doeppner. All rights reserved.
3. Implementing Threads
• Components
– chores: the work that is to be done
– processors: the active agents
– threads: the execution context
Before we discuss how threads are implemented, let’s introduce some terms that will be
helpful in discussing implementations. It’s often convenient to separate the notion of the
work that is being done (e.g., the computation that is to be performed) from the notion of
some active agent who is doing the work. So we call the former a chore, and the latter a
processor. Examples of the former include computing the value of π, looking up an entry in a
database, and redrawing a window.
To model systems in sufficient detail to discuss performance, we use a further
abstraction—contexts. When a processor executes code, its registers contain certain values,
some of which define such things as the stack. These values must be loaded into a processor
so that the processor can execute the code associated with handling a chore; conversely, this
information must be saved if the processor is to be switched to executing the code associated
with some other chore. We define a thread (or thread of control) to be this information; in
other words, it is the execution context.
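To make "execution context" concrete, here is a minimal sketch of the state a thread must capture; the field choice is illustrative only and does not match any particular kernel's layout:

```c
/* A sketch of "thread = execution context": the register values that must be
   saved when a processor stops handling one chore and restored when it
   resumes another. Fields are illustrative, not any real kernel's layout. */
typedef struct thread_context {
    unsigned long pc;        /* program counter: where to resume execution */
    unsigned long sp;        /* stack pointer: defines the thread's stack */
    unsigned long regs[16];  /* general-purpose register contents */
} thread_context;
```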
4. Scheduling
• Chores on threads
– event loops
• Threads on processors
– time-division multiplexing
- explicit thread switching
- time slicing
An important aspect of multithreading is scheduling threads on processors. One might
think that the most important aspect is the schedule: when is a particular thread chosen for
execution by a processor? This is certainly not unimportant, but what is perhaps more
crucial is where the scheduling takes place and what sort of context is being scheduled.
The simplest system would be to handle a single chore in the context of a single thread
being executed by a single processor. This would be a rather limited system: it would
perform that chore and nothing else. A simple multiprocessor system would consist of
multiple chores each handled in the context of a separate thread, each being executed by a
separate processor.
More realistic systems handle many chores, both sequentially and concurrently. This
requires some sort of multiplexing mechanism. One might handle multiple chores with a
single thread: this is the approach used in event-handling systems, in which a thread is
used to support an event loop. In response to an event, the thread is assigned an associated
chore (and the processor is directed to handle that chore).
Time-division multiplexing allows chores to be handled concurrently and can be done by
dividing a processor's time among a number of threads. The mechanism for doing this
might be time slicing (the processor is assigned to a thread for a certain amount of time
before being assigned to another thread) or explicit thread switching (code in the chore
releases the processor so that it may switch to another thread). In either case, a
scheduler is employed to determine which threads should be assigned processors.
5. Multiplexing Processors
[Diagram: threads moving among Running, Runnable, and Blocked states; the blocked threads are waiting on keyboard and disk I/O]
To be a bit more precise about scheduling, let's define some more (standard) terms.
Threads are in either a blocked state or a runnable state: in the former they cannot be
assigned a processor, in the latter they can. A scheduler determines which runnable threads
should be assigned processors. Runnable threads that have been assigned processors are
called running threads.
6. One-Level Model
[Diagram: one-level model; threads span the user and kernel levels and are scheduled directly onto processors]
In most systems there are actually two components of the execution context: the user
context and the kernel context. The former is used when a processor is executing user code;
the latter is used when the processor is executing kernel code (on behalf of the chore). How
these contexts are manipulated is one of the more crucial aspects of a threads
implementation.
The conceptually simplest approach is what is known as the one-level model: each thread
consists of both contexts. Thus a thread is scheduled onto a processor, and the processor can
switch back and forth between the two types of contexts. A single scheduler in the kernel can
handle all the multiplexing duties. The threading implementation in Windows is (mostly)
done this way.
7. Variable-Weight Processes
• Variant of one-level model
• Portions of parent process selectively copied
into or shared with child process
• Children created using clone system call
Unlike most other Unix systems, which make a distinction between processes and threads,
allowing multithreaded processes, Linux maintains the one-thread-per-process approach.
However, so that we can have multiple threads sharing an address space, Linux supports the
clone system call, a variant of fork, via which a new process can be created that shares
resources (in particular, its address space) with the parent. The result is a variant of the one-
level model.
This approach is not unique to Linux. It’s used in SGI’s IRIX and was first discussed in
early ’89, when it was known as variable-weight processes. (See “Variable-Weight Processes
with Flexible Shared Resources,” by Z. Aral, J. Bloom, T. Doeppner, I. Gertner, A.
Langerman, G. Schaffer, Proceedings of Winter 1989 USENIX Association Meeting.)
8. Cloning
[Diagram: parent and child processes, with clone's per-resource choice of copy or share: signal info; files (file-descriptor table); FS (root, cwd, umask); virtual memory]
As implemented in Linux, a process may be created with the clone system call (in addition
to using the fork system call). One can specify, for each of the resources shown in the slide,
whether a copy is made for the child or the child shares the resource with the parent. Only
two cases are generally used: everything is copied (equivalent to fork) or everything is shared
(creating what we ordinarily call a thread, though the “thread” has a separate process ID).
9. Linux Threads
(pre 2.6)
[Diagram: the initial thread and the other threads each connected to the manager thread by a pipe]
Building a POSIX-threads implementation on top of Linux’s variable-weight processes
requires some work. What’s discussed here is the approach used prior to Linux 2.6. Some
information about the threads implementation of 2.6 can be found at
http://people.redhat.com/drepper/nptl-design.pdf.
Each thread is, of course, a process; all threads of the same computation share the same
address space, open files, and signal handlers. One might expect that the implementation of
pthread_create would be a simple call to clone. This, unfortunately, wouldn’t allow an easy
implementation of operations such as pthread_join: a Unix process may wait only for its
children to terminate; a POSIX thread can join with any other joinable thread. Furthermore,
if a Unix process terminates, its children are inherited by the init process (process number
1). So that pthread_join can be implemented without undue complexity, a special manager
thread (actually a process) is the parent/creator of all threads other than the initial thread.
This manager thread handles thread (process) termination via the wait4 system call and thus
provides a means for implementing pthread_join. So, when any thread invokes
pthread_create or pthread_join, it sends a request to the manager via a pipe and waits for a
response. The manager handles the request and wakes up the caller when appropriate.
The state of a mutex is represented by a bit. If there are no competitors for locking a
mutex, a thread simply sets the bit with a compare-and-swap instruction (allowing atomic
testing and setting of the mutex’s state bit). If a thread must wait for a mutex to be unlocked,
it blocks using a sigsuspend system call, after queuing itself to a queue headed by the
mutex. A thread unlocking a mutex wakes up the first waiting thread by sending it a Unix
signal (via the kill system call). The wait queue for condition variables is implemented in a
similar fashion.
On multiprocessors, for mutexes that are neither recursive nor error-checking, waiting is
implemented with an adaptive strategy: under the assumption that mutexes are typically not
held for a long period of time, a thread attempting to lock a locked mutex “spins” on it for up
to a short period of time, i.e., it repeatedly tests the state of the mutex in hopes that it will be
unlocked. If the mutex does not become available after the maximum number of tests, then
the thread finally blocks by queuing itself and calling sigsuspend.
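The fast path and the adaptive spin can be sketched as below. This is not the LinuxThreads source: the spin budget is invented, and the real queue-then-sigsuspend blocking is replaced by sched_yield to keep the sketch self-contained.

```c
#include <sched.h>
#include <stdatomic.h>

#define SPIN_MAX 100   /* assumed spin budget; the real value was tuned */

typedef struct { atomic_int locked; } toy_mutex;

static void toy_lock(toy_mutex *m) {
    /* Adaptive phase: repeatedly test the state bit, hoping the holder
       releases the mutex soon. */
    for (int i = 0; i < SPIN_MAX; i++) {
        int expected = 0;
        /* compare-and-swap: atomically set the bit only if it was clear */
        if (atomic_compare_exchange_strong(&m->locked, &expected, 1))
            return;
    }
    /* LinuxThreads would now queue the thread on the mutex and block in
       sigsuspend(); yielding and retrying stands in for that here. */
    for (;;) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&m->locked, &expected, 1))
            return;
        sched_yield();
    }
}

static void toy_unlock(toy_mutex *m) {
    atomic_store(&m->locked, 0);   /* real code would also signal a waiter */
}
```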
10. Two-Level Model
One Kernel Thread
[Diagram: two-level model; user threads multiplexed on a single kernel thread per process, with kernel threads scheduled on the processors]
Another approach, the two-level model, is to represent the two contexts as separate types
of threads: user threads and kernel threads. Kernel threads become "virtual processors" upon
which user threads are scheduled. Thus two schedulers are used: kernel threads are
multiplexed on processors by a kernel scheduler; user threads are multiplexed on kernel
threads by a user-level scheduler. An extreme case of this model is to use only a single
kernel thread per process (perhaps because that is all the operating system supports). The
Unix implementation of the Netscape web browser was based on this model, as were early
Unix threads implementations (recent Solaris versions use the native Solaris threads
implementation). There are two obvious disadvantages to this approach, both resulting from
the restriction of a single kernel thread per process: only one processor can be used at a time
(thus a single process cannot take advantage of a multiprocessor), and if the kernel thread is
blocked (e.g., as part of an I/O operation), no user thread can run.
11. Two-Level Model:
Multiple Kernel Threads
[Diagram: two-level model; user threads multiplexed on multiple kernel threads per process, with kernel threads scheduled on the processors]
A more elaborate use of the two-level model is to allow multiple kernel threads per process.
This deals with both of the disadvantages described above and is the basis of the Solaris
implementation of threading. It has some performance issues; in addition, multiplexing user
threads onto kernel threads is very different from multiplexing threads onto processors:
there is no direct control over when a chore is actually run by a processor. From an
application's perspective, it is sometimes desirable to have direct control over which chores
are currently being run.
12. Scheduler Activations
A third approach, known historically as the scheduler-activations model, is that threads
represent user contexts, with kernel contexts supplied when needed (i.e., not as a kernel
thread, as in the two-level model). User threads are multiplexed on processors by a user-level
scheduler, which communicates to the kernel the number of processors needed (i.e., the
number of ready user threads). The kernel multiplexes entire processes on processors: it
determines how many processors to give each process. This model, which is the basis for the
Digital Unix (now Tru64 UNIX) threading package, certainly gives the user application direct
control over which chores are being run.
To make some sense of this, let’s work through an example. A process starts up,
containing a single user execution context (and user thread) and a kernel execution context
(and kernel thread). Following the dictates of its scheduling policy, the kernel scheduler
assigns a processor to the process. If the kernel thread blocks, the process implicitly
relinquishes the processor to the kernel scheduler, and gets it back once it unblocks.
Suppose that the user program creates a new thread (and its associated user execution
context). If actual parallelism is desired, code in the user-level library notifies the kernel that
two processors are desired. When a processor becomes available, the kernel creates a new
kernel execution context; using the newly available processor running in the new kernel
execution context, it places an upcall (going from system code to user code, unlike a system
call, which goes from user code to system code) to the user-level library, effectively giving it
the processor. The library code then assigns this processor to the new thread and user
execution context.
13. Scheduler Activations
(continued)
The user application might then create another thread. It might also ask for another
processor, but this machine has only two. However, let's say that one of its other threads
(thread 1) blocks on a page fault. The kernel, getting the processor back, creates a new
kernel execution context and places another upcall to our process, telling it two things:
• The thread using kernel execution context 1 has blocked, and thus it has lost its
processor (processor 1).
• Here is processor 1; can you use it?
In our case the process will assign the processor to thread 3. But soon the page being
waited for by thread 1 becomes available. The kernel should notify the process of this event,
but, of course, it requires a processor to do this. So it uses one of the processors already
assigned to the process, the one the process has assigned to thread 2. The process is now
notified of the following two events:
• The thread using kernel execution context 1 has unblocked (i.e., it would be
running, if only it had a processor).
• I’m telling you this using processor 2, which I’ve taken from the thread that was
using kernel execution context 2.
The library now must decide what to do with the processor that has been handed to it. It
could give the processor back to thread 2, leaving thread 1 unblocked but not running in the
kernel; it could continue the suspension of thread 2 and give the processor to thread 1; or it
could decide that both threads 1 and 2 should be running now and thus suspend thread 3,
give its processor to thread 1, and give thread 2 its processor back.
14. Scheduler Activations
(still continued)
At some point the kernel is going to decide that the process has had one or both processors
long enough (e.g., a time slice has expired). So it yanks one of the processors away and,
using the other processor, makes an upcall conveying the following news:
• I’ve taken processor 1.
• I’m telling you this using processor 2.
The library learns that it now has only one processor, but with this knowledge it can assign
the processor to the most deserving thread.
15. Performance
• One-level model
– operations on threads are expensive (require system calls)
– example: mutual exclusion in Windows
- critical section: implemented partly in user mode (success case runs in user code)
- mutex: implemented completely in the kernel
- success case is 20 times faster for the critical section than for the mutex
The one-level model is the most straightforward—unlike the others, there is but a single
scheduler. This scheduler resides in the kernel; hence the bulk of the data structures and
code required to represent and manipulate threads is in the kernel (though not all, as we
discuss below). Thus many thread operations, such as synchronization, thread creation and
destruction, involve calls to kernel code—system calls. Since in most architectures such calls
(from user code) are significantly more expensive than calls to user code, threading
implementations based on the one-level model are prone to high operation costs.
To illustrate the performance penalty incurred when performing a system call, we
measured the costs of performing an operation both in user space and in the kernel in
Windows NT 4.0. The operation, waiting for an object to become unlocked and then locking
it, is performed frequently by numerous applications and has a highly optimized
implementation, especially for the case we exercised, in which the object is not previously
locked. NT provides two constructs for doing this—the critical section and the mutex. The
former is implemented partly in user code and partly in kernel code. If the object in question
(represented by the critical section) is not locked, the critical section operates strictly in user
mode. The mutex is implemented entirely in kernel code: regardless of its state, operations
on it involve system calls. Our measurements show that requests to lock, then unlock a
mutex take twenty times longer than to lock and unlock a critical section when the mutex
and critical section are not already locked. (Operations on the two take the same amount of
time when the mutex and critical section are locked by another thread.)
16. Performance
(continued)
• Two-level model (good news)
– many operations on threads are done strictly
in user code: no system calls
The two-level model makes it possible to eliminate some of the overhead of the one-level
model. Since user threads are multiplexed on kernel threads, the thread that is directly
manipulated by application code is the user thread, implemented entirely in user-level code.
As long as operations on user threads do not involve operations on kernel threads, all
execution takes place at user level and thus the cost of calling kernel code is avoided. The
trick, of course, is to avoid operations on kernel threads.
The user-level library maintains a ready list of runnable user threads. When a running
user thread must block for synchronization, it is put on a wait queue and its kernel thread
switches to running the user thread at the head of the ready list. If the list is empty, the
kernel thread executes a system call to block. When one user thread unblocks another, the
unblocked thread is moved to the end of the ready list. If idle kernel threads are available
(which implies the ready list was empty), a system call is required to wake one up so that it
can run the unblocked user thread. Thus operations on user threads induce (expensive)
operations on kernel threads when there is a surplus of kernel threads.
17. Performance
(still continued)
• Two-level model (not-so-good news)
– if not enough kernel threads, deadlock is
possible
- Solaris automatically creates a new kernel
thread if all are blocked
Runtime decisions must be made about kernel threads: how many should there be? When
should they be created? With the single-kernel-thread version of the model, these questions
are answered trivially; answering them for the general model is an important aspect of the
implementation.
One concern is deadlock. For example, suppose a process has two chores being handled by
two user threads but just one kernel thread. The kernel thread has been assigned to one of
the user threads and is blocked. There is code that could be executed in the other user
thread that would unblock the first, but since no kernel thread is available, this code will
never be executed—both user threads (and the kernel thread) are blocked forever. (This
scenario could happen on a Unix system, for example, if the two user threads are
communicating via a pipe and the first is blocked on a read, waiting for the other to do a
write.)
In Solaris, this problem is prevented with the aid of the operating system. If the OS detects
that all of a process’s kernel threads are blocked, it notifies the user-level threads code,
which creates a new kernel thread if there are user threads that are runnable.
18. Performance
(yet still continued)
• Two-level model (more bad news)
– loss of parallelism if not enough kernel threads
- use pthread_setconcurrency in Solaris
– excessive overhead if too many kernel threads
What if there are not enough kernel threads? As discussed on the previous page, the
Solaris kernel ensures that there are enough kernel threads to prevent deadlock. However,
we might have a situation in which there are two processors and two kernel threads. One
kernel thread is blocked (its user thread is waiting on I/O); the other kernel thread is
running (its user thread is in a compute loop). If there is another runnable user thread, it
won't be able to run until a kernel thread becomes available, even though there is an
available processor. (A new kernel thread won't automatically be created, since that is done
only when all of a process's kernel threads are blocked.) One overcomes this problem in
Solaris by using the pthread_setconcurrency routine to set a lower bound on the number of
kernel threads used by a process.
What if there are too many kernel threads? One result would be that at times there might
be more ready kernel threads than processors, and thus the threads' execution would be
time-sliced. Unless there are vastly too many threads, this would cause no noticeable problems.
However, there are more subtle issues. For example, suppose two user threads are each
handling a separate chore, with synchronization constructs being used to alternate the
threads’ executions to ensure that they are not executed simultaneously. If we use just a
single kernel thread, the synchronization is handled entirely in user space: first one user
thread runs, then joins a wait queue; the kernel thread runs the other user thread, which
soon releases the first user thread and joins a wait queue itself, and so forth. The kernel
thread alternates running the two user threads and execution never enters the kernel.
19. Performance
(notes continued)
Now suppose we add another kernel thread. When one user thread is released from
waiting, since a kernel thread is available, that thread is woken up (via a system call) to run
the waking user thread. When the first user thread subsequently blocks, its kernel thread
has nothing to do and must perform a system call to block in the kernel. Thus each user
thread runs on a separate kernel thread and system calls are required to repeatedly block
and release the kernel threads.
We performed exactly this experiment on Solaris 2.6. Two user threads alternated their
execution one million times, using semaphores for synchronization. The total running time
was 24.6 seconds when one kernel thread was used, but was 68.5 seconds when two kernel
threads were used—a slowdown of almost a factor of three.
20. Performance
(final word)
• Scheduler activations model
– no problems with too few or too many kernel
threads
- (it doesn’t have any)
The scheduler-activations model, since it has no kernel threads at all, clearly has no
problems resulting from having either too many or too few of them. Since threads are
represented entirely in user space, operations on them are relatively cheap. The kernel,
knowing exactly how many ready threads each process has, ensures that no processor is
needlessly idle.