Synthesis: An Efficient Implementation of Fundamental Operating System Services - Abstract

6. Fine-Grain Scheduling

The most exciting phrase to hear in science, the
one that heralds new discoveries, is not "Eureka!"
(I found it!) but "That's funny ..."
-- Isaac Asimov

6.1 Scheduling Policies and Mechanisms

There are two parts to scheduling: the policy and the mechanism. The policy determines when a job should run and for how long. The mechanism implements the policy.

Traditional scheduling mechanisms have high overhead that discourages frequent scheduler decision making. Consequently, most scheduling policies try to minimize their actions. We observe that high scheduling and dispatching overhead is a result of implementation, not an inherent property of all scheduling mechanisms. We call scheduling mechanisms fine-grain if their scheduling/dispatching costs are much lower than a typical CPU quantum, for example, context switch overhead of tens of microseconds compared to CPU quanta of milliseconds.

Traditional timesharing scheduling policies use some global property, such as job priority, to reorder the jobs in the ready queue. A scheduling policy is adaptive if the global property is a function of the system state, such as the total amount of CPU consumed by the job. A typical assumption in global scheduling is that all jobs are independent of each other. But in a pipeline of processes, where successive stages are coupled through their input and output, this assumption does not hold. In fact, a global adaptive scheduling algorithm may lower the priority of a CPU-intensive stage, making it the bottleneck and slowing down the whole pipeline.

To make better scheduling decisions for I/O-bound processes, we take into account local information and coupling between jobs in addition to the global properties. We call such scheduling policies fine-grain because they use local information. An example of interesting local information is the amount of data in the job's input queue: if it is empty, dispatching the job will merely block for lack of input. This chapter focuses on the coupling between jobs in a pipeline using as the local information the amount of data in the queues linking the jobs.

Fine-grain scheduling is implemented in the Synthesis operating system. The approach is similar to feedback mechanisms in control systems. We measure the progress of each job and make scheduling decisions based on the measurements. For example, if the job is "too slow," say because its input queue is getting full, we schedule it more often and let it run longer. The measurements and adjustments occur frequently, accurately tracking each job's needs.

The key idea in fine-grain scheduling policy is modeled after the hardware phase locked loop (PLL). A PLL outputs a frequency synchronized with a reference input frequency. Our software analogs of the PLL track a reference stream of interrupts to generate a new stable source of interrupts locked in step. The reference stream can come from a variety of sources, for example an I/O device, such as disk index interrupts that occur once every disk revolution, or the interval timer, such as the interrupt at the end of a CPU quantum. For readers unfamiliar with control systems, the PLL is summarized in Section 6.2.

Fine-grain scheduling would be impractical without fast interrupt processing, fast context switching, and low dispatching overhead. Interrupt handling should be fast, since it is necessary for dispatching another process. Context switching should be cheap, since it occurs often. The scheduling algorithm should be simple, since we want to avoid lengthy searches or calculations for each decision. Chapter 3 already addressed the first two requirements. Section 6.2.3 shows that the scheduling algorithms are simple.

Figure 6.1: PLL Picture

6.2 Principles of Feedback

6.2.1 Hardware Phase Locked Loop

Figure 6.1 shows the block diagram of a PLL. The output of the PLL is an internally generated frequency synchronized to a multiple of the external input frequency. The phase comparator compares the current PLL output frequency, divided by N, to the input frequency. Its output is proportional to the difference in phase (frequency) between its two inputs, and represents an error signal that indicates how to adjust the output to better match the input. The filter receives the signal from the phase comparator and tailors the time-domain response of the loop. It ensures that the output does not respond too quickly to transient changes in the input. The voltage-controlled oscillator (VCO) receives the filtered signal and generates an output frequency proportional to it. The overall loop operates to compensate for variations in the input, so that if the output rate is lower than the input rate, the phase comparator, filter, and oscillator work together to increase the output rate until it matches the input. When the two rates match, the output rate tracks the input rate and the loop is said to be locked to the input rate.

6.2.2 Software Feedback

The Synthesis fine-grain scheduling policies have the same three elements as the hardware PLL. They track the difference between the running rate of a job and the reference frame in a way analogous to the phase comparator. They use a filter to dampen the oscillations in the difference, like the PLL filter. And they re-schedule the running job to minimize its error compared to the reference, in the same way the VCO adjusts the output frequency.

Figure 6.2: Relationship between ILL and FLL

Let us consider a practical example from a disk driver: we would like to know which sector is under the disk head to perform rotational optimization in addition to the usual seek optimizations. This information is not normally available from the disk controller. But by using feedback, we can derive it from the index-interrupt that occurs once per disk revolution, supplied by some ESDI disk controllers. The index-interrupt supplies the input reference. The rate divider, N, is set to the number of sectors per track. An interval timer functions as the VCO and generates periodic interrupts corresponding to the passage of new sectors under the drive head. The phase comparator and filter are algorithms described in Section 6.2.3.
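The sector finder can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the Synthesis driver: the names sector_finder, sf_init, sf_on_index, sf_sector_interval and sf_current_sector are hypothetical, time is assumed to be available in microseconds, and the filter is the simple 7/8 weighted average used later in this chapter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sector finder; none of these names come from Synthesis. */
typedef struct {
    uint32_t sectors;     /* N: sectors per track (the rate divider) */
    uint64_t last_index;  /* time of the latest index interrupt, microseconds */
    uint64_t period;      /* filtered revolution period, microseconds; 0 = unknown */
    int      seen_index;  /* nonzero once one index interrupt has arrived */
} sector_finder;

static void sf_init(sector_finder *sf, uint32_t sectors)
{
    assert(sectors > 0);
    sf->sectors = sectors;
    sf->last_index = 0;
    sf->period = 0;
    sf->seen_index = 0;
}

/* Call from the index-interrupt handler with the current time. */
static void sf_on_index(sector_finder *sf, uint64_t now_us)
{
    if (sf->seen_index) {
        uint64_t measured = now_us - sf->last_index;
        /* The first measurement seeds the estimate; later ones are averaged in. */
        sf->period = sf->period ? (7 * sf->period + measured) / 8 : measured;
    }
    sf->last_index = now_us;
    sf->seen_index = 1;
}

/* Interval at which the timer should fire to get one interrupt per sector. */
static uint64_t sf_sector_interval(const sector_finder *sf)
{
    return sf->period / sf->sectors;
}

/* Estimated sector under the head, or -1 while the period is still unknown. */
static int sf_current_sector(const sector_finder *sf, uint64_t now_us)
{
    if (sf->period == 0)
        return -1;
    uint64_t elapsed = now_us - sf->last_index;
    return (int)((elapsed * sf->sectors / sf->period) % sf->sectors);
}
```

A driver would call sf_on_index from the index-interrupt handler and program the interval timer with sf_sector_interval; the position is computed from the time elapsed since the last index pulse, so rounding does not accumulate across a revolution.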

When we use software to implement the PLL idea, we find more flexibility in measurement and control. Unlike hardware PLLs, which always measure phase differences, software can measure either the frequency of the input (events per second), or the time interval between inputs (seconds per event). Analogously, we can adjust either the frequency of generated interrupts or the intervals between them. Combining the two kinds of measurements with the two kinds of adjustments, we get four kinds of software locked loops. This dissertation looks only at software locked loops that measure and adjust the same variable. We call a software locked loop that measures and adjusts frequency an FLL (frequency locked loop) and a software locked loop that measures and adjusts time intervals an ILL (interval locked loop).

In general, all stable locked loops minimize the error (feedback signal). Concretely, an FLL measures frequency by counting events, so its natural behavior is to maintain the number of events (and thus the frequency) equal to the input. An ILL measures intervals, so its natural behavior is to maintain the interval between consecutive output interrupts equal to the interval between inputs. At first, this seems to be two ways of looking at the same thing. And if the error were always zero, it would be. But when a change in the input happens, there is a period of time when the loop oscillates before it converges to the new output value. During this time, the differences between ILL and FLL show up. An FLL tends to maintain the correct number of events, although the interval between them may vary from the ideal. An ILL tends to maintain the correct interval, even though it might mean losing some events to do so.

This natural behavior can be modified with filters. The overall response of a software locked loop is determined by the kind of filter it uses to transform measurements into adjustments. A low-pass filter makes the FLL output frequency or the ILL output intervals more uniform, less sensitive to transient changes in the input. But it also delays the response to important changes in the input. An integrator filter allows the loop to track linearly changing input without error. Without an integrator, only constant input can be tracked error-free. Two integrators allow the loop to track quadratically changing input without error. But too many integrators tend to make the loop less stable and lengthen the time it takes to converge. A derivative filter improves response to sudden changes in the input, but also makes the loop more prone to noise. Like their hardware analogs, these filters can be combined to improve both the response time and stability of the software locked loop.

6.2.3 FLL Example

Figure 6.3 shows the general algorithm for an FLL that generates a stream of interrupts at four times the rate of a reference stream. The procedure i1 services the reference stream of interrupts, while the procedure i2 services the generated stream. The variable freq holds the frequency of i2 interrupts and is updated whenever i1 or i2 runs. The variable residue keeps track of differences between i1 and i2, serving the role of the phase comparator in a hardware PLL. Each time i1 executes, it adds 4 to residue. Each time i2 executes, it subtracts 1 from residue. The Filter function determines how the residue affects the frequency adjustments.

int residue=0, freq=0;
/* Master (reference frame) */	/* Slave (derived interrupt) */
i1()				i2()
{				{
    residue += 4;		    residue--;
    freq += Filter(residue);	    freq += Filter(residue);
	:				:
	:			     <do work>
     <do work>			        :
	:			    next_time = NOW + 1/freq;
	:			    schedintr(i2, next_time);
    return;			    return;
}				}

Figure 6.3: General FLL

    static int lopass;
    lopass = (7*lopass + x) / 8;
    return lopass;

Figure 6.4: Low-pass Filter

If i2 and i1 were running at the perfect relative rate of 4 to 1, residue would tend to zero and freq would not be changed. But if i2 is slower than 4 times i1, residue becomes positive, increasing the frequency of i2 interrupts. Similarly, if i2 is faster than 4 times i1, i2 will be slowed down. As the difference in relative speeds increases, the correction becomes correspondingly larger. As i1 and i2 approach the exact ratio of 1:4, the difference decreases and we reach the minimum correction with residue being decremented by one and incremented by four, cycling between -2 and +2. Since residue can never converge to zero - only hover around it - the i2 execution frequency will always jitter slightly. In practice, residue would be scaled down by an appropriate factor so that the jitter is negligible.
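The behavior described above can be tried out with a small event-driven simulation of Figure 6.3. This is a sketch under assumptions that are not in the text: the reference rate REF_HZ, the gains KD and KI, the clamps FREQ_MIN and FREQ_MAX, the starting frequency, and every fll_ name are hypothetical, and the filter chosen here (a term on the change in residue plus a small term on residue itself) is just one choice that damps the loop.

```c
#include <assert.h>

#define RATIO     4        /* generated interrupts per reference interrupt */
#define REF_HZ    100.0    /* assumed reference rate */
#define KD        5.0      /* assumed gain on the change in residue */
#define KI        0.0125   /* assumed gain on residue itself */
#define FREQ_MIN  1.0      /* assumed clamp so that 1/freq stays finite */
#define FREQ_MAX  1.0e5

typedef struct {
    long   residue;      /* RATIO * n_ref - n_gen */
    long   old_residue;  /* residue seen by the previous filter call */
    double freq;         /* rate of generated interrupts, Hz */
    long   n_ref, n_gen; /* event counters, kept only for checking */
} fll;

/* Assumed Filter(): derivative term plus a small proportional term. */
static double fll_filter(fll *f)
{
    double out = KD * (double)(f->residue - f->old_residue)
               + KI * (double)f->residue;
    f->old_residue = f->residue;
    return out;
}

static void fll_adjust(fll *f)
{
    f->freq += fll_filter(f);
    if (f->freq < FREQ_MIN) f->freq = FREQ_MIN;
    if (f->freq > FREQ_MAX) f->freq = FREQ_MAX;
}

/* i1: services one reference interrupt. */
static void fll_i1(fll *f)
{
    f->residue += RATIO;
    f->n_ref++;
    fll_adjust(f);
}

/* i2: services one generated interrupt; returns the delay to the next one. */
static double fll_i2(fll *f)
{
    f->residue--;
    f->n_gen++;
    fll_adjust(f);
    return 1.0 / f->freq;
}

/* Run both interrupt streams in simulated time for the given duration. */
static fll fll_simulate(double seconds)
{
    fll f = { 0, 0, 100.0, 0, 0 };   /* start well away from 4 * REF_HZ */
    double next_ref = 1.0 / REF_HZ;
    double next_gen = 1.0 / f.freq;
    for (;;) {
        if (next_ref <= next_gen) {
            if (next_ref > seconds) break;
            fll_i1(&f);
            next_ref = (double)(f.n_ref + 1) / REF_HZ;
        } else {
            if (next_gen > seconds) break;
            next_gen += fll_i2(&f);
        }
    }
    return f;
}
```

Calling fll_simulate(10.0) and inspecting residue and freq shows where the loop settles with these particular gains; other gains or filters will change the settling time and the amount of jitter.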

    static int accum;
    accum = accum + x;
    return accum;

Figure 6.5: Integrator Filter

    static int old_x;
    int dx;
    dx = x - old_x;
    old_x = x;
    return dx;

Figure 6.6: Derivative Filter

Figures 6.4, 6.5, and 6.6 show some simple filters that can be used alone or in combination to improve the responsiveness and stability of the FLL. In particular, the low-pass filter shown in Figure 6.4 helps eliminate the jitter mentioned earlier at the expense of a longer settling time. The variable lopass keeps a "history" of what the most recent residues were. Each update adds 1/8 of the new residue to 7/8 of the old lopass. This has the effect of taking a weighted average of recent residues. When residue is positive for many iterations, as is the case when i2 is too slow, lopass will eventually approach residue. But if residue oscillates rapidly, as in the situation described in the previous paragraph, lopass will go to zero. The derivative filter is never used alone, but can be used in combination with other filters to improve response to rapidly-changing inputs.
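For experimentation, the three filter bodies can be wrapped as functions that keep their state in a structure rather than in static variables, so that each locked loop owns its own filter. This is a sketch: the names filter_state, lowpass, integrate, derivative and lowpass_plus_derivative are hypothetical, and the combination shown is only one of many possible, not one prescribed by the text.

```c
#include <assert.h>

/* Per-loop filter state, replacing the static variables of the figures. */
typedef struct {
    int lopass;  /* state of the low-pass filter (Figure 6.4) */
    int accum;   /* state of the integrator (Figure 6.5) */
    int old_x;   /* state of the derivative filter (Figure 6.6) */
} filter_state;

/* Weighted average: 7/8 of the old value plus 1/8 of the new input. */
static int lowpass(filter_state *s, int x)
{
    assert(s != 0);
    s->lopass = (7 * s->lopass + x) / 8;
    return s->lopass;
}

/* Running sum of all inputs seen so far. */
static int integrate(filter_state *s, int x)
{
    s->accum += x;
    return s->accum;
}

/* Difference between this input and the previous one. */
static int derivative(filter_state *s, int x)
{
    int dx = x - s->old_x;
    s->old_x = x;
    return dx;
}

/* One assumed combination: a smoothed value plus a boost on sudden change. */
static int lowpass_plus_derivative(filter_state *s, int x)
{
    return lowpass(s, x) + derivative(s, x);
}
```

Feeding lowpass a residue that alternates between +2 and -2 leaves its output at zero, which is the jitter suppression described above, while a sustained positive input drives the output upward.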

6.2.4 Application Domains

We choose between measuring and adjusting frequency and intervals depending on the desired accuracy and application. Accuracy is an important consideration because we can measure only integer quantities: either the number of events (frequency), or the clock ticks between events (interval). We would like to measure the larger quantity of the two since it carries higher accuracy.

Let us consider a scenario that favors ILL. Suppose you have a microsecond-resolution interval timer and the input event occurs about once per second. To make the output interval match the input interval, the ILL measures second-long intervals with a microsecond resolution timer, achieving high accuracy with few events. Consequently, ILL stabilizes very quickly. In contrast, by measuring frequency (counting events), an FLL needs more events to detect and adjust the error signal. Empirically, it takes about 50 input events (in about 50 seconds) for the output to stabilize to within 10% of the desired value.

A second scenario favors FLL. Suppose you have an interval timer with the resolution of one-sixtieth of a second. The input event occurs 30 times a second. Since the FLL is independent of timer resolution, its output will still stabilize to within 10% after seeing about 50 events (in about 1.7 seconds). However, since the event interval is comparable to the resolution of the timer, an ILL will suffer loss of accuracy. In this example, the measured interval will be either 1, 2 or 3 ticks, depending on the relative timing between the clock and input. Thus the ILL's output can have an error of as much as 50%.
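The arithmetic behind the two scenarios can be checked with a short sketch. It assumes that an interval measurement can be off by one timer tick, so the worst-case relative error is one tick divided by the number of ticks per event interval. The function names interval_error and prefer_ill and the tolerance parameter are hypothetical.

```c
#include <assert.h>

/* Worst-case relative error of timing one event interval when the
   measurement may be off by one timer tick (an assumed error model). */
static double interval_error(double event_hz, double timer_hz)
{
    assert(event_hz > 0.0 && timer_hz > 0.0);
    double ticks_per_event = timer_hz / event_hz;
    return 1.0 / ticks_per_event;
}

/* Hypothetical rule of thumb: an ILL is attractive when the interval can be
   measured to within the given tolerance (for example 0.10 for 10%). */
static int prefer_ill(double event_hz, double timer_hz, double tolerance)
{
    return interval_error(event_hz, timer_hz) <= tolerance;
}
```

With a 1 MHz timer and one event per second the error is about one part in a million, so prefer_ill(1.0, 1.0e6, 0.10) holds; with a 60 Hz timer and 30 events per second the error is 0.5, the 50% quoted above, and an FLL is the better choice.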

Generally, slow input rates and high resolution timers favor ILL, while high input rates and low resolution timers favor FLL. Sometimes the problem at hand forces a particular choice. For example, in queue handling procedures, the number of get-queue operations must equal the number of put-queue operations. This forces the use of an FLL, since the actual number of events controls the actions. In another example, the subdivision of a time interval (as in the disk sector finder), an ILL is best.

6.3 Uses of Feedback in Synthesis

We have used feedback-based scheduling policies for a variety of purposes in Synthesis, described in the subsections that follow.

6.3.1 Real-Time Signal Processing

Synthesis uses the FLL idea in its thread scheduler. This enables a pipeline of threads to process high-rate, real-time data streams and simplifies the programming of signal-processing applications. The idea is quite simple: if a thread's input queue is filling or if its output queue is emptying, increase its share of CPU. Conversely, if a thread's input queue is emptying or if its output queue is filling, decrease its share of CPU. The effect of this scheduling policy is to allocate enough CPU to each thread in the pipeline so it can process its data. Threads connected to the high-speed Sound-IO devices find their input queues being filled -- or their output queues being drained -- at a high rate. Consequently, their share of CPU increases until the rate at which they process data equals the rate at which it arrives. As these threads run and produce output, the downstream threads find that their queues start to fill, and they too receive more CPU. As long as the total CPU necessary for the entire pipeline does not exceed 100%, the pipeline runs in real-time.
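The rule can be written down as a single adjustment step. The sketch below is not the Synthesis scheduler: the structure queue_levels, the function adjust_quantum, the gain, and the quantum limits are all assumptions chosen for illustration, and no filter is applied to the error.

```c
#include <assert.h>

#define QUANTUM_MIN_US     50   /* assumed lower bound on a thread's quantum */
#define QUANTUM_MAX_US   5000   /* assumed upper bound */
#define GAIN_US_PER_ITEM   10   /* assumed gain: microseconds per queued item */

/* Fill levels of a thread's input and output queues at one sampling instant. */
typedef struct {
    int in_fill;
    int out_fill;
} queue_levels;

/* Returns the new quantum given two successive samples of the queue levels.
   Input filling or output draining raises the quantum; the opposite lowers it. */
static int adjust_quantum(int quantum_us, queue_levels before, queue_levels after)
{
    assert(quantum_us > 0);
    int in_growth  = after.in_fill  - before.in_fill;
    int out_growth = after.out_fill - before.out_fill;
    int error = in_growth - out_growth;

    quantum_us += GAIN_US_PER_ITEM * error;
    if (quantum_us < QUANTUM_MIN_US) quantum_us = QUANTUM_MIN_US;
    if (quantum_us > QUANTUM_MAX_US) quantum_us = QUANTUM_MAX_US;
    return quantum_us;
}
```

A scheduler built this way would sample the queue levels at each scheduling decision and pass the error through one of the filters of Section 6.2.3 before applying it.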

    char	buf[100];
    int		n, fd1, fd2;

    fd1 = open("/dev/cd", 0);
    fd2 = open("/dev/speaker", 1);
    for(;;) {
	n = read(fd1, buf, 100);
	write(fd2, buf, n);
    }

Figure 6.7: Program to Play a CD

The simplification in applications programming that occurs using this scheduler cannot be overstated. One no longer needs to worry about assigning priorities to jobs, or about carefully crafting the inner loops so that everything is executed frequently enough. For example, in Synthesis, reading from the CD player is no different than reading from any other device or file. Simply open "/dev/cd" and read from it. To listen to the CD player, one could use the program in Figure 6.7. The scheduler FLL keeps the data flowing smoothly at the 44.1 kHz sampling rate -- 176 kilobytes per second for each channel -- regardless of how many CPU-intensive jobs might be executing in the background.

Several music-oriented signal-processing applications have been written for Synthesis and run in real-time using the FLL-based thread scheduler. The Synthesis music and signal-processing toolkit includes many simple programs that take sound input, process it in some way, and produce sound output. These include delay elements, echo and reverberation filters, adjustable low-pass, band-pass and high-pass filters, Fourier transform, and a correlator and feature extraction unit. These programs can be connected together in a pipeline to perform more complex sound processing functions, much as text filters in Unix can be cascaded using the shell's "|" notation. The thread scheduler ensures the pipeline runs in real-time.

6.3.2 Rhythm Tracking and The Automatic Drummer

Besides scheduling, the feedback idea finds use in the actual processing of music signals. In one application, a correlator extracts rhythm pulses from the music on a CD. These are fed to an ILL, which subdivides the beat interval and generates interrupts synchronized to the beat of the music. These interrupts are then used to drive a drum synthesizer, which adds more drum beats to the original music. The interrupts also adjust the delay in the reverberation unit making it equal to the beat interval of the music. You can also get pretty pictures synchronized to the music when you plot the ILL input versus output on a graphics display.

6.3.3 Digital Oversampling Filter

In another music application, an FLL is used to generate the timing information for a digital interpolation filter. A digital interpolator takes as input a stream of sampled data and creates additional samples between the original ones by interpolation. This oversampling increases the accuracy of analog reconstruction of digital signals. We use 4:1 oversampling, i.e. we generate 4 samples using interpolation from each CD sample. The CD player has a new data sample available 44,100 times per second, or one every 22.68 microseconds. The interpolated data output is four times this rate, or one every 5.67 microseconds (footnote 1). We use an FLL to generate an interrupt source at this rate, synchronized with the CD player. This also serves as an example of just how fine-grained the timing can be: an interrupt every 5.67 µs corresponds to over 175,000 interrupts per second.

Footnote 1: This program runs on the Quamachine at 50 MHz clock rate.
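To make the 4:1 ratio concrete, the sketch below generates four output samples per input sample and computes the output period. It uses plain linear interpolation as a stand-in, which is an assumption: the text does not say which interpolation filter Synthesis uses. The names oversample_linear and output_period_us are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define OVERSAMPLE 4   /* output samples per input sample */

/* Linear interpolation between neighboring input samples (a stand-in for
   the real interpolation filter). Writes OVERSAMPLE * (n - 1) samples to
   out and returns that count. */
static size_t oversample_linear(const int *in, size_t n, int *out)
{
    assert(in != NULL && out != NULL);
    size_t k = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        for (int j = 0; j < OVERSAMPLE; j++) {
            out[k++] = in[i] + (in[i + 1] - in[i]) * j / OVERSAMPLE;
        }
    }
    return k;
}

/* Time between output samples, in microseconds, for a given input rate. */
static double output_period_us(double sample_hz)
{
    assert(sample_hz > 0.0);
    return 1.0e6 / (sample_hz * OVERSAMPLE);
}
```

For a 44,100 Hz input, output_period_us gives roughly 5.67 microseconds, which is the interval at which the FLL-generated interrupt must deliver each interpolated sample.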

6.3.4 Discussion

A formal analysis of fine-grain scheduling is beyond the scope of this dissertation. However, I would like to give readers an intuitive feeling about two situations: saturation and cheating. As the CPU becomes saturated, the FLL-based scheduler degrades gracefully. The processes closest to externally generated interrupts (device drivers) will still get the necessary CPU time. The CPU-intensive processes away from I/O interrupts will slow down first, as they should at saturation.

Another potential problem is cheating by consuming resources unnecessarily to increase priority. This is possible because fine-grain scheduling tends to give more CPU to processes that consume more. However, cheating cannot be done easily from within a thread or by cooperation of several threads. First, unnecessary I/O loops within a program do not help the cheater, since they do not speed up data flow in the pipeline of processes. Second, I/O within a group of threads only shifts CPU quanta within the group. A thread that reads from itself gains quanta for input, but loses the same amount in the self-generated output. To increase the priority of a process, it must read from a real input device, such as the CD player. In this case, it is virtually impossible for the OS kernel to distinguish the real I/O from cheating I/O.

6.4 Other Applications

6.4.1 Clocks

The FLL provides integral stability. This means the long-term drift between the reference frame and generated interrupts tends to zero, even though any individual interval may differ from the reference. This is in contrast with differential stability, in which the consecutive intervals are all the same, but any systematic error, no matter how small, will accumulate into a long-term drift. To illustrate, the interval timers found on many machines provide good differential stability: all the intervals are of very nearly the same length. But they do not provide good integral stability: they do not keep good time.

The integral stability property of the FLL lets it increase the resolution of precise timing sources. The idea is to synchronize a higher-resolution but less precise timing device, such as the machine's interval timer, to the precise one. The input to the FLL would be an interrupt derived from a very precise source of timing, for example, from an atomic clock. The output is a new stream of interrupts occurring at some multiple of the input rate.

Suppose the atomic clock ticks once a second. If the FLL's rate divider, N, is set to 1000, then the FLL will subdivide the second-long intervals into milliseconds. The FLL adjusts the interval timer so that each 1/1000-th interrupt occurs as close to the "correct" time of arrival as possible given the resolution of the interval timer, while maintaining integral stability -- N interrupts out for every interrupt in. If the interval timer used exhibits good differential stability, as most interval timers do, the output intervals will be both precise and accurate.

But for this to work well, one must be careful to avoid the accumulation of round-off error when calculating successive intervals. A good rule of thumb to remember is: calculate based on elapsed time, not on intervals. Use differences of elapsed times whenever an interval is required. This is crucial to guaranteeing convergence. The sample FLL in Figure 6.3 follows these guidelines.

To illustrate this, suppose that the hardware interval timer ticks every 0.543 microseconds (footnote 2). Using this timer, a millisecond is 1843.2 ticks long. But when scheduling using intervals, 1843.2 is truncated to 1843 since interrupts can happen only on integer ticks. This gains time. One second later, the FLL will compensate by setting the interval to 1844. But now it loses time. The FLL ends up oscillating between 1843 and 1844, and never converging. Since the errors accumulate all in the same direction for the entire second before the adjustment occurs, the resulting millisecond subdivisions are not very accurate.

Footnote 2: This is a common number on machines that derive timing from the baud-rate generator used in serial communications.

A better way to calculate is this: let the desired interval (1/frequency) be a floating-point number and accumulate intervals into an elapsed-time accumulator using floating-point addition. Interrupts are scheduled by taking the integer part of the elapsed-time accumulator and subtracting from it the integer part of the previous elapsed time to obtain the integer interval. Once convergence is reached, 4 out of 5 interrupts will be scheduled every 1843 ticks and 1 out of 5 every 1844 ticks, evenly interspersed, averaging to 1843.2. Each interrupt will occur as close to the 1-millisecond mark as possible given the resolution of the timer (i.e., they will differ by at most ±0.272 µs). In practice, the same effect can be achieved using appropriately scaled integer arithmetic, and floating point arithmetic would not be used.
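The scaled-integer variant can be sketched as follows. The desired interval is held as a fraction num/den of timer ticks (18432/10 for 1843.2 ticks), and each integer interval is the difference of successive integer parts of the elapsed time. The names subdivider, sub_init and sub_next_interval are hypothetical, and the choice of a denominator of 10 is only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed-time accumulator in scaled integer arithmetic (illustrative names). */
typedef struct {
    uint64_t num, den;     /* desired interval = num / den timer ticks */
    uint64_t elapsed_num;  /* accumulated elapsed time, in units of 1/den tick */
    uint64_t last_tick;    /* integer tick at which the previous interrupt fell */
} subdivider;

static void sub_init(subdivider *s, uint64_t num, uint64_t den)
{
    assert(den > 0);
    s->num = num;
    s->den = den;
    s->elapsed_num = 0;
    s->last_tick = 0;
}

/* Returns the number of whole ticks to wait before the next interrupt.
   The interval is derived from elapsed time, so rounding never accumulates. */
static uint64_t sub_next_interval(subdivider *s)
{
    s->elapsed_num += s->num;
    uint64_t tick = s->elapsed_num / s->den;   /* integer part of elapsed time */
    uint64_t interval = tick - s->last_tick;
    s->last_tick = tick;
    return interval;
}
```

With num = 18432 and den = 10, successive calls return 1843 four times and then 1844, and a thousand calls add up to exactly 1,843,200 ticks, whereas always scheduling 1843 ticks would come up 200 ticks short every second.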

6.4.2 Real-Time Scheduling

The adaptive scheduling strategy might be improved further, possibly encompassing many hard real-time scheduling problems. Hard real-time scheduling is a harder problem than the real-time stream processing problem discussed earlier. In stream processing, each job has a small queue where data can sit if the scheduler makes an occasional mistake. The goal of fine-grain scheduling is to converge to the correct CPU assignments for all the jobs before any of the queues overflow or underflow. In contrast, hard real-time jobs must meet their deadline, every single time. Nevertheless, I believe that the feedback-based scheduling idea will find useful application in this area. In this section, I only outline the general idea, without offering proof or examples. For a good discussion of issues in real-time computing, see [29].

We divide hard-deadline jobs into two categories: the short ones and the long ones. A short job is one that must be completed in a time frame within an order of magnitude of interrupt and context switch overhead. For example, a job taking up to 100 microseconds would be a short job in Synthesis. Short jobs are scheduled as they arrive and run to completion without preemption.

Long jobs take longer than 100 times the overhead of an interrupt and context switch. In Synthesis this includes all the jobs that take more than 1 millisecond, which includes most of the practical applications. The main problem with long jobs is the variance they introduce into scheduling. If we always take the worst scenario, the resulting hardware requirement is usually very expensive and unused most of the time.

To use fine-grain scheduling policies for long jobs, we break down the long job into small strips. For simplicity of analysis we assume each strip to have the same execution time ET. We define the estimated CPU power to finish job J as:

$$\mathrm{Estimate}(J) = \frac{(\text{strips in } J) \times ET}{\mathrm{Deadline}(J) - \mathrm{Now}}$$

For a long job, it is not necessary to know ET exactly since the locked loop "measures" it and continually adjusts the schedule in lock step with the actual execution time. In particular, if Estimate(J) > 1 then we know from the current estimate that J will not make the deadline. If we have two jobs, A and B, with Estimate(A) + Estimate(B) > 1 then we may want to consider aborting the less important one and calling a short emergency routine to recover.
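The estimate and the two checks described above are easy to state in code. The sketch assumes that the strip count refers to the strips still to be executed and that all times share one unit; the function names estimate, will_miss and overloaded are hypothetical.

```c
#include <assert.h>

/* Fraction of the CPU needed to finish the remaining strips by the deadline.
   strips: strips still to run (assumed meaning); et: execution time per strip. */
static double estimate(int strips, double et, double deadline, double now)
{
    assert(deadline > now);
    return (strips * et) / (deadline - now);
}

/* A job needing more than the whole CPU cannot make its deadline. */
static int will_miss(double est)
{
    return est > 1.0;
}

/* Two jobs together needing more than the whole CPU cannot both finish. */
static int overloaded(double est_a, double est_b)
{
    return est_a + est_b > 1.0;
}
```

For example, 10 strips of 1 ms each with a deadline 20 ms away need half the CPU; a second job needing three quarters of the CPU over the same period makes the pair infeasible, which is the point at which the less important job might be aborted.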

Unlike traditional hard-deadline scheduling algorithms, which either guarantee completion or nothing, fine-grain scheduling provides the ability to predict the deadline miss under dynamically changing system loads. I believe this is an important practical concern to real-time application programmers, especially in recovery from faults.

6.4.3 Multiprocessor and Distributed Scheduling

I also believe the adaptiveness of the FLL promises good results in multiprocessor and distributed systems. As in the previous section, I can offer the idea at this writing, but with little support. At the risk of oversimplification, I describe an example with fixed buffer size and execution time. Recognize that at a given load, we can always find the optimal scheduling statically by calculating the best buffer size and CPU quantum. But I emphasize the main advantage of feedback: the ability to dynamically adjust towards the best buffer size and CPU quantum. This is important when we have a variable system load, jobs with variable demands, or a reconfigurable system with a variable number of CPUs.

Figure 6.8: Two Processors, Static Scheduling
[Timing diagram: rows for disk (read, write), P1 (execute) and P2 (execute); time axis in ms, marked every 50 ms up to 500.]

Figure 6.8 shows the static scheduling for a two-processor shared-memory system with a common disk (transfer rate of 2 MByte/second). We assume that both processes access the disk drive at the full transfer rate, e.g. reading and writing entire tracks. Process 1 runs on processor 1 (P1) and process 2 runs on processor 2 (P2). Process 1 reads 100 KByte from the disk into a buffer, takes 100 milliseconds to process them, and writes 100 KByte through a pipe into process 2. Process 2 reads 100 KByte from the pipe, takes another 100 milliseconds to process them, and writes 100 KByte out to disk. In the figure, process 1 starts to read at time 0. All disk activities appear in the bottom row, P1 and P2 show the processor usage, and shaded quadrangles show idle time.

Figure 6.9 shows the fine-grain scheduling mechanism (using FLL) for the same system. We assume that process 1 starts by filling its 100 KByte buffer, but soon after it starts to write to the output pipe, process 2 starts. Both processes run to exhaust the buffer, when process 1 will read from the disk again. After some settling time, depending on the filter used in the locked loop, the stable situation is for the disk to remain continuously active, alternately reading into process 1 and writing from process 2. Both processes will also run continuously, with the smallest buffer that maintains the nominal transfer rate.

The above example illustrates the benefits of fine-grain scheduling policies in parallel processing. In a distributed environment, the analysis is more complicated due to network message overhead and variance. In those situations, calculating statically the optimal scheduling becomes increasingly difficult. We expect the fine-grain scheduling to show increasing usefulness as it adapts to an increasingly complicated environment.

Figure 6.9: Two Processors, Fine-Grain Scheduling
[Timing diagram: rows for disk (r = read, w = write), p1 (ex = execute) and p2 (ex = execute); time axis in ms, marked every 50 ms up to 500.]

Another application of FLL to distributed systems is clock synchronization. Given some precise external clocks, we would like to synchronize the rest of the machines with the reference clocks. Many algorithms have been published, including a recent probabilistic algorithm by Cristian [10]. Instead of specialized algorithms, we use an FLL to synchronize clocks, where the external clock is the reference frame, the message delays introduce the jitter in the input, and we need to find the right combination of filters to adapt the output to the varying message delays. Since an FLL exhibits integral stability, the clocks will tend to synchronize with the reference once they stabilize. We are currently collecting data on the typical message delay distributions and finding the appropriate filters for them.

6.5 Summary

We have generalized scheduling from job assignments as a function of time, to job assignments as a function of any source of interrupts. The generalized scheduling is most useful when we have fine-grain scheduling, which uses frequent state checks and dispatching actions to adapt quickly to system changes. Relevant new applications of the generalized fine-grain scheduling include I/O device management, such as a disk sector interrupt source, and adaptive scheduling, such as real-time scheduling and distributed scheduling.

The implementation of fine-grain scheduling in Synthesis is based on feedback systems, in particular the phase locked loop. Synthesis' fine-grain scheduling policy means adjustments every few hundred microseconds on local information, such as the number of characters waiting in an input queue. Very low overhead scheduling and context switch for dispatching form the foundation of our fine-grain scheduling mechanism. In addition, we have very low overhead interrupt processing to allow frequent checks on the job progress and quick, small adjustments to the scheduling policy.

There are two main advantages of fine-grain scheduling: quick adjustment to changing situations, and early warning of potential deadline misses. Quick adjustments make better use of system resources, since we avoid queue/buffer overflow and other mismatches between the old scheduling policy and the new situation. Early warning of deadline misses allows real-time application programmers to anticipate a disaster and attempt an emergency recovery before the disaster strikes.

We have only started exploring the many possibilities that generalized fine-grain scheduling offers. Distributed applications stand to benefit from the locked loops, since they can track the input interrupt stream despite jitter introduced by message delays. Concrete applications we are studying include load balancing, distributed clock synchronization, smart caching in memory management and real-time scheduling. To give one example, load balancing in a real-time distributed system can benefit greatly from fine-grain scheduling, since we can detect potential deadline misses in advance; if a job is making poor progress towards its deadline locally, it is a good candidate for migration.