UNFAIR CPU SCHEDULING

Continued

This co-runner-dependent performance variability can create the following problems:

(1) Unfair CPU sharing: Conventional schedulers implement fair sharing by time slicing the CPU. To achieve fair sharing on multicore processors, however, L2 cache allocation must also be considered, because equal CPU time no longer guarantees equal forward progress. This problem is actually more difficult than fair sharing in a shared-memory multiprocessor, where a thread's performance similarly depends on how much of the shared memory it gets. The fundamental difference is that the operating system can observe and control multiprocessor memory allocation, while L2 cache allocation is completely opaque to the operating system.

(2) Poor priority enforcement: Because a thread's forward progress depends on its co-runners, elevating a thread's priority does not reliably translate into greater forward progress.

(3) Inaccurate CPU accounting: The amount of work a thread accomplishes per CPU hour depends on its co-runners, so charging customers for CPU hours becomes inaccurate.

We present a new scheduling algorithm, the cache-fair algorithm, that addresses unequal cache allocation and reduces co-runner-dependent performance variability.

This algorithm redistributes CPU time to threads to account for unequal cache sharing: if a thread's performance decreases due to unequal cache sharing, it gets more time, and vice versa. The challenge in implementing this algorithm is determining how a thread's performance is affected by unequal cache sharing using limited information from the hardware. Our solution uses runtime statistics and analytical models and does not require new hardware structures or operating system control over cache allocation. We implemented our algorithm as an extension to the Solaris 10 operating system and evaluated it on a full-system hardware simulator of the UltraSPARC.
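
The adjustment logic can be pictured roughly as follows. This is a minimal C++ sketch, under assumed names (ThreadStats, estimate_fair_miss_rate) and a deliberately simplified analytical model; it is not the actual Solaris extension, only an illustration of how hardware-counter statistics and a fair-cache model can be turned into a CPU-time correction.

    #include <cstdint>

    // Per-thread statistics gathered from hardware performance counters over
    // the last scheduling interval.
    struct ThreadStats {
        uint64_t cycles;        // CPU cycles the thread actually received
        uint64_t instructions;  // instructions retired
        uint64_t l2_misses;     // observed L2 cache misses
    };

    // Stand-in for the analytical model: estimate the miss rate the thread
    // would have had if the L2 cache were shared equally among co-runners.
    // The real model is more elaborate; this placeholder only keeps the
    // sketch compilable.
    double estimate_fair_miss_rate(const ThreadStats& s, int co_runners) {
        double observed = double(s.l2_misses) / double(s.instructions);
        return observed / co_runners;   // placeholder, not the paper's model
    }

    // CPI estimate from a miss rate, assuming a fixed memory stall penalty.
    double estimate_cpi(double ideal_cpi, double misses_per_instr, double miss_penalty) {
        return ideal_cpi + misses_per_instr * miss_penalty;
    }

    // CPU-time correction (in cycles) for the next interval: positive if the
    // thread ran slower than it would have under fair cache sharing,
    // negative if it ran faster.
    int64_t cache_fair_adjustment(const ThreadStats& s, int co_runners,
                                  double ideal_cpi, double miss_penalty) {
        if (s.instructions == 0 || co_runners <= 0) return 0;
        double actual_rate = double(s.l2_misses) / double(s.instructions);
        double actual_cpi  = estimate_cpi(ideal_cpi, actual_rate, miss_penalty);
        double fair_cpi    = estimate_cpi(ideal_cpi,
                                          estimate_fair_miss_rate(s, co_runners),
                                          miss_penalty);
        // Instructions the thread would have retired in the same cycles under
        // fair cache sharing, minus what it actually retired, converted back
        // to cycles at the observed CPI.
        double fair_instructions = double(s.cycles) / fair_cpi;
        return int64_t((fair_instructions - double(s.instructions)) * actual_cpi);
    }

A scheduler could then lengthen or shorten the thread's time slice by this amount, which is the redistribution described above; the actual algorithm differs in how the fair miss rate is estimated.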

We demonstrate that our algorithm significantly reduces co-runner-dependent performance variability: by at least a factor of two, and by as much as a factor of seven in the cases where the variability is significant. Co-runner-dependent performance is the result of unequal cache sharing, and by eliminating it, we address the problems caused by unequal cache sharing:

(1) Unfair CPU sharing: With our algorithm, an application achieves predictable forward progress regardless of its co-runners. The effects of unfair cache sharing are negated.

(2) Poor priority enforcement: Our algorithm ensures that a thread makes predictable forward progress regardless of its co-runner. Therefore, elevating a thread’s priority results in greater forward progress, and vice versa, just like on conventional processors. Priorities are properly enforced.

(3) Inaccurate CPU accounting: Our algorithm reduces the dependency of a thread's performance on its co-runner. Charging for CPU hours is appropriate, because the amount of work accomplished by a thread is proportional to the CPU hours paid for by the customer and is not affected by the co-runner.

PERFORMANCE OF DATA FLOW COMPUTATIONS SCHEDULED BY WORK-STEALING

A detailed analysis of this work-stealing algorithm is presented in the referenced publications. We invite readers interested in the proofs of these results to consult those publications.

EXTENSION TO THREAD PARTITIONING

Work-stealing scheduling is efficient for multithreaded computations. Nevertheless, for iterative applications in numerical computing, the natural way to describe the computation is with loops. Even though such algorithms can be rewritten in a recursive style to benefit from work-stealing performance, from the point of view of the KAAPI runtime system it is natural to provide functionality that lets programmers keep their iterative description of the computation.

In KAAPI, we have designed a method to partition the data flow graph representing the work executed by a thread. The key point is to compute a schedule of the tasks from the data flow graph: for each task, the site of execution; and for each site, the order in which its tasks are executed. The computation of a schedule is a plug-in in KAAPI.
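
To make the plug-in idea concrete, here is a minimal C++ interface sketch; the type and method names (DfGraph, Schedule, SchedulingPlugin::compute) are assumptions for illustration and do not reflect the actual KAAPI API.

    #include <cstddef>
    #include <map>
    #include <vector>

    // A task reads and writes data identified by global-memory ids; the edges
    // of the data flow graph are implied by shared ids (true dependencies).
    struct Task { std::size_t id; std::vector<std::size_t> reads, writes; };

    struct DfGraph { std::vector<Task> tasks; };

    // Result of a scheduling plug-in: a site for every task and, for every
    // site, the order in which its tasks must be executed.
    struct Schedule {
        std::map<std::size_t, int>              site_of_task;
        std::map<int, std::vector<std::size_t>> order_on_site;
    };

    // Every scheduling plug-in (DSC, ETF, partitioning-based, ...) would
    // implement this interface; the runtime then partitions the thread
    // according to the schedule it returns.
    struct SchedulingPlugin {
        virtual ~SchedulingPlugin() = default;
        virtual Schedule compute(const DfGraph& g, int nb_sites) = 0;
    };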

Currently we have collected existing graph scheduling libraries, such as DSC and ETF, as well as libraries that partition data dependency graphs (Metis and Scotch), which are efficient for the mesh-based problems occurring in most scientific numerical simulations. The latter libraries are used in KAAPI to compute the data mapping by partitioning the subgraph of data dependencies extracted from the data flow graph. The site of execution of each task is then deduced from the accesses it makes to data in the global memory, following an owner-computes rule strategy.
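
Assuming the partitioner has already assigned each global-memory datum to a site, the owner-computes deduction could look like the following sketch (the Task fields and the tie-breaking choice are illustrative assumptions, not the KAAPI code):

    #include <cstddef>
    #include <map>
    #include <vector>

    struct Task { std::vector<std::size_t> writes; };  // ids of data the task writes

    // Owner-computes rule: a task runs on the site that owns the data it
    // writes. This sketch assumes every task writes at least one datum and,
    // if several owners are involved, simply takes the owner of the first
    // written datum as a tie-break.
    int site_of(const Task& t, const std::map<std::size_t, int>& owner_of_datum) {
        return owner_of_datum.at(t.writes.front());
    }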

Once the schedule is computed, the thread is partitioned into k threads that form a group and are distributed among the virtual processors of the processes. Communication between threads is based on tasks added to each thread: a broadcast task has read access to data in the global memory and, on execution, sends the value to its destination; a receive task has write access to data in the global memory and terminates upon reception of the value. Moreover, this makes it possible to use the same work-stealing runtime support for managing communication as for executing multithreaded computations. Iterative computations reuse the group of threads by restoring their initial states. Redistribution of data occurs between iterations and is managed using our active-message communication support.
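
The two communication tasks can be pictured with the following C++ sketch. The names (GlobalData, am_send) are assumptions and the active-message call is only a stub; this is not the KAAPI implementation, just the access-mode idea.

    #include <cstddef>
    #include <vector>

    // A datum of the shared global memory, identified by an id.
    struct GlobalData { std::size_t id; std::vector<char> bytes; };

    // Stub for the active-message layer that moves values between processes.
    void am_send(int /*dest_site*/, std::size_t /*data_id*/,
                 const void* /*buf*/, std::size_t /*len*/) { /* network send elided */ }

    // Broadcast task: READ access to a datum; on execution it ships the value
    // to the destination site.
    struct BroadcastTask {
        const GlobalData* src;
        int dest_site;
        void execute() const {
            am_send(dest_site, src->id, src->bytes.data(), src->bytes.size());
        }
    };

    // Receive task: WRITE access to a datum; it terminates only when the
    // value has arrived, which then makes the datum's readers ready to run.
    struct ReceiveTask {
        GlobalData* dst;
        bool arrived = false;
        void on_message(const char* buf, std::size_t len) {
            dst->bytes.assign(buf, buf + len);
            arrived = true;
        }
    };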

WORK-STEALING VERSUS THREAD PARTITIONING SCHEDULING ON MULTIPROCESSORS

In this experiment, we run a virtual simulation application on top of Sofa, comparing several parallel implementations. The objective is to keep computation delays short (around 30 ms) in order to produce a smooth frame rate. The first implementation is based on pure work-stealing scheduling, where a main thread generates a few tasks that themselves fork most of the computation. Due to the short delay of the computation, the bottleneck in the scheduling is the access to the main thread by idle processors.

Using the thread partitioning technique, the first level of computation starts with the workload already distributed. The speedup is measured on an 8-processor dual-core machine (16 cores) for a sequential computation time of about 30 ms. The iterative simulation is unrolled over either 10 or 20 loops.

CONCLUSIONS

For hybrid multiprocessors built with multithreaded processors, it is necessary to consider both levels of parallelism, intra-processor and inter-processor, when designing scheduling policies. This paper introduced new scheduling policies that collect, at run time, information about hardware metrics related to the sharing of resources between threads.

            This information is used to tune the schedule by educated placement of threads on execution contexts. Tuning is motivated by the intuition that pairing jobs with high demands on shared resources with jobs with low demands on shared resources is likely to improve performance and resource utilization. Thread pairing decisions on each processor are made based on one of three types of information: bus transaction rate, cache miss rate, and stall cycle rate per thread. All these metrics provide insight into the performance and interaction of co-scheduled threads on a single processor.
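
As a rough illustration of that intuition, the following C++ sketch pairs the most bandwidth-hungry runnable thread with the least demanding one, using the bus transaction rate as the metric. The data structure is an assumption, and the actual policies described next converge toward such pairings incrementally through thread migrations rather than by a global sort.

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Per-thread metric sampled from hardware performance counters.
    struct ThreadInfo { int tid; double bus_transactions_per_cycle; };

    // Pair high-demand threads with low-demand ones so that no processor
    // hosts two bandwidth-hungry threads on its execution contexts at once.
    // An odd thread out is simply left unpaired in this sketch.
    std::vector<std::pair<int, int>> pair_threads(std::vector<ThreadInfo> ts) {
        std::vector<std::pair<int, int>> pairs;
        if (ts.size() < 2) return pairs;
        std::sort(ts.begin(), ts.end(),
                  [](const ThreadInfo& a, const ThreadInfo& b) {
                      return a.bus_transactions_per_cycle < b.bus_transactions_per_cycle;
                  });
        for (std::size_t lo = 0, hi = ts.size() - 1; lo < hi; ++lo, --hi)
            pairs.emplace_back(ts[lo].tid, ts[hi].tid);
        return pairs;
    }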

            These experiments indicate that bus transaction rates are more informative than stall cycle rates, which are in turn more informative than cache misses. The scheduling policies use no sampling or historical accumulation of the performance of specific pairings. They start with a random thread placement that tries to achieve fair sharing of processors and converge progressively to a better placement via educated decisions to migrate threads and alleviate contention. 

We used workloads consisting of the NAS Parallel Benchmarks with two different degrees of multiprogramming and combinations of only sequential, only multithreaded, or both sequential and multithreaded jobs. The evaluation focused on assessing the impact of the policies on job performance in these workloads. The policies we introduced achieved up to a 28.7% performance improvement over the native Linux SMP scheduler; they outperformed the Linux scheduler in 16 out of 18 cases and a performance-oblivious gang scheduler in 13 out of 18 cases. Our ongoing work addresses the problem of thread pairing for jobs that use both computation and communication.

        Early evaluation of thread pairing heuristics that take into consideration the role of each thread (communication or computation) revealed substantial (over 30%) performance improvements over the Linux scheduler. We are investigating ways to incorporate these heuristics in the processor manager and derive thread characterization automatically from hardware metrics. We also plan to investigate the impact of I/O and memory pressure in our future work. In parallel, we plan to adapt our processor manager to dynamically intercept library calls from third-party and OpenMP libraries, so that our scheduling policies become ubiquitous.

Multithreaded computations may benefit from the description of non-strict data dependencies. In this paper we present the KAAPI approach to implementing efficient multithreaded computations with data flow dependencies built at runtime. It is based on a small extension of the POSIX thread interface and a basic work-stealing algorithm. Efficiency relies on the work-stealing algorithm as well as on an original implementation, guided by the work-first principle, which shifts the cost of computing the state of the data flow graph onto the critical-path execution rather than onto the work of the program.
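
To illustrate the work-first principle in a work-stealing setting, here is a compact C++ sketch (not the KAAPI implementation; the mutex-based locking and the omitted readiness analysis are simplifying assumptions): the owner's push/pop path stays cheap, while the thief pays the extra cost of evaluating the victim's data flow state when it steals.

    #include <deque>
    #include <mutex>
    #include <optional>

    struct Task { /* closure plus its accesses to global-memory data */ };

    class WorkQueue {
        std::deque<Task> tasks_;
        std::mutex lock_;        // a real runtime would use a lock-free protocol
    public:
        // Owner side (the "work" path): executed for every task, so it must
        // stay as cheap as possible.
        void push(Task t) {
            std::lock_guard<std::mutex> g(lock_);
            tasks_.push_back(std::move(t));
        }
        std::optional<Task> pop() {
            std::lock_guard<std::mutex> g(lock_);
            if (tasks_.empty()) return std::nullopt;
            Task t = std::move(tasks_.back());
            tasks_.pop_back();
            return t;
        }
        // Thief side (the critical path): steals from the opposite end; this
        // is where the expensive computation of the data flow graph state
        // would be performed.
        std::optional<Task> steal() {
            std::lock_guard<std::mutex> g(lock_);
            if (tasks_.empty()) return std::nullopt;
            Task t = std::move(tasks_.front());   // readiness analysis omitted
            tasks_.pop_front();
            return t;
        }
    };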

Experimental results show the feasibility of the approach. The work overhead is reasonably small, and the work-stealing algorithm reaches good efficiency for medium-grain computations, both on a multi-core / multi-processor architecture and on a cluster of PCs. Moreover, it scales on (heterogeneous) grid architectures, as reported by the N-Queens result.

Our thread partitioning approach, using graph scheduling and partitioning algorithms, is dedicated to iterative applications in numerical simulation. It is used in various applications: the molecular dynamics code Tuktut, the cloth simulation Sappe, and Sofa. Moreover, our past collaboration around the LinBox library is actively being developed to produce several parallel algorithms, especially for large sparse matrix-vector products over finite-field arithmetic.

Thanks to the internal data flow representation, KAAPI maintains an abstract representation of the application's execution. This representation has been used to develop original and efficient fault-tolerance protocols for work-stealing and for iterative applications, with application to adaptive computation. Ongoing work targets the KAAPI runtime on top of a high-speed network and manages internal data structures with affinity to physical processors for NUMA architectures.
