This co-runner-dependent performance variability can create several problems. Consider unfair CPU sharing: conventional schedulers enforce fair sharing through time slicing, but on multicore processors time slicing alone is not enough; to achieve fair sharing there, L2 cache allocation must also be considered. This problem is actually more difficult than fair sharing in a
shared-memory multiprocessor, where a thread’s performance similarly depends on
how much of the shared memory it gets. The fundamental difference is that the
operating system can observe and control multiprocessor memory allocation while
L2 cache allocation is completely opaque to the operating system. We present a
new scheduling algorithm, the cache-fair algorithm, that addresses unequal
cache allocation and reduces co-runner-dependent performance variability.
This
algorithm redistributes CPU time to threads to account for unequal cache
sharing: if a thread’s performance decreases due to unequal cache sharing it
gets more time, and vice versa. The challenge in implementing this algorithm is
determining how a thread’s performance is affected by unequal cache sharing
using limited information from the hardware. Our solution uses runtime statistics
and analytical models and does not require new hardware structures or operating
system control over cache allocation. We implemented our algorithm as an
extension to the Solaris 10 operating system and evaluated it on a full-system
hardware simulator of the UltraSPARC.
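The text does not reproduce the paper's analytical model, so the following is only a minimal C++ sketch of the redistribution step it describes. Everything here is a hypothetical placeholder: `ThreadStats`, `next_timeslice_ns`, the `fair_cpi` field (the cycles-per-instruction a thread would achieve under equal cache sharing), the credit bookkeeping, and the one-cycle-per-nanosecond conversion.

```cpp
#include <algorithm>
#include <cstdint>

// Hedged sketch of the cache-fair adjustment described above. The hardware
// counters and the fair-cache model are placeholders: the real algorithm
// derives the "fair" performance from runtime statistics and analytical
// models of L2 cache sharing, which are not reproduced here.
struct ThreadStats {
    uint64_t instructions;  // instructions retired this interval (perf counter)
    uint64_t cycles;        // CPU cycles consumed this interval (perf counter)
    double   fair_cpi;      // modeled cycles-per-instruction under equal cache sharing
    int64_t  credit_ns;     // accumulated CPU-time credit (+) or debit (-)
};

// Convert the gap between actual and "fair" progress into a time-slice
// adjustment for the next scheduling interval.
int64_t next_timeslice_ns(ThreadStats& t, int64_t base_slice_ns) {
    // Cycles the thread should have needed for the same work under fair sharing.
    double fair_cycles = t.fair_cpi * static_cast<double>(t.instructions);

    // Positive gap: the co-runner's cache pressure slowed this thread down,
    // so it earns extra CPU time; negative gap: it gives time back.
    double gap_cycles = static_cast<double>(t.cycles) - fair_cycles;
    t.credit_ns += static_cast<int64_t>(gap_cycles);  // 1 cycle ~ 1 ns, purely for the sketch

    // Pay out (or charge) a bounded fraction of the credit each interval.
    int64_t adjustment = std::clamp<int64_t>(t.credit_ns / 4,
                                             -base_slice_ns / 2, base_slice_ns / 2);
    t.credit_ns -= adjustment;
    return base_slice_ns + adjustment;
}
```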
We
demonstrate that our algorithm significantly reduces co-runner-dependent
performance variability: by at least a factor of two, and by as much as a factor of seven in the cases where the variability is significant. Co-runner-dependent performance variability is the result of unequal cache sharing, so by eliminating it we address the problems that unequal cache sharing causes:
(1) Unfair CPU sharing: With our algorithm, an application
achieves predictable forward progress regardless of its co-runners. The effects
of unfair cache sharing are negated.
(2) Poor priority enforcement: Our algorithm ensures that a
thread makes predictable forward progress regardless of its co-runner.
Therefore, elevating a thread’s priority results in greater forward progress,
and vice versa, just like on conventional processors. Priorities are properly
enforced.
(3) Inaccurate CPU
accounting: Our algorithm reduces the dependency of a thread’s performance on its
co-runner. Charging for CPU hours is appropriate, because the amount of work
accomplished by a thread is proportional to the CPU hours paid for by the
customer and is not affected by the co-runner.
PERFORMANCE OF DATA FLOW COMPUTATIONS SCHEDULED BY WORK-STEALING
A detailed analysis of this work-stealing algorithm is presented in earlier publications; we invite readers interested in the proofs of these results to consult them.
EXTENSION TO THREAD PARTITIONING
Work-stealing scheduling is efficient for multithreaded computations. Nevertheless, for iterative applications in numerical computing, the natural way to describe the computation is with loops. Even though such algorithms can be rewritten recursively to benefit from work-stealing performance, from the point of view of the KAAPI runtime system it is natural to provide facilities that let programmers keep this iterative description of their computations.
In KAAPI, we have designed a method to partition the data flow
graph representing the work executed by a thread. The key point is to compute
a schedule of the tasks from the data flow graph: for each task, the site where it executes; and, for each site, the order in which its tasks run. The computation of a schedule is implemented as a plugin in KAAPI.
Currently we have collected existing graph scheduling libraries, such as DSC and ETF, as well as libraries that partition data-dependency graphs (Metis and Scotch), which are efficient for the mesh-based problems occurring in most scientific numerical simulations. The latter libraries are used in KAAPI to compute the data mapping by partitioning the subgraph of data dependencies extracted from the data flow graph. The execution site of each task is then deduced from the accesses it makes to data in the global memory, following an owner-computes rule.
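As a rough illustration of this placement step, here is a short hypothetical sketch (the `Task` type, `owner_computes` function, and `data_site` map are illustrative stand-ins, not the KAAPI plugin interface). It assumes the partitioner, e.g. Metis or Scotch, has already mapped each global-memory datum to a site; each task is then assigned to the site owning the data it writes, and the per-site order follows the data flow order.

```cpp
#include <map>
#include <vector>

using DataId = int;
using TaskId = int;
using SiteId = int;

struct Task {
    TaskId id;
    std::vector<DataId> writes;  // data the task produces in global memory
    std::vector<DataId> reads;   // data the task consumes
};

// data_site: result of partitioning the data-dependency subgraph.
// Returns, for each site, the ordered list of tasks it must execute.
std::map<SiteId, std::vector<TaskId>>
owner_computes(const std::vector<Task>& dataflow_order,
               const std::map<DataId, SiteId>& data_site) {
    std::map<SiteId, std::vector<TaskId>> schedule;
    for (const Task& t : dataflow_order) {           // tasks in data-flow order
        SiteId site = 0;                              // default site if nothing is written
        if (!t.writes.empty())
            site = data_site.at(t.writes.front());    // the owner of the produced data computes
        schedule[site].push_back(t.id);               // per-site order follows the global order
    }
    return schedule;
}
```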
Once the schedule is computed, the thread is partitioned into k threads that form a group and are distributed among the virtual processors of the processes. Communication between threads is implemented by tasks added to each thread: a broadcast task has read access to data in the global memory and, on execution, sends the value to its destination; a receive task has write access to data in the global memory and terminates upon reception of the value. Moreover, this lets the same work-stealing runtime support manage communication as well as execute the multithreaded computation. Iterative computations reuse the group of threads by restoring their initial states. Redistribution of data occurs between iterations and is managed using our active message communication support.
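A hedged sketch of how such communication tasks might be appended at a partition boundary follows; the `CommTask` type and `add_comm_tasks` helper are illustrative only, not the actual KAAPI task interface.

```cpp
#include <vector>

using DataId = int;
using SiteId = int;

enum class TaskKind { Compute, Broadcast, Receive };

struct CommTask {
    TaskKind kind;
    DataId   data;   // global-memory data the task accesses
    SiteId   peer;   // destination (Broadcast) or source (Receive) thread
};

// For every data item produced on one thread and read on another, append a
// broadcast task (read access) to the producer and a receive task (write
// access) to the consumer. The work-stealing runtime then executes these
// like any other task: the broadcast sends the value through the active
// message layer, and the receive terminates once the value has arrived.
void add_comm_tasks(std::vector<CommTask>& producer_thread,
                    std::vector<CommTask>& consumer_thread,
                    DataId d, SiteId producer, SiteId consumer) {
    producer_thread.push_back({TaskKind::Broadcast, d, consumer});
    consumer_thread.push_back({TaskKind::Receive,   d, producer});
}
```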
WORK-STEALING VERSUS THREAD PARTITIONING SCHEDULING ON MULTIPROCESSORS
In this experiment, we run a virtual simulation application on top of Sofa, comparing several parallel implementations. The objective is to keep computation delays short (around 30 ms) in order to produce a smooth image rate. The first implementation is based on pure work-stealing scheduling, where a main thread generates a few tasks that themselves fork most of the computation. Because the computation delay is so short, the scheduling bottleneck is the access to the main thread by idle processors.
Using the thread partitioning technique, the first level of computation starts with the workload already distributed. The speed is measured on an 8-processor dual-core machine (16 cores) for a sequential computation time of about 30 ms. The iterative simulation is unrolled over either 10 or 20 loop iterations.
CONCLUSIONS
For
hybrid multiprocessors built with multithreaded processors, it is necessary to
consider both levels of parallelism, intra-processor and inter-processor, when designing scheduling policies. This paper introduced new scheduling policies that collect run-time information about hardware metrics related to the sharing of resources between threads.
This
information is used to tune the schedule by educated placement of threads on
execution contexts. Tuning is motivated by the intuition that pairing jobs with
high demands on shared resources with jobs with low demands on shared resources
is likely to improve performance and resource utilization. Thread pairing
decisions on each processor are made based on one of three types of
information: bus transaction rate, cache miss rate, and stall cycle rate per
thread. All these metrics provide insight into the performance and interaction
of co-scheduled threads on a single processor.
These experiments indicate that bus transaction rates are more informative than stall cycle rates, which are in turn more informative than cache misses. The scheduling policies use no sampling or historical accumulation of the performance of specific pairings. They start with a random thread placement that tries to achieve fair sharing of processors and converge progressively to a better placement via educated decisions to migrate threads and alleviate contention.
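To make the pairing idea concrete, here is a small hypothetical C++ sketch, not the processor manager's actual code: it sorts threads by their sampled bus transaction rate (the most informative metric above) and co-schedules the most demanding thread with the least demanding one on each processor. The `SchedThread` type, `bus_rate` field, and `pair_threads` function are assumptions for illustration; counter sampling is omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct SchedThread {
    int    tid;
    double bus_rate;  // bus transactions per interval, from hardware counters
};

// Pair the most bus-hungry thread with the least bus-hungry one, the second
// most with the second least, and so on, so that each processor hosts one
// high-demand and one low-demand co-runner.
std::vector<std::pair<int, int>> pair_threads(std::vector<SchedThread> threads) {
    std::sort(threads.begin(), threads.end(),
              [](const SchedThread& a, const SchedThread& b) {
                  return a.bus_rate > b.bus_rate;   // descending demand
              });
    std::vector<std::pair<int, int>> pairs;
    std::size_t lo = 0, hi = threads.size();
    while (hi > lo + 1) {
        --hi;
        pairs.emplace_back(threads[lo].tid, threads[hi].tid);  // high + low demand
        ++lo;
    }
    return pairs;   // with an odd count, the median-demand thread is left unpaired
}
```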
We used workloads consisting of the NAS Parallel Benchmarks with two different degrees of multiprogramming and combinations of only sequential, only multithreaded, or both sequential and multithreaded jobs. The evaluation focused on assessing the impact of the policies on job performance in the workloads. The policies we introduced achieved up to a 28.7% performance improvement over the native Linux SMP scheduler, outperforming the Linux scheduler in 16 out of 18 cases and a performance-oblivious gang scheduler in 13 out of 18 cases. Our ongoing work addresses the problem of thread pairing for jobs that perform both computation and communication.
Early evaluation of thread pairing heuristics that take into consideration the role of each thread (communication or computation) revealed substantial (over 30%) performance improvements over the Linux scheduler. We are investigating ways to incorporate these heuristics in the processor manager and derive thread characterization automatically from hardware metrics. We also plan to investigate the impact of I/O and memory pressure in our future work. In parallel, we plan to adapt our processor manager to dynamically intercept library calls from third-party and OpenMP libraries, so that our scheduling policies become ubiquitous.
Multithreaded computations may benefit from the description of non-strict data dependencies. In this paper we present the KAAPI approach to implementing efficient multithreaded computations with data flow dependencies built at runtime. It is based on a small extension of the POSIX thread interface and a basic work-stealing algorithm. Efficiency relies on the work-stealing algorithm as well as on an original implementation directed by the work-first principle, which shifts the cost of computing the state of the data flow graph onto the critical-path execution rather than onto the work of the program.
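The following is only an illustrative C++ sketch of that work-first idea under simplifying assumptions (a single lock, placeholder `TaskFrame` and `WorkQueue` types, a stubbed `is_ready` test; not the KAAPI implementation): the owning worker pushes and pops task frames with no data-flow bookkeeping, while the thief pays the cost of deciding readiness from the declared accesses at steal time, i.e. on the critical path.

```cpp
#include <deque>
#include <mutex>

struct TaskFrame { /* task closure + declared accesses to global memory */ };

class WorkQueue {
    std::deque<TaskFrame*> frames_;
    std::mutex lock_;   // kept simple here; a real runtime avoids owner-side
                        // synchronization on the common (work) path
public:
    void push(TaskFrame* f) {          // owner: no data-flow analysis here
        std::lock_guard<std::mutex> g(lock_);
        frames_.push_back(f);
    }
    TaskFrame* pop() {                 // owner: LIFO, depth-first execution
        std::lock_guard<std::mutex> g(lock_);
        if (frames_.empty()) return nullptr;
        TaskFrame* f = frames_.back();
        frames_.pop_back();
        return f;
    }
    TaskFrame* steal() {               // thief: pays the cost of computing the
        std::lock_guard<std::mutex> g(lock_);   // data-flow state during the steal
        for (auto it = frames_.begin(); it != frames_.end(); ++it) {
            if (is_ready(*it)) { TaskFrame* f = *it; frames_.erase(it); return f; }
        }
        return nullptr;
    }
private:
    static bool is_ready(TaskFrame*) { return true; }  // placeholder readiness test
};
```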
Experimental results show the feasibility of the approach. The work overhead is reasonably small, and the work-stealing algorithm reaches good efficiency at medium grain on both a multi-core/multi-processor architecture and a cluster of PCs. Moreover, it scales on a (heterogeneous) grid architecture, as reported by the N-Queens result.
Our thread partitioning approach, using graph scheduling and partitioning algorithms, is dedicated to iterative applications in numerical simulation. It is used in various applications: molecular dynamics (Tuktut) and cloth simulation (Sappe and Sofa). Moreover, our past collaboration on the LinBox library is under active development to produce several parallel algorithms, especially for large sparse matrix-vector products over finite-field arithmetic.
Thanks to its internal data flow representation, KAAPI maintains an abstract representation of the application's execution. This representation has been used to develop original and efficient fault-tolerance protocols, both for work-stealing and for iterative applications, with application to adaptive computation. Ongoing work targets the KAAPI runtime on top of a high-speed network and manages internal data structures with affinity to the physical processors for NUMA architectures.