ABSTRACT
The high availability of multiprocessor clusters makes them very attractive to engineers because, at a first level, such computers aggregate high performance. Nevertheless, obtaining peak performance on irregular applications, such as computer algebra problems, remains a challenging problem. The delay to access memory is non-uniform, and the irregularity of computations requires scheduling algorithms to automatically balance the workload among the processors. This paper focuses on a runtime support implementation that exploits the computation resources of a multiprocessor cluster with high efficiency. The originality of our approach lies in the implementation of an efficient work-stealing algorithm for a macro data-flow computation, based on a minor extension of the POSIX thread interface.
INTRODUCTION
Multithreaded languages have been proposed as a general approach to model dynamic, unstructured parallelism. They include data-parallel languages, macro data-flow languages such as Athapascan, and languages with fork-join constructs or with additional synchronization primitives. Efficient execution of a multithreaded computation on a parallel computer relies on the scheduling of the threads among the processors. In work-stealing scheduling, each processor has its own stack, implemented as a deque of frames. When a thread is created, a new frame is initialized and pushed at the bottom of the deque. When a processor becomes idle, it tries to steal work from a victim processor with a non-empty deque: it takes the topmost frame (the least recently pushed one) and places it in its own deque, which is linked to the victim's deque. Such scheduling has been proven efficient for fully strict multithreaded computations, while requiring bounded memory space with respect to a depth-first sequential execution.
In Cilk, the implementation is highly optimized and follows the work-first principle, which aims at moving most of the scheduling cost from the common work of the execution to the rare work-stealing operations. A lock-free algorithm manages access to the deque, achieving good efficiency even for fine-grain applications. Cilk is one of the best known systems that allow developers to focus on the algorithm rather than the hardware. Nevertheless, Cilk targets shared memory machines. Although a past version of Cilk, called Cilk-NOW, was designed for networks of workstations, the current highly optimized Cilk cannot run on a multiprocessor cluster.
The underlying DAG-consistency memory model remains challenging to implement on large scale clusters. Data-flow based languages do not suffer from the penalty of a memory coherence protocol, mainly because data movement between instructions is explicit in the model. Since the 70s, data-flow models have evolved toward multithreaded architectures in which the key points of the original model are preserved in order to hide the latency of (remote) memory accesses.
The EARTH Threaded-C language exposes two levels of threads (standard threads and fiber threads) that allow fine control over the effective parallelism. Fibers are executed using a data-flow approach: a synchronization unit detects and schedules fibers that have all their inputs produced. In related work, the authors present a thread partitioning and scheduling algorithm that statically gathers instructions into threads. This paper focuses on the implementation of KAAPI, a runtime support for scheduling irregular macro data-flow programs on a cluster of multiprocessors using a work-stealing algorithm. This section presents the high level and low level programming interfaces.
We show how the internal data-flow representation is built at runtime with low overhead. Moreover, we explain our lazy evaluation of the readiness properties of the data-flow graph, following the work-first principle of Cilk. This is similar to hybrid data-flow architectures, which group tasks together for sequential execution in order to reduce scheduling overhead. Nevertheless, KAAPI can reconsider a group of tasks at runtime in order to extract more parallelism. This section deals with the runtime system and the scheduling of KAAPI threads on the processors.
The work-stealing algorithm relies on an extension of the POSIX thread interface that allows work stealing. Next, we present the execution of a data-flow graph as an application of the general work-stealing algorithm when threads are structured as lists of tasks. This section ends with the presentation of thread partitioning, which distributes work among processors prior to execution. The interaction with the communication sub-system that sends messages over the network concludes this section. Section 4 reports experiments on multi-core / multi-processor architectures, clusters, and grids with up to 1400 processors.
RELATED WORK
There is a large body of related work, including arranging data structures to maximize locality of reference in the cache (the cache connected to the DRAM system). The overhead of thread primitive operations is too high to schedule threads so as to reduce first-level cache misses. On the other hand, when static memory reference information is available, the tiling methods used in compilers can take advantage of registers, the first-level cache, and the second-level cache.
These experiments are limited to 3 address hints, though the scheduling algorithm can use k (where k > 3) addresses as hints to schedule threads. The applications in our experiments also do not need more than three addresses. The thread package supports only independent, "run-to-completion" threads. Such an application programming interface can be conveniently used to program asynchronous iterative algorithms. However, it would not be convenient for programming algorithms with complex dependencies. Methods to specify dependencies, and ways to implement them efficiently, remain to be demonstrated.
The experiments are also limited: we only experimented with four applications, and we compared our results only with the tiled versions produced by the KAP compiler or the new SGI compiler. It would be more useful to compare our results with those from a compiler with more advanced tiling techniques.
The cache simulations for the R8000 system could only provide us with memory reference and cache miss information. The simulation does not take many important details into account. For example, it works with virtual addresses, whereas the L2 cache uses physical addresses. It does not model the cache hierarchy in detail, so there is no information on CPU stalls due to pressure on the caches. Crude analysis can only suggest where the saved time comes from.
To figure out exactly where the time goes would require detailed simulations of the R8000 CPU pipelines and its memory hierarchy. Finally, we did not perform cache simulations for the R10000, and thus we are unable to analyze the performance results on that machine in detail.
