SCHEDULING THREAD ALGORITHMS FOR HYBRID MULTIPROCESSORS

ABSTRACT 

            Multiprocessor clusters are attractive to engineers because they aggregate large amounts of computing power at low cost. Nevertheless, obtaining peak performance on irregular applications, such as computer algebra problems, remains challenging. Memory access delays are non-uniform, and the irregularity of the computations requires scheduling algorithms that automatically balance the workload among the processors. This paper focuses on a runtime support implementation that exploits the computing resources of a multiprocessor cluster with high efficiency. The originality of our approach lies in the implementation of an efficient work-stealing algorithm for macro data flow computations, based on a minor extension of the POSIX thread interface.

INTRODUCTION

            Multithreaded languages have been proposed as a general approach to model dynamic, unstructured parallelism. They include data-parallel languages, macro data flow languages such as Athapascan, and languages with fork-join constructs or additional synchronization primitives. Efficient execution of a multithreaded computation on a parallel computer relies on scheduling the threads among the processors. In work-stealing scheduling, each processor has its own stack, implemented as a deque of frames. When a thread is created, a new frame is initialized and pushed at the bottom of the deque. When a processor becomes idle, it tries to steal work from a victim processor with a non-empty deque: it takes the topmost frame (the least recently pushed one) and places it in its own deque, which is linked to the victim's deque. Such scheduling has been proven efficient for fully strict multithreaded computations, while requiring a memory space bounded with respect to a depth-first sequential execution.

        In Cilk, the implementation is highly optimized and follows the work-first principle, which aims at moving most of the scheduling cost from the common case of the work to the rare case of work-stealing operations. A lock-free algorithm manages access to the deque, allowing good efficiency even for fine-grain applications. Cilk is one of the best-known systems that allow developers to focus on the algorithm rather than the hardware. Nevertheless, Cilk targets shared-memory machines. Although a past version of Cilk, called Cilk-NOW, was designed for networks of workstations, the current highly optimized Cilk cannot run on a multiprocessor cluster.

       The underlying DAG-consistency memory model remains challenging to implement on large-scale clusters. Data flow based languages do not suffer from the penalty of a memory coherence protocol, mainly because data movements between instructions are explicit in the model. Since the 1970s, data flow models have evolved toward multithreaded architectures in which the key points of the original model are conserved in order to hide the latency of (remote) memory accesses.

        The EARTH Threaded-C language exposes two levels of threads (standard threads and fibers), which allow fine control over the effective parallelism. Fibers are executed using a data flow approach: a synchronization unit detects and schedules fibers whose inputs have all been produced. Related work also presents a thread partitioning and scheduling algorithm to statically gather instructions into threads. This paper focuses on the implementation of KAAPI, a runtime support for scheduling irregular macro data flow programs on a cluster of multiprocessors using a work-stealing algorithm. This section presents the high-level and low-level programming interfaces.

        We show how the internal data flow representation is built at runtime with low overhead. Moreover, we explain our lazy evaluation of the readiness properties of the data flow graph, following Cilk's work-first principle. This is similar to hybrid data flow architectures, which group tasks together for sequential execution in order to reduce scheduling overhead. Nevertheless, KAAPI can reconsider groups of tasks at runtime in order to extract more parallelism. This section deals with the runtime system and the scheduling of KAAPI threads on the processors.

      The work-stealing algorithm relies on an extension of the POSIX thread interface that allows work stealing. Next, we present the execution of a data flow graph as an application of the general work-stealing algorithm when threads are structured as lists of tasks. The section ends with the presentation of thread partitioning, which distributes work among processors prior to execution, and of the interaction with the communication subsystem that sends messages over the network. Section 4 reports experiments on multi-core / multiprocessor architectures, clusters, and grids of up to 1,400 processors.

RELATED WORK

        There is a large body of related work, including arranging data structures to maximize locality of reference in the cache (the cache connecting to the DRAM system). The overhead of thread primitive operations is too high to schedule threads so as to reduce first-level cache misses. On the other hand, when static memory reference information is available, the tiling methods used in compilers can take advantage of registers and of the first- and second-level caches.

        These experiments are limited to three address hints, though the scheduling algorithm can use k addresses (k > 3) as hints to schedule threads; the applications in our experiments do not need more than three. The thread package supports only independent, run-to-completion threads. Such an application programming interface can conveniently be used to program asynchronous iterative algorithms. However, it would not be convenient for algorithms with complex dependencies. Methods to specify dependencies, and ways to implement them efficiently, remain to be demonstrated.

      The experiments are also limited: we only experimented with four applications, and we compared our results only with the tiled versions produced by the KAP compiler or the new SGI compiler. It would be more useful to compare against the results of a compiler with more advanced tiling techniques.

         The cache simulations for the R8000 system could only provide memory-reference and cache-miss information. The simulation does not take many important details into account. For example, it works with virtual addresses, whereas the L2 cache uses physical addresses. It does not model the cache hierarchy in detail, so there is no information on CPU stalls due to pressure on the caches. Such a crude analysis can only suggest where the saved time comes from.

        Figuring out exactly where the time goes would require detailed simulations of the R8000 CPU pipelines and its memory hierarchy. Finally, we did not perform cache simulations for the R10000, and thus we cannot analyze the performance results on that machine in detail.
