THREAD SCHEDULING ALGORITHMS AND CACHE-FAIR SCHEDULING

The overhead of executing a program with a data-flow representation is mainly due to the cost of creating closure and access objects. If the transition fails because the initial state is stolen (the closure has been taken by a thief), then the running thread suspends its execution until the state of the closure becomes terminated. Then the scheduler thread is activated.
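The state transitions described here can be pictured with a small sketch. The following is a minimal illustration assuming C11 atomics; the enum values, struct fields, and helper name are hypothetical, not the actual KAAPI data structures.

#include <stdatomic.h>
#include <sched.h>

typedef enum { CL_CREATED, CL_RUNNING, CL_STOLEN, CL_TERMINATED } closure_state_t;

typedef struct closure {
    _Atomic closure_state_t state;
    struct access          *accesses;   /* list of this closure's accesses */
    /* ... entry point, arguments, link in the owner's deque ... */
} closure_t;

/* Try to move a closure from created to running. If the transition fails
 * because the closure has been stolen, the owner suspends (here: yields)
 * until the thief marks it terminated; the scheduler thread runs next. */
static int try_execute(closure_t *c)
{
    closure_state_t expected = CL_CREATED;
    if (atomic_compare_exchange_strong(&c->state, &expected, CL_RUNNING))
        return 1;                         /* transition succeeded: run the closure */

    if (expected == CL_STOLEN) {
        while (atomic_load(&c->state) != CL_TERMINATED)
            sched_yield();                /* wait for the thief's completion signal */
    }
    return 0;                             /* activate the scheduler thread */
}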

If this deque is empty, the thread terminates and the virtual processor switches to the scheduler thread, as explained in the previous section. The steal function iterates through the bottom of the victim thread's deque and tries to steal the first ready-to-execute closure that is not marked running. For each closure, the algorithm updates the state of all its accesses: if an access is marked ready and its immediate successor is an access with a concurrent mode of access to the global memory, then the successor is also set to ready.

If the mode of an access is write, the access is set to ready. When all accesses have been visited and readiness has been propagated to successors, the closure is ready if all of its accesses are ready. The steal function is called by the scheduler thread in case of inactivity. Once a ready victim closure is detected, the steal function tries to change its state from created to stolen.
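The readiness rules and the steal attempt from the last two paragraphs can be sketched as follows, reusing the hypothetical closure_t type from the sketch above; the access_t layout and the function names are likewise assumptions, not the KAAPI code.

typedef enum { MODE_READ, MODE_WRITE, MODE_CONCURRENT } access_mode_t;

typedef struct access {
    access_mode_t  mode;
    int            ready;        /* set once the data this access needs is available */
    struct access *successor;    /* next access on the same shared object */
    struct access *next;         /* next access of the same closure */
} access_t;

/* Walk a closure's accesses, marking write accesses ready and propagating
 * readiness to concurrent successors; the closure itself is ready only if
 * every access ends up ready. */
static int closure_is_ready(closure_t *c)
{
    int all_ready = 1;
    for (access_t *a = c->accesses; a != NULL; a = a->next) {
        if (a->mode == MODE_WRITE)
            a->ready = 1;                         /* write accesses are ready */
        if (a->ready && a->successor && a->successor->mode == MODE_CONCURRENT)
            a->successor->ready = 1;              /* concurrent successor may proceed */
        if (!a->ready)
            all_ready = 0;
    }
    return all_ready;
}

/* Attempt the created -> stolen transition on a ready victim closure. */
static closure_t *try_steal(closure_t *victim)
{
    closure_state_t expected = CL_CREATED;
    if (closure_is_ready(victim) &&
        atomic_compare_exchange_strong(&victim->state, &expected, CL_STOLEN))
        return victim;                            /* thief won the race */
    return NULL;                                  /* not ready, or owner started it */
}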

If the transition is successful, a new KAAPI thread is created with the same attributes as the victim thread. The new thread contains an initial frame with a copy of the closure, in the state created, and a signalization closure whose purpose is (1) to transfer the output data back to the victim closure and (2) to signal the completion of the victim closure, i.e. to move its state from stolen to terminated. Figure 4 illustrates the stacks of both the victim thread and the thief thread after a successful attempt to steal a closure. The signalization closure is built with an exclusive access linked after each access of the copied closure. Because a thread waits for the completion of its closures, the global computation is terminated when the main thread finishes executing all of its closures. Thus, global termination detection does not need a dedicated distributed algorithm; it is implicit in our work-stealing implementation.
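As a rough illustration of the signalization step, again reusing the hypothetical types from the sketches above: the frame layout and the copy-back helper are assumptions, and the actual mechanism described above chains an exclusive access after each access of the copied closure rather than calling a function directly.

/* Hypothetical layout of the thief's initial frame. */
typedef struct thief_frame {
    closure_t *copy;     /* copy of the stolen closure, in state created */
    closure_t *victim;   /* the stolen closure left in the victim's stack */
} thief_frame_t;

void copy_outputs(closure_t *from, closure_t *to);   /* hypothetical copy-back helper */

/* Body of the signalization closure: (1) transfer the output data back to the
 * victim closure, then (2) signal its completion (stolen -> terminated), which
 * wakes the suspended owner. */
static void signalization_body(thief_frame_t *f)
{
    copy_outputs(f->copy, f->victim);
    atomic_store(&f->victim->state, CL_TERMINATED);
}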

THREAD PACKAGE

This section describes a thread package for data-locality scheduling. The thread package consists of a thread scheduler that conforms to the algorithm described in the previous section, with primitive operations optimized to support fine-grained, user-level threads. In principle, a general-purpose thread package could be adapted to schedule threads according to memory access hints.

While significant variation exists among current thread packages, they all include support for synchronization, context switching, and saving of thread state. Many thread packages include additional features, such as preemptive scheduling, asynchronous signaling mechanisms, and private data areas; such features can be valuable in some contexts, expanding the class of problems for which a package is applicable.

However, our design for locality scheduling keeps the thread package simple, making low overhead the most important goal. We have implemented a special-purpose thread system for locality scheduling that features a minimal user interface and very low overhead. Because our threads always run to completion and have no support for blocking and resuming execution, the system is implemented without resorting to assembly language. In fact, the essential functionality of the library is embodied in about 525 lines of C, including both C- and Fortran-callable interfaces.
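To give a feel for how small such an interface can be, here is a hedged sketch of what the C-callable side could look like; the names, the argument convention, and the form of the locality hint are assumptions based on the description in this section, not the package's actual interface.

/* A run-to-completion thread is just a function plus two user arguments. */
typedef void (*thread_func_t)(void *arg1, void *arg2);

/* Create a thread that will run func(arg1, arg2) to completion; the hint
 * identifies the 3D data block the thread will mostly touch and is used to
 * place the thread into a bin. */
void thread_fork(thread_func_t func, void *arg1, void *arg2,
                 int hint_x, int hint_y, int hint_z);

/* Run all forked threads, bin by bin, until none remain. */
void thread_run_all(void);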

The package's small size does not imply, however, that it is completely free of machine and configuration dependencies. In particular, as explained above, the cache size is an important parameter of the scheduling algorithm. Our thread package implements the scheduling algorithm for the three-dimensional case, although it is straightforward to extend it to higher-dimensional cases. It uses three-dimensional blocks to determine data locality and to place threads into bins.
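The mapping from a thread's three-dimensional hint to a bin could look roughly like the following; the block-edge derivation from the cache size and all names are illustrative assumptions.

#include <stddef.h>

typedef struct {
    size_t cache_size;   /* cache size in bytes: the key machine parameter */
    size_t elem_size;    /* bytes per array element */
    int    block_edge;   /* elements per block edge, chosen so one block fits in cache */
    int    nx, ny, nz;   /* number of blocks along each dimension */
} locality_map_t;

/* Linearize the 3D block containing point (x, y, z) into a bin index. */
static int bin_index(const locality_map_t *m, int x, int y, int z)
{
    int bx = x / m->block_edge;
    int by = y / m->block_edge;
    int bz = z / m->block_edge;
    return (bz * m->ny + by) * m->nx + bx;
}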

IMPLEMENTATION

The thread package is implemented with four basic data structures: the thread group, the bin, the hash table, and the ready list.

The thread group data structure represents a number of threads within a bin; grouping threads together in this way amortizes the cost of thread structure management. Since the thread package supports only non-preemptive scheduling, the data required to represent each thread is very simple: a void function pointer and the two arguments arg1 and arg2 supplied by the user to thread fork. The thread group consists of an array of these structures, plus an integer counting the number of threads actually in the group and a pointer to the next thread group in the bin.
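Translating that description directly into C gives something like the following; the group size and the field names are assumptions.

#define THREADS_PER_GROUP 64   /* assumed group size; amortizes allocation cost */

/* One non-preemptive thread: a function pointer plus the two user arguments. */
typedef struct thread_rec {
    void (*func)(void *arg1, void *arg2);
    void  *arg1;
    void  *arg2;
} thread_rec_t;

/* A thread group: an array of thread records, a count of how many slots are
 * used, and a link to the next group in the same bin. */
typedef struct thread_group {
    thread_rec_t         threads[THREADS_PER_GROUP];
    int                  count;
    struct thread_group *next;
} thread_group_t;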

            Recall that the cache-fair scheduler enforces an existing scheduling policy. We implemented our cache-fair scheduler in the Solaris 10 operating system on top of its fixed-priority scheduling policy (each thread has a fixed priority). 

Like conventional schedulers, our scheduler runs on every system clock tick for each running thread. It performs the housekeeping needed by the baseline fixed-priority policy for that thread, and it also performs cache-fair-specific processing, depending on the phase the thread is in.
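A sketch of that per-tick hook is given below, under assumed names; only the two phase names come from the description that follows, and the helper functions stand in for the baseline fixed-priority housekeeping and the cache-fair-specific work.

typedef enum { PHASE_RECONNAISSANCE, PHASE_CALIBRATION } cf_phase_t;

typedef struct cf_thread {
    cf_phase_t phase;
    /* ... per-thread bookkeeping for the baseline policy and the model ... */
} cf_thread_t;

/* Hypothetical helpers: baseline-policy housekeeping and phase-specific work. */
void fixed_priority_tick(cf_thread_t *t);
void cf_reconnaissance_tick(cf_thread_t *t);
void cf_calibration_tick(cf_thread_t *t);

/* Called on every system clock tick for each running thread. */
void cf_clock_tick(cf_thread_t *t)
{
    fixed_priority_tick(t);               /* housekeeping for the baseline policy */

    switch (t->phase) {                   /* cache-fair-specific processing */
    case PHASE_RECONNAISSANCE:
        cf_reconnaissance_tick(t);
        break;
    case PHASE_CALIBRATION:
        cf_calibration_tick(t);
        break;
    }
}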

Each cache-fair thread goes through two phases: the reconnaissance phase and the calibration phase. In this section, we describe how the phases are implemented, how often they are repeated and, in particular, how we obtain the inputs for the analytical models used inside the algorithm. Most inputs are obtained using hardware performance counters, which allow measuring thread-specific runtime statistics. Such counters are typically available on modern processors, so our implementation is not limited to a specific hardware architecture.
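How the counter inputs might be sampled is sketched below; read_hw_counter() and the counter identifiers are placeholders for whatever counter interface the platform provides, not a real API.

#include <stdint.h>

enum { CTR_INSTR_RETIRED, CTR_L2_MISSES, CTR_CYCLES };

uint64_t read_hw_counter(int which);    /* placeholder for the platform's counter read */

typedef struct cf_sample {
    uint64_t instructions;
    uint64_t l2_misses;
    uint64_t cycles;
} cf_sample_t;

/* Per-tick counter deltas for the current thread; the deltas are the
 * thread-specific runtime statistics fed to the analytical models. */
static void cf_sample_tick(cf_sample_t *prev, cf_sample_t *delta)
{
    cf_sample_t now = {
        .instructions = read_hw_counter(CTR_INSTR_RETIRED),
        .l2_misses    = read_hw_counter(CTR_L2_MISSES),
        .cycles       = read_hw_counter(CTR_CYCLES),
    };
    delta->instructions = now.instructions - prev->instructions;
    delta->l2_misses    = now.l2_misses    - prev->l2_misses;
    delta->cycles       = now.cycles       - prev->cycles;
    *prev = now;
}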

 CACHE-FAIR THREAD SCHEDULING FOR MULTICORE PROCESSORS

            Conventional CPU schedulers make assumptions that do not apply on multicore processors: they assume that the CPU is a single, indivisible resource and that if threads are granted equal time slices, those threads will share the CPU equally. On multicore processors, concurrently running threads, or co-runners, often share a single second-level (L2) cache, and cache allocation is controlled by the hardware. 

Cache sharing depends solely on the cache needs of the co-runner(s), and unfair cache sharing occurs often. A thread's cache occupancy affects its cache miss rate and, as a result, the rate at which the thread retires instructions. Therefore, a thread's CPU performance varies significantly depending on its co-runner.
