THREAD SCHEDULING ALGORITHMS AND PERFORMANCE MONITORING


Prior research on scheduling algorithms that take into account the interaction between threads or processes sharing a processor is closely related to the work presented in this paper. Several seminal works on cache affinity scheduling used simulation, physical experimentation and analytical models to study the impact of time-sharing on cache performance.

Heuristics that estimate cache affinity by measuring the time between consecutive executions of the same thread on the same processor are used in practically all shared-memory multiprocessor schedulers. Thread scheduling heuristics that use accurate information on cache misses from the hardware have also been proposed for standalone multithreaded programs.
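The following is a minimal sketch of such a timestamp-based affinity heuristic; the structure names, the tick-based timestamps and the affinity window are illustrative assumptions, not taken from any particular scheduler.

    #include <cstdint>
    #include <vector>

    // A thread is assumed to still have useful state in a processor's cache if
    // little time has passed since it last ran there.
    struct ThreadInfo {
        int      last_cpu     = -1;   // processor the thread last executed on
        uint64_t last_run_end = 0;    // tick count when it was last descheduled
    };

    // Assumed "cache still warm" window, in ticks (purely illustrative).
    constexpr uint64_t kAffinityWindow = 2000000;

    // Keep a ready thread on its previous processor while the affinity window has
    // not expired; otherwise fall back to the least loaded processor.
    int pick_cpu(const ThreadInfo& t, uint64_t now, const std::vector<int>& load_per_cpu) {
        if (t.last_cpu >= 0 && now - t.last_run_end < kAffinityWindow)
            return t.last_cpu;                     // affinity likely still pays off
        int best = 0;
        for (size_t c = 1; c < load_per_cpu.size(); ++c)
            if (load_per_cpu[c] < load_per_cpu[best]) best = static_cast<int>(c);
        return best;
    }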

The advent of simultaneous multithreading has stimulated research in all aspects of operating system support for processors using this technique. Symbiotic job scheduling, an idea originally introduced for scheduling on the Tera MTA architecture, has been put forward as a viable way of factoring the implications of sharing hardware resources between threads into the scheduling process. Symbiotic job scheduling relies on sampling the performance of different mixes of co-scheduled jobs, using different types of performance metrics, until a consensus on the best pairing is reached. This scheduling strategy may reach optimality for certain workloads, but it is very sensitive to their dynamic characteristics.

The number of possible thread pairings grows quickly with the number of threads (or single-threaded jobs) in the workload and the number of execution contexts on the processor. The performance samples tend to become obsolete as jobs come and go, or as jobs move through different phases of their computation. The complexity grows even further if one needs to consider multiple hardware metrics to infer the interference in a given workload. Real processors cannot support the simultaneous counting of many hardware events; therefore, statistical sampling and averaging are necessary to gather all the information.
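To make the growth concrete, the short program below counts the distinct ways of splitting an even number of single-threaded jobs into pairs, one pair per 2-context SMT processor; it is only an illustration of the combinatorics, not part of the scheduler discussed here.

    #include <cstdint>
    #include <cstdio>

    // Number of ways to split n jobs (n even) into unordered pairs:
    // (n - 1)!! = (n - 1) * (n - 3) * ... * 3 * 1.
    uint64_t pairings(unsigned n) {
        uint64_t p = 1;
        for (unsigned k = n - 1; k > 1; k -= 2) p *= k;
        return p;
    }

    int main() {
        for (unsigned n = 4; n <= 16; n += 4)
            std::printf("%2u jobs -> %llu distinct pairings\n",
                        n, (unsigned long long)pairings(n));
        // Prints 3, 105, 10395 and 2027025: sampling every pairing becomes
        // impractical quickly, even before jobs change phase or exit.
    }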

This paper takes a different approach. Starting from a schedule selected to reduce interference between processors on the shared bus, we use hardware metrics to converge to a schedule that reduces thread interference within each processor. Hardware mechanisms proposed to improve the quality of service delivered to threads, in terms of instruction throughput, relate closely to job scheduling issues and to the policies introduced in this paper: they attempt to alleviate imbalances of service between threads by regulating the number of instruction issue slots used by each thread. The interference between threads in the shared cache has also been a focal point of investigation for multithreaded processors and chip multiprocessors.

The problem presents several interesting challenges from the architectural perspective, but with respect to job scheduling, the most common technique used to alleviate thread interference is to partition the cache between processors or execution contexts. Most of these works propose static cache partitioning schemes, although dynamic schemes have also been proposed recently. There is a consensus among these works that cache partitioning tends to be beneficial for job scheduling; however, identifying the right static or dynamic partitions to use in the cache remains a difficult problem.

PERFORMANCE MONITORING

THE CASE OF INTEL HYPER-THREADED PROCESSORS

Most modern processors are equipped with extra hardware that allows specific, performance-related hardware events to be monitored at run time. These events characterize the interaction of applications with the hardware, and the event counters provide a relatively cost-effective mechanism for obtaining such information.

Typical events relate to memory subsystem performance, execution unit utilization, branch prediction, resource stalls, and so on. Traditionally, hardware event counters have been used for off-line manual or automatic code restructuring and optimization. In the context of this paper we use them in a completely different way.
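As an aside, the snippet below shows one common way to count such an event at run time on Linux, using the perf_event_open system call; this is only a present-day sketch of event counting in general, not the interface used by the work described here, and the choice of event (last-level cache read misses) is arbitrary.

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>

    // Thin wrapper: glibc does not export perf_event_open directly.
    static long perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                                int group_fd, unsigned long flags) {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.size   = sizeof(attr);
        attr.type   = PERF_TYPE_HW_CACHE;                    // cache-related event
        attr.config = PERF_COUNT_HW_CACHE_LL |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = perf_event_open(&attr, 0 /* calling thread */, -1 /* any CPU */, -1, 0);
        if (fd == -1) { std::perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        // ... code of interest runs here, e.g. for one scheduling quantum ...
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        long long misses = 0;
        read(fd, &misses, sizeof(misses));
        std::printf("last-level cache read misses: %lld\n", misses);
        close(fd);
        return 0;
    }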

We exploit them at run time in order to dynamically identify the characteristics of the execution environment, and we use this knowledge to make informed scheduling decisions. On Intel processors with Hyper-Threading, a single set of performance counters and the corresponding configuration registers are shared between the two execution contexts. As a consequence, conflicts may arise if both threads on the same processor happen to request access to the same counter or configuration register.

To avoid such conflicts, the operating system binds threads that have activated performance counter monitoring to the first execution context of each physical processor. In other words, at any given time only one execution context on each processor may accommodate threads that use performance counters. To overcome this limitation, we use two sets of counters for each event we measure.

Both sets are activated by the processor manager on the thread that is specified to run on the first execution context of each physical processor during the next quantum. At the same time, we take advantage of a bitmask in the configuration registers that allows the programmer to distinguish between the events triggered by the threads running on the two execution contexts. More specifically, in our case the first counter set measures events caused by the thread running on the first execution context of the processor, whereas the second counter set measures events caused by the thread on the second execution context. At the end of each time quantum, the processor manager reads the values from both counter sets.
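The sketch below illustrates this two-counter-set arrangement in simplified form; the bit values, register numbers and the wrmsr/read_counter helpers are hypothetical stand-ins (simulated in user space here), not actual model-specific registers or a real kernel interface.

    #include <cstdint>
    #include <cstdio>
    #include <map>

    // Hypothetical "count events of context 0 / context 1" qualification bits.
    constexpr uint64_t EVSEL_COUNT_T0 = 1ull << 2;
    constexpr uint64_t EVSEL_COUNT_T1 = 1ull << 3;

    static std::map<uint32_t, uint64_t> g_msrs;    // simulated configuration registers
    static uint64_t g_counters[2] = {0, 0};        // simulated counter registers

    void wrmsr(uint32_t msr, uint64_t value) { g_msrs[msr] = value; }   // stand-in
    uint64_t read_counter(unsigned set) { return g_counters[set]; }     // stand-in

    // Start of a quantum: the thread bound to the first execution context programs
    // both counter sets, qualifying each set to one execution context.
    void program_both_sets(uint32_t evsel0, uint32_t evsel1, uint64_t event_bits) {
        wrmsr(evsel0, event_bits | EVSEL_COUNT_T0);  // set 0 counts context 0 only
        wrmsr(evsel1, event_bits | EVSEL_COUNT_T1);  // set 1 counts context 1 only
    }

    // End of the quantum: read both sets; the processor manager knows which thread
    // ran on which execution context, so each value is attributed to that thread.
    void read_both_sets(uint64_t& ctx0_events, uint64_t& ctx1_events) {
        ctx0_events = read_counter(0);
        ctx1_events = read_counter(1);
    }

    int main() {
        program_both_sets(0x100, 0x101, 0x45);       // arbitrary illustrative values
        g_counters[0] = 1200; g_counters[1] = 900;   // pretend a quantum has elapsed
        uint64_t a = 0, b = 0;
        read_both_sets(a, b);
        std::printf("context 0: %llu events, context 1: %llu events\n",
                    (unsigned long long)a, (unsigned long long)b);
    }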

Given that the processor manager has accurate knowledge of, and complete control over, the association between execution contexts and threads during any quantum, the values read from each counter set can be attributed to the corresponding thread. To the best of our knowledge, this is the first implementation that successfully deals with the problem of sharing performance monitoring hardware between threads on Intel Hyper-Threaded processors. Furthermore, our solution does not require any modification to kernel code.

SEMANTICS AND NATURAL EXECUTION ORDER

The semantics is lexicographic: statements are lexicographically ordered by ';'. The value read from a parameter with a shared read access specification is the last value written before it according to the lexicographic order. This order is called the reference order. The reference order is important for execution: if only one processor is available, then executing the closures in this reference order is correct and does not require computing the state of the closure objects and access objects of the data-flow graph. This case corresponds to the standard execution, when no processor is idle. The next section deals with scheduling when a processor becomes idle.
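A small illustration of the reference order follows; the shared_w/shared_r wrappers and the two closures are hypothetical stand-ins written for this example, not the actual API of the system described above.

    #include <cstdio>

    // Minimal stand-ins for write and read access modes on a shared object.
    struct shared_w { int* p;       void write(int v) const { *p = v; } };
    struct shared_r { const int* p; int  read()       const { return *p; } };

    void producer(shared_w x) { x.write(42); }            // first statement: writes x
    void consumer(shared_r x) {                           // second statement: reads x
        std::printf("consumer sees %d\n", x.read());      // 42 under the reference order
    }

    int main() {
        int x = 0;
        // On a single processor, running the closures in their lexicographic
        // (reference) order is correct by itself: the read in consumer() sees the
        // last value written before it, with no data-flow graph state to consult.
        producer(shared_w{&x});
        consumer(shared_r{&x});
    }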
