Continued
Prior research on scheduling algorithms that account for the interaction between threads or processes sharing a processor is closely related to the work presented in this paper. Several seminal works on cache affinity scheduling used simulation, physical experimentation and analytical models to study the impact of time-sharing on cache performance.
Heuristics for estimating cache
affinity by measuring the time between consecutive executions of the same
thread on the same processor are used in practically all shared-memory
multiprocessor schedulers. Thread scheduling heuristics using accurate
information on cache misses from the hardware have also been proposed for standalone
multithreaded programs.
The advent of simultaneous
multithreading has stimulated research in all aspects of operating system
support for processors using this technique. Symbiotic job scheduling, an idea
originally developed for scheduling on the Tera MTA architecture, has been
proposed as a viable way of factoring the implications of sharing hardware
resources between threads into the scheduling process. Symbiotic job scheduling relies on
sampling the performance of different mixes of co-scheduled jobs, using
different types of performance metrics, until a consensus on the best pairing
is reached. This scheduling strategy may reach optimality for certain workloads
but is very sensitive to their dynamic characteristics.
The number of possible thread
pairings grows quickly with the number of threads (or single-threaded jobs) in
the workload and the number of execution contexts on the processor. The
performance samples tend to get obsolete as jobs come and go, or when jobs move
through different phases of their computation. The complexity grows even
further if one needs to consider multiple hardware metrics to infer the
interference in a given workload. Real processors cannot count many hardware
events simultaneously; therefore, statistical sampling and averaging are
necessary to gather all the information.
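To make the growth concrete, the short sketch below (an illustration added here, not part of the original study) counts the number of distinct ways n single-threaded jobs can be paired on 2-context SMT processors; the pool of candidate pairings a sampling scheduler would have to explore grows factorially.

#include <cstdio>

// Number of distinct ways to pair n single-threaded jobs on 2-context SMT
// processors: pick a partner for the first job (n-1 ways), then recurse on
// the remaining n-2 jobs, i.e. (n-1)!! for even n.
static unsigned long long pairings(unsigned n)
{
    unsigned long long count = 1;
    for (unsigned k = n - 1; k > 1; k -= 2)
        count *= k;
    return count;
}

int main()
{
    for (unsigned n = 4; n <= 16; n += 2)
        std::printf("%2u jobs -> %llu candidate pairings\n", n, pairings(n));
    return 0;
}

For example, 8 jobs already admit 105 distinct pairings and 16 jobs admit more than two million, before even considering phase changes or multiple hardware metrics.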
This paper takes a different
approach. Starting from a schedule selected to reduce interference between
processors on the shared bus, we use hardware metrics to converge to a schedule
that reduces thread interference within each processor. Hardware mechanisms
proposed to improve the quality of service to threads in terms of instruction
throughput relate closely to job scheduling issues and the policies we
introduced in this paper, as they attempt to alleviate imbalances of service
between threads by regulating the number of instruction issue slots used by
each thread. The interference between threads in the shared cache has also been
a focal point of investigation for multithreaded processors and chip
multiprocessors.
The problem presents several
interesting challenges from the architectural perspective, but with respect to
job scheduling, the most common technique used to alleviate thread interference
is to partition the cache between processors or execution contexts. Most of
these works propose static cache partitioning schemes, although dynamic schemes
have also been proposed recently. There is a consensus among these works that
cache partitioning tends to be beneficial for job scheduling; however, the
problem of identifying the right static or dynamic partitions to use in the
cache is difficult.
PERFORMANCE MONITORING
THE CASE OF INTEL HYPER-THREADED PROCESSORS
Most modern processors are equipped with extra hardware that allows specific,
performance-related hardware events to be monitored at run-time. These events
characterize the interaction of applications with the hardware, and the event
counters provide a relatively cost-effective mechanism for obtaining this
information.
Typical events relate to memory subsystem performance, execution unit
utilization, branch prediction, resource stalls, etc. Traditionally, the
hardware event counters have been used for off-line manual or automatic code
restructuring and optimization. In the context of this paper we use them in a
completely different way.
We exploit them at run-time, in order to dynamically identify the characteristics of the execution environment and use this knowledge to make informed scheduling decisions. On Intel processors equipped with Hyper-Threading, a single set of performance counters and the corresponding configuration registers are shared between the two execution contexts. As a consequence, conflicts may arise if both threads on the same processor happen to request access to the same counter or configuration register.
In order to avoid such conflicts, the operating system binds threads that have activated performance counter monitoring to the first execution context of each physical processor. In other words, at any given time, only one execution context on each processor may accommodate threads that use performance counters. In order to overcome this limitation, we use two sets of counters for each event we measure.
Both
sets are activated by the processor manager on the thread that is scheduled to
run on the first execution context of each physical processor during the next
quantum. At the same time, we take advantage of a bitmask in the configuration
registers which allows the programmer to distinguish between the events
triggered by the thread running on each execution context. More specifically,
in our case the first counter set is used to measure events caused by the
thread running on the first execution context of the processor, whereas the
second counter set measures events caused by the thread on the second execution
context. At the end of each time quantum, the processor manager reads the values from
both counter sets.
Given
that the processor manager has accurate knowledge of and complete control over the
association between execution contexts and threads during any quantum, the
values read from each counter set can be attributed to the corresponding
thread. To the best of our knowledge, this is the first implementation which
successfully deals with the problem of sharing performance monitoring hardware
between threads on Intel Hyper-Threaded processors. Furthermore, our solution
does not require any modification to kernel code.
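The following is a minimal sketch of how the two-counter-set scheme described above can be organized. It is not the authors' actual code: the register accesses are replaced by placeholder stubs, and the helper names and event code are invented for illustration; on real Hyper-Threaded (Pentium 4 class) hardware, the configuration registers provide per-logical-processor qualifier bits that play the role of the filter used here.

#include <cstdint>
#include <cstdio>

enum ContextFilter { COUNT_CONTEXT0, COUNT_CONTEXT1 };

// Placeholder for the real configuration-register write (assumed helper).
static void configure_event_set(int set, unsigned event, ContextFilter filter)
{
    std::printf("counter set %d: counting event %#x for execution context %d\n",
                set, event, filter == COUNT_CONTEXT0 ? 0 : 1);
}

// Placeholder for the real counter read (assumed helper).
static uint64_t read_counter(int set)
{
    return set == 0 ? 12345u : 6789u;   // dummy values for illustration
}

// Armed by the processor manager on the thread bound to the first execution
// context of each physical processor, before the next quantum starts.
static void arm_counters(unsigned event)
{
    configure_event_set(0, event, COUNT_CONTEXT0); // events of the context-0 thread
    configure_event_set(1, event, COUNT_CONTEXT1); // events of the context-1 thread
}

// At the end of the quantum the manager knows which thread ran on which
// context, so each counter set is attributed to the corresponding thread.
static void collect(int thread_on_ctx0, int thread_on_ctx1)
{
    std::printf("thread %d caused %llu events, thread %d caused %llu events\n",
                thread_on_ctx0, (unsigned long long)read_counter(0),
                thread_on_ctx1, (unsigned long long)read_counter(1));
}

int main()
{
    arm_counters(0x0C);   // hypothetical event code, for illustration only
    collect(7, 11);
    return 0;
}

The important property is the one stated in the text: because only the thread on the first execution context programs the counters, and the bitmask separates events per context, the two counter sets can be attributed to the two running threads without kernel changes.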
SEMANTIC AND NATURAL EXECUTION ORDER
The semantics are lexicographic:
statements are lexicographically ordered by ';'. The value read from a
parameter with a shared read (r) access specification is the last value written
according to the lexicographic order. This order is called the reference order.
The reference order is important for execution: if only one processor is
available, then executing the closures in this reference order is correct and
does not require computing the state of the closure objects and access objects
of the data flow graph. This case corresponds to the standard execution, when
no processor is idle. The next section deals with scheduling when a processor
becomes idle.
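As a small illustration of the reference order, the sketch below uses plain C++ lambdas in place of the runtime's actual closure and shared-access objects (an assumption made only for readability): when the queued statements are executed front to back on a single processor, a shared read returns the last value written before it in program order, and no data flow graph state needs to be consulted.

#include <functional>
#include <iostream>
#include <vector>

int main()
{
    int shared = 0;                           // datum accessed in write/read modes
    std::vector<std::function<void()>> closures;

    // Statements queued in the order they appear in the program text,
    // i.e. in the lexicographic (reference) order.
    closures.push_back([&] { shared = 1; });                   // write
    closures.push_back([&] { shared = 2; });                   // write
    closures.push_back([&] { std::cout << shared << "\n"; });  // shared read

    // Single-processor execution in reference order: the read prints 2,
    // the last value written before it, with no data flow graph bookkeeping.
    for (auto &run : closures)
        run();
    return 0;
}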