Summary of Instruction-Level Parallelism Limitations

December 5, 2012

Limitations of ILP

1. Most code is not highly parallelizable. Amdahl’s law puts an upper limit on parallelization: with parallelizable fraction p and n-way parallelism, speedup = 1 / ((1 - p) + p/n). ILP speedups may need to reach a factor of 5 to 20 to be generally accepted, given the VLIW processor complexity that must be introduced to achieve them [A]. In simulations, even with increased hardware available to the processor, programs did not achieve a corresponding linear increase in performance from exploiting ILP [C].
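
A minimal worked sketch of the Amdahl bound in C; the fractions and issue widths below are hypothetical, chosen only to illustrate the ceiling, not taken from the sources:

    /* Amdahl's law: with parallelizable fraction p and n-way parallelism,
     * speedup = 1 / ((1 - p) + p / n). */
    #include <stdio.h>

    double amdahl(double p, double n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        /* Even with 90% parallelizable code and unbounded issue width,
         * speedup can never exceed 1 / (1 - 0.9) = 10. */
        printf("p=0.9, n=8:   %.2f\n", amdahl(0.9, 8.0));   /* ~4.71  */
        printf("p=0.9, n=1e9: %.2f\n", amdahl(0.9, 1e9));   /* ~10.00 */
        return 0;
    }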

2. Data dependency. Either stall and waste resources (keeping the hardware simple), or:
2.1. Data speculation [B] – the complexity of handling predictions (serialized and non-serialized methods), handling spurious exceptions, memory disambiguation/aliasing, and the penalty of undoing mispredictions.
2.2. Forwarding – does not completely avoid stalling, and increases hardware complexity.
2.3. Instruction reordering (by the compiler and/or the hardware) [C]. This technique removes data dependencies that mostly originate from waiting for a memory operation to finish (memory operations account for 30-40% of total code [C]). It introduces hardware complexity such as reorder buffers, which allow out-of-order execution but in-order retirement of instructions [C]. Memory delays can also be shortened through the use of instruction and data caches; however, those present their own challenges and limitations (limited size, enforcing data coherency across processor caches, the overhead of managing efficient cache replacement policies, etc.). Generally speaking, there are many ways to accomplish instruction reordering:
2.3.1. Compilers can tag each instruction with dependency flags (listing all instructions the current instruction depends on), as in dependence architectures (dataflow processors). This can also be accomplished by the processor itself without the compiler’s help, as in sequential (superscalar) processors (which may take suggestions from the compiler but do not guarantee using them) [A].
2.3.2. Compilers can tag each instruction with how many of the preceding M instructions it depends on (as in the Horizon processor) [A].
2.3.3. Compilers can group instructions together into traces. This involves moving and duplicating instructions across basic blocks so that instructions can be executed early, as part of a basic block that will definitely be executed regardless of the control decisions in between. The result is larger code size but higher overall performance.
2.3.4. Compilers can use register renaming and loop unrolling to remove data dependencies across iterations and speed up execution (sending separate iterations to execute in parallel); this is referred to as software pipelining [A]. It adds a trade-off between unrolling more iterations for higher throughput and increasing code size, since some of the unrolled iterations may end up unnecessary (the loop may terminate earlier than the unrolled code assumes). Software pipelining goes beyond loop unrolling: it also moves code that does not depend on the loop outside of it (some to the top [prologue] and some to the bottom [epilogue]), leaving the truly iterable code in the middle [kernel], which is then re-rolled. This type of compiler scheduling is called modulo scheduling [A]. The approach can also cause register spilling (more registers are needed than the hardware provides), require condition prediction (we must speculate on whether the loop will execute at least one more time in order to unroll – static prediction), and run into true dependencies (data used in iteration i after being written in an iteration < i) and memory aliasing issues (will some pointers in one iteration write to the same address as in subsequent iterations?), etc. A sketch of the transformation follows this list.
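
As a minimal sketch of the prologue/kernel/epilogue split, consider a hypothetical array-scaling loop in C (the function and variable names are illustrative, not from the sources). The kernel overlaps the load for iteration i+1 with the multiply for iteration i, which is exactly the kind of independent work software pipelining exposes to the scheduler:

    /* Original loop: load -> multiply -> store forms a serial chain
     * inside each iteration, so the load latency stalls the multiply. */
    void scale_orig(float *a, int n, float k) {
        for (int i = 0; i < n; i++)
            a[i] = a[i] * k;
    }

    /* Software-pipelined (depth 2): issue the next load while the
     * current multiply executes. */
    void scale_pipelined(float *a, int n, float k) {
        if (n == 0) return;
        float cur = a[0];               /* prologue: first load          */
        int i;
        for (i = 0; i < n - 1; i++) {   /* kernel: load(i+1) || mul(i)   */
            float next = a[i + 1];
            a[i] = cur * k;
            cur = next;
        }
        a[i] = cur * k;                 /* epilogue: drain the last value */
    }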

3. Control dependency. Either stall and waste resources, or:
3.1. Branch prediction – the complexity of handling predictions, handling spurious exceptions, and reversing mispredictions (reservation stations, instruction caches, and instruction commit are used to allow this).
3.2. Simultaneous branch execution (branch fanout) – more functional units, more registers, waste of resources that could otherwise have been used elsewhere, more management of what gets committed and what gets discarded, etc. (a software analog is sketched after this list).
3.3. Compilers and hardware working together to fill more branch delay slots and to reorder instructions.
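
A minimal software analog of 3.2 in C, with illustrative names (my sketch, not a technique taken verbatim from the sources): both sides of the branch are evaluated and one result is selected, turning the control dependency into a data dependency that needs no prediction:

    /* Branchy version: the hardware must predict the comparison. */
    int max_branchy(int a, int b) {
        if (a > b)
            return a;
        return b;
    }

    /* Branchless version: compute both candidates, select one.
     * More work per call, but no misprediction penalty. */
    int max_branchless(int a, int b) {
        int take_a = a > b;                    /* 1 if a wins, else 0     */
        return take_a * a + (1 - take_a) * b;  /* select without branching */
    }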

4. Optimizing and parallelizing programs depends on knowing which basic blocks (or portions of the code) run more frequently than others – the execution frequency of basic blocks [A]. This can be determined by analyzing the control flow graph of the code with the help of profilers. It is not a straightforward process, and it is highly dependent on the workload at runtime.
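
A minimal sketch of what such profiling instrumentation looks like, assuming a hypothetical block numbering (the compiler, not the programmer, would normally insert these counters):

    /* One counter per basic block; the counts gathered on a training
     * run tell the compiler which blocks are hot. */
    long block_count[3];

    int abs_profiled(int x) {
        block_count[0]++;          /* entry block                    */
        if (x < 0) {
            block_count[1]++;      /* how often is this path taken?  */
            x = -x;
        }
        block_count[2]++;          /* exit block                     */
        return x;
    }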

5. Basic block ILP limitation [A]: aside from the data dependencies already mentioned above, this can be mostly attributed to limitations of functional hardware units when a block contains many similar (albeit independent) operations. For example, if the processor has 5 ALUs available to execute 5 adds in parallel (in the same cycle), but a basic block has 6 independent adds, then we need two cycles instead of one. That is why VLIW instructions pack differing operations that can execute in the same cycle, rather than many copies of the same operation (depending on what is available in the hardware). Furthermore, ILP may be limited to 3-4 parallelizable instructions out of roughly 10 in a basic block (as a high estimate) [F]. This is simply a limitation of the parallelism in the code itself.
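
To make the arithmetic concrete, a small illustration in C (the variable names are mine): six independent adds on a machine with five ALUs need ceil(6/5) = 2 cycles, even though no add depends on another:

    void six_adds(int *r) {
        /* cycle 1: five independent adds can issue together (5 ALUs busy) */
        r[0] = r[0] + 1;
        r[1] = r[1] + 1;
        r[2] = r[2] + 1;
        r[3] = r[3] + 1;
        r[4] = r[4] + 1;
        /* cycle 2: the sixth add must wait for a free ALU */
        r[5] = r[5] + 1;
    }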

6. Naming dependency (everything except true dependencies). This can be resolved with register [C] and memory renaming (except for locations with potential aliasing problems), which is usually done by the compiler [B]. It is still limited by register availability and management: besides potentially introducing register spilling, we may also run into register file bandwidth issues due to more reads and writes across the expanded set of registers.
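
A minimal renaming sketch in C with illustrative names: reusing `t` creates write-after-write and write-after-read hazards, and giving each value its own name removes them, leaving only the true dependencies:

    /* Before renaming: the two computations are serialized by the name t. */
    void before(int *a, int *b) {
        int t;
        t = a[0] + 1;   /* write t                                      */
        b[0] = t;       /* read t (true, RAW)                           */
        t = a[1] + 1;   /* WAW with the first write, WAR with the read  */
        b[1] = t;
    }

    /* After renaming t -> t0, t1: the two chains are independent and
     * can execute in parallel. */
    void after(int *a, int *b) {
        int t0 = a[0] + 1;
        int t1 = a[1] + 1;
        b[0] = t0;
        b[1] = t1;
    }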

7. A big reason why code is hard to parallelize is address calculations [D]. Eliminating or reducing long dependency chains of address calculations via compiler-optimized code has been shown to increase ILP. The paper [D] refers to techniques in other resources for optimizing address computation, so I will have to do more reading.
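
A minimal illustration of such a chain in C (my example, not from [D]): the first loop computes each address from the previous one, forming a serial chain, while the second computes every address independently from the loop index:

    /* Chained addresses: each value of p depends on the previous increment. */
    long sum_chained(const long *p, int n) {
        long s = 0;
        for (int i = 0; i < n; i++) {
            s += *p;
            p = p + 1;      /* address(i+1) depends on address(i) */
        }
        return s;
    }

    /* Independent addresses: address = p + i for each iteration. */
    long sum_indexed(const long *p, int n) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += p[i];
        return s;
    }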

8. Most of the data dependency that limits ILP comes from true dependencies (read-after-write dependencies), because the other two types (anti-dependencies [write after read] and output dependencies [write after write]) can be managed with renaming [E]. Those true dependencies come primarily from code the compiler introduces to support high-level language abstractions [E]. The compiler makes heavy use of the stack for the activation records corresponding to function calls and private variables, introducing a true dependency chain that significantly reduces parallelization [E]. The real limitation is not the time it takes to allocate or deallocate an activation record by adjusting the stack pointer register, but the long chain of dependencies this introduces [E]. [E] shows that even if all other dependencies were completely eliminated, leaving only those that update the stack pointer, total performance would be nearly unchanged (so stack updates control the absolute limit on achievable ILP). Removing stack update dependencies, however, has been shown to provide significant performance gains even compared to perfect prediction and renaming – use heap-based activation record allocation instead of stack allocation (accepting the higher allocation overhead to enable multi-threading and truly parallel execution of program traces). Other suggestions include using multiple stacks, or switching between stack-based and heap-based allocation at compile time based on the depth of the call chain (the deeper the call stack, the more benefit is gained from heap-based activation records) [E]. Some papers show that increasing the window size allows more parallelism to be exploited; [E] shows that while this may be true, “distant” dependencies (beyond the window size) cannot be exploited by superscalar out-of-order issue, so other methods are needed under any reasonable window size limitation.
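
A minimal sketch of the heap-based activation record idea from [E], with a hypothetical record layout (error handling omitted): with a stack, every call serializes on the single stack-pointer update, whereas per-call heap allocation removes that shared dependency, at the cost of allocator overhead:

    #include <stdlib.h>

    struct frame {      /* hypothetical activation record */
        int arg;
        int local;
    };

    int square_heap(int x) {
        /* replaces "sp -= frame_size": no dependence on a shared pointer */
        struct frame *f = malloc(sizeof *f);
        f->arg = x;
        f->local = f->arg * f->arg;
        int result = f->local;
        free(f);        /* replaces "sp += frame_size" */
        return result;
    }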

Things to look for or read about

1. Window size adjustments. How about compiler-controlled, variable window sizes?

2. Perfecting branch predictors – very important, since prediction accuracy has a major impact on exploiting ILP. Most papers I have read run simulations under unrealistic assumptions: perfect prediction, unlimited resources, unlimited bandwidth, unreasonably low misprediction penalties, ignoring spurious exceptions, etc.

3. Handling memory aliases at compile time.

4. Choosing stack-based or heap-based activation record allocation at compile time, perhaps even considering multiple stacks. This addresses the true dependencies the compiler introduces via deep chains of function-call dependencies; a major performance increase could be gained here.

5. Varying the clock rate per operation to increase throughput for faster operations. This could raise an otherwise low ceiling on potential throughput, even for embarrassingly parallel operations on the same CPU.

6. Compiler generation of per-thread traces, taking into account shared versus dedicated on-chip memory, possible proximity to shared caches, etc.

7. Can traces be concurrent rather than parallel? This would allow concurrent execution (letting values be forwarded or written to shared caches) rather than waiting for a complete trace to finish before another one starts, even on separate cores.

8. Maybe have the compiler enforce conventions that make the address fanout (range of memory addresses) of a given function or program predictable. For example, for dynamically allocated objects, the compiler could hint to the hardware how far apart they should be on the heap, allowing the hardware to take advantage of locality when loading a cache line from memory. These can only be hints, due to memory allocation and page replacement strategies, but cooperation from the hardware combined with hints from the software could increase this utilization.

9. Exploit the nature of sequence analysis algorithms to optimize performance.

10. A hybrid processor approach to realize ILP and DLP (combining VLIW/superscalar and vector processors).

Sources

A. Instruction-Level Parallel Processing. Joseph A. Fisher and B. Ramakrishna Rau.
B. Limits of Instruction Level Parallelism with Data Value Speculation. Jose Gonzalez and Antonio Gonzalez.
C. Exploiting Instruction- and Data-Level Parallelism. Roger Espasa and Mateo Valero.
D. Limits and Graph Structure of Available Instruction-Level Parallelism. Darko Stefanovic and Margaret Martonosi.
E. The Limits of Instruction Level Parallelism in SPEC95 Applications. Matthew A. Postiff, David A. Greene, Gary S. Tyson, and Trevor N. Mudge.
F. Limits of Instruction-Level Parallelism. David W. Wall.