A Practical Stride Prefetching Implementation in Global Optimizer
26 Slides380.50 KB
A Practical Stride Prefetching Implementation in Global Optimizer Hucheng Zhou, Xing Zhou Tsinghua University 08/12/23 1
Outline Introduction Motivation Algorithm Phase Ordering Prefetching Scheduling Experiments Future Work 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 2
Introduction What’s data prefetching Brings data into cache ahead of its use Compiler controlled prefetching Prefetching candidates identification Prefetching timing determination Unnecessary prefetching elimination Other prefetching tuning 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 3
Introduction Stride data prefetching Massive consecutive memory references Cause to many cache misses, thus poor performance Our focus Compiler based stride data prefetching 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 4
Motivation Dominant stride prefetching algorithm Loop Nest Optimizer (LNO) based LNO based algorithm Locality analysis (reuse analysis localized iteration space prefetching predicates) Loop splitting (loop peeling and unrolling) Scheduling prefetches (iterations ahead of use) Limitations of LNO based approach Observations 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 5
LNO based algorithm Example: (a). Original loop: for (int i 0; i 1000; i) for (int j 0; j 1000; j) A[i][0] j; (b). After LNO-based prefetching: for (int i 0; i 5; i) // 5 iters ehead prefetch(&A[i][0]); // prologue for (int i 5; i 995; i) prefetch(&A[i 5][0]); // peeling, steady state for (int j 0; j 1000; j) A[i][0] j; for (int i 995; i 1000; i) for (int j 0; j 1000; j) A[i][0] j; // epilogue 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 6
Limitations Only effective for affine array reference Only handle with DO loop nest Due to the vector space model Just focus on numerical applications operate on dense matrices However, not all of the strided references are affine array references, such as c STL vector traversing and other wrap around data structures 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 7
Necessity Four common ways of STL vector traversing typedef double Type; #define LEN 10000 vector Type V(LEN); Type sum; #define ACCESS1 {\ sum 0.0; \ for(int i 0; i LEN; i ) { \ sum V[i]; \ } \ } #define ACCESS2 {\ sum 0.0; \ vector Type ::iterator it;\ for(it V.begin(); it ! V.end(); it ) { \ sum *it; } \ } #define ACCESS3 {\ sum 0.0; \ for(int i 0; i LEN; i ) { \ sum V.at(i); \ } \ } #define ACCESS4 {\ sum 0.0; \ vector Type ::const iterator cit; \ for(cit V.begin(); it ! V.end(); it ) { \ sum *cit; } \ } 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 8
The Component flow of Open64 Source Code in C/C /FORTRAN FE Front End Very High level WHIRL PRE-OPT IPA Inter-procedural Analysis High Level WHIRL PRE-OPT LNO Loop Nest Optimizer High Level WHIRL Lower Arrays Lower high level control flows PRE-OPT WOPT CG Global Optimizer Middle Level WHIRL Code Generator Object Code 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 9
IR after PRE-OPT For ACCESS1 and ACCESS2 ACCESS1: // start &(*V.begin()); // finish &(*V.end()); sum 0; anon2 ( IEEE64 *)start; for(anon1 0; anon1 9999 ; anon1 anon1 1) { anon2 anon2 anon1 * 8; sum *anon2 sum; } ACCESS2: annon 1 ( IEEE64 *)start; annon 2 ( IEEE64 *)finish; while(anon1 ! anon2) { sum *anon1 sum; anon1 anon1 8; } 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 10
Compare with array references Source Code double A[1000]; for (int i 0; i 1000; i i 2) A[i] A[i -1] 5; WHIRL after PRE-OPT IEEE64 anon1[1000LL]; for(anon3 0; (anon3 * 2) 999; anon3 anon3 1) anon1[anon3 * 2] anon1[(anon3 * 2) -1] 5.0; 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 11
Comparison LNO based approach exploits the tight affinity with locality analysis and vector space model to identify the prefetching candidates which suffer from cache misses However, this affinity limits itself only for affine array references, cannot handle STL style stride references From another angle, identify stride prefetching candidates as induction variable recognition, then exploit the phase ordering to avoid unnecessary prefetches 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 12
Definitions and Observations A linear inductive variable (expression) is an expression whose value is incremented by a nonzero integer loop invariant on every i terations Lemma 1: linear inductive expression can be recursively defined: If v is a linear induction variable with stride s, then i is a linear inductive expression with the same stride s; If expr is a linear inductive expression with stride s, then –expr is a linear inductive expression with the same stride -s; If expr is linear inductive expression with stride s and invar is a loop inva riant, then expr invar and invar expr are all inductive expressions wi th stride s; If expr1 and expr2 are linear inductive expressions with stride s1 and s2 respectively, then expr1 expr2 is a linear inductive expression with stri de s1 s2; If expr is linear inductive expression with stride s and invar is a loop inva riant, then expr * invar and invar * expr are all inductive expressions wit h stride invar * s; If expr is linear inductive expression with stride s and invar is a loop inva riant, then expr / invar is a linear inductive expression with stride s/invar. 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 13
Definitions and Observat ions So, Mathematically, it equals to the linear combin ation of linear induction variables and loop invari ants, with the form: E c1* i1 c2*i2 cn*in invar, where stride value s *c i s n i i i 1 A stride reference is the reference in a loop wh ose accessed memory address is incremented by a integer loop invariant on every iterations Lemma 2: If a reference in loop whose accessed memory address is represented as an inductive e xpression, then it is a stride reference 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 14
Speculative Induction Variabl e Recognition for Stride Pref etching Thus stride reference identification equals to induction expression recognition We have presented an algorithm for demand driven speculative recognition of induction expression 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 15
Speculative Induction Variable Recognition for Stride Induction variablesPrefetching in SSA form must satisfy the following condition : there must be a live phi in the corresponding loop header BB among the two operands of the phi, the loop invariant operand must point to the initialization of the induction variable out of the loop, while the other operand must be defined within the loop body. We call them init and increment respectively After expanding the increment operand of phi by copy propagation, the expanding result must contain the result of that phi, with a loop invariant expression as stride of the induction variable 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer v1 ; v2 phi (v1, v4) v3 rhs1; v4 rhs2; 16
Our algorithm 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 17
08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 18
Comparison Traditional induction variable recognition Equals to strongly connected component Just for variable Conservative due to alias Limitations of copy propagation Our algorithm Demand driven Symbolic interpretation Speculative determination Modify a few on the expansion process of current im plementation 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 19
Phase Ordering Implement our algorithm after SSAPRE will benefit from strength reduction and PRE optimizations int A[100][100]; for (int i 0; i 100; i) for (int j 0; j 100; j) A[i][0] j; (a) preg2 0; loop 1: preg1 *(&A preg2); loop 2: preg1 preg1 j; j j 1; if (j 100) goto loop 2; preg2 preg2 400; i i 1; if (i 100) goto loop 1; (d) 08/12/23 loop 1: loop 2: *(&A i * 400) *(&A i * 400) j; j j 1; if (j 100) goto loop 2; i i 1; if (i 100) goto loop 1; loop 1: preg1 *(&A i * 400); loop 2: preg1 preg1 j; j j 1; if (j 100) goto loop 2; i i 1; if (i 100) goto loop 1; (b) Preg3 &A; preg2 0; loop 1: preg1 *preg3; loop 2: preg1 preg1 j; j j 1; if (j 100) goto loop 2; preg2 preg2 400; preg3 preg3 400; i i 1; if (i 100) goto loop 1; (c) preg3 &A; loop 1: preg1 *preg3; loop 2: preg1 preg1 j; j j 1; if (j 100) goto loop 2; preg3 preg3 400; if (preg3 A 40000) goto loop 1; (e) A Practical Stride Prefetching Implementation in Global Optimizer (f) 20
Prefetching Scheduling Leading reference determination Prefetching information collection Prefetching determination for the candidates Based on the heuristics, such as data and loop size as well as the number of prefetches in the loop Computation of prefetching distance Stride value, data/loop shape, target cache model division of memory latency and the estimated time per iteration Loop transformations based on locality information to further reduce the number of prefetches 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 21
Experiments We have conducted experiments agains t SPEC2006 benchmark on IA64 Itanium 2 Madison 1.6GHz with 6MB L3 cac he and 8 GBytes memory quad-processor server with RedhatLinux Ad vanced Server 4.0 compiler is Open64 4.1 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 22
Normalized results of SPEC2006 FP 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 23
Normalized results of SPEC2006 INT 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 24
Conclusion and Future Work we propose an alternative inductive data prefetching algorithm implemented in global optimizer at O2 level, which can in theory prefetch almost all of the stride references statically determined in compile time extend to prefetch periodic, polynomial, geometric, monotonic and wrap-around variables Totally integrated stride prefetching algorithm with strength reduction optimization in SSAPRE coordinate the data prefetch with data layout optimization further investigate the interaction between software and hardware prefetching according to the static compiler analysis and feedback information on X86 platform 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 25
Thanks Thank you very much And any questions? 08/12/23 A Practical Stride Prefetching Implementation in Global Optimizer 26