6.004-16, Cache design issues

Add to Favourites
Post to:

MIT OpenCourseWarehttp://ocw.mit.edu For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 6.004 Computation Structures Spring 2009 L16 – Cache Issues 1 6.004 – Spring 20094/7/09Cache Issues Cache miss Why the book? Lecture notes are faster! modified 4/6/09 14:12 Quiz #3 Friday! L16 – Cache Issues 2 6.004 – Spring 20094/7/09Basic Cache Algorithm ON REFERENCE TO Mem[X]:Look for X among cache tags... HIT:X = TAG(i) , for some cache line i •READ:return DATA(i) •WRITE:change DATA(i); Start Write to Mem(X) MISS:X not found in TAG of any cache line •REPLACEMENT SELECTION: Select some line k to hold Mem[X] (Allocation) •READ:Read Mem[X] Set TAG(k)=X, DATA(K)=Mem[X] •WRITE:Start Write to Mem(X) Set TAG(k)=X, DATA(K)= new Mem[X]MAINMEMORYCPU(1!)Tag Data ABMem[A]Mem[B]L16 – Cache Issues 3 6.004 – Spring 20094/7/09Cache Design Issues Associativity – a basic tradeoff between •Parallel Searching (expensive) vs•Constraints on which addresses can be stored where Replacement Strategy: •OK, we’ve missed.Gotta add this new address/value pair to the cache. What do we kick out? Least Recently Used: discard the one we haven’t used the longest. Plausible alternatives, (e.g. random replacement. Block Size: •Amortizing cost of tag over multiple words of data Write Strategy: •When do we write cache contents to main memory? L16 – Cache Issues 4 6.004 – Spring 20094/7/09Associativity…01Fully Associative -expensive! -flexible: any address can be cached in any line Lots... ... or NONE! Direct Mapped -cheap (ordinary SRAM) -contention: addresses compete for cache lines L16 – Cache Issues 5 6.004 – Spring 20094/7/09Loop B: Pgm at 1024, dataat2048: … but not here! Loop A: Pgm at 1024, dataat 37: Works GREAT here… Direct-Mapped Cache Contention Assume 1024-line direct-mapped cache, 1 word/line. Consider tight loop, at steady state: (assume WORD, not BYTE, addressing) Memory Address 1024 37 1025 38 1026 39 1024 ... 1024 2048 1025 2049 1026 2050 1024 ... Cache Line037 138 239 00011220Hit/Miss HIT HIT HIT HIT HIT HIT HIT MISS MISS MISS MISS MISS MISS MISS We need some associativity, But not full associativity… L16 – Cache Issues 6 6.004 – Spring 20094/7/09Fully-assoc. vs. Direct-mapped Fully-associative N-line cache: •N tag comparators, registers used for tag/data storage ($$$) •Location A might be cached in any one of the N cache lines; no restrictions! •Replacement strategy (e.g., LRU) used to pick which line to use when loading new word(s) into cache •PROBLEM: Cost! Direct-mapped N-line cache: •1 tag comparator, SRAM used for tag/data storage ($) •Location A is cached in a specific line of the cache determined by its address; address “collisions” possible •Replacement strategy not needed: each word can only be cached in one specific cache line •PROBLEM: Contention! L16 – Cache Issues 7 6.004 – Spring 20094/7/09Cost vs Contention two observations...1. Probability of collision diminishes with cache size... ... so lets build HUGE direct-mapped caches, using cheap SRAM! 2. Contention mostly occurs between independent “hot spots” --•Instruction fetches vs stack frame vs data structures, etc •Ability to simultaneously cache a few (2? 4? 8?) hot spots eliminates most collisions ... so lets build caches that allow each location to be stored in some restricted set of cache lines, rather than in exactly one (direct mapped) or every line (fully associative). Insight: an N-way set-associativecache affords modest parallelism •parallel lookup (associativity): restricted to small set of N lines •modest parallelism deals with most contention at modest cost •can implement using N direct-mapped caches, running in parallel L16 – Cache Issues 8 6.004 – Spring 20094/7/09N-way Set-Associative Cache kHITDATA TO CPUINCOMING ADDRESS=?=?=?t0MEM DATAN direct-mapped caches, each with 2t lines Simultaneously addressed line in each subcache constitutes a setL16 – Cache Issues 9 6.004 – Spring 20094/7/09Associativitythe birds-eye viewCan place caches in 2D space: •Total lines = # Sets * Set Size •# Sets = 1: Fully Associative •Set Size = 1: Direct Mapped •Set Size = N: N-way Set Associative # Sets Set Size “set” L16 – Cache Issues 10 6.004 – Spring 20094/7/09ISSUE: Replacement Strategy Associativity implies choices… address Fully associative •compare addr with each tag simultaneously •location A can be stored in any cache line address Direct-mapped •compare addr with only one tag •location A can be stored in exactly one cache line Naddress N-way set associative •compare addr with N tags simultaneously •location A can be stored in exactly one set, but in any of the N cache lines belonging to that set L16 – Cache Issues 11 6.004 – Spring 20094/7/09Replacement Strategy LRU (Least-recently used) •keeps most-recently used locations in cache •need to keep ordered list of N items N! orderings O(log2N!) = O(N log2N) “LRU bits” + complex logic (0,1,2,3) Hit 2 -> (2,0,1,3) (2,0,1,3) Hit 1 -> (1,2,0,3) (1,2,0,3) Miss -> (3,1,2,0)(3,1,2,0) Hit 3 -> (3,1,2,0) Overhead is O(N log2N)bits/setOverhead is O(log2N)bits/setOverhead is O(log2N)bits/cache! FIFO/LRR (first-in, first-out/least-recently replaced) •cheap alternative: replace oldest item (dated by access time) •within each set: keep one counter that points to victim line Random (select replacement line using random, uniform distribution) •no “pathological” reference streams causing wost-case results •use pseudo-random generator to get reproducible behavior; •use real randomness to prevent reverse engineering! L16 – Cache Issues 12 6.004 – Spring 20094/7/09Cache Benchmarking Supposethis loop is entered with R3=4000:ADR: InstructionID400: LD(R3,0,R0) 400 4000+... 404: ADDC(R3,4,R3) 404 408: BNE(R0,400) 408GOAL: Given some cache design, simulate (by hand or machine) execution well enough to determine hit ratio. 1. Observe that the sequence of memory locations referenced is 400, 4000, 404, 408, 400, 4004, ... We can use this simpler reference string, rather than the program, to simulate cache behavior. 2. We can make our life easier in many cases by converting to wordaddresses: 100, 1000, 101, 102, 100, 1001, ... (Word Addr = (Byte Addr)/4) L16 – Cache Issues 13 6.004 – Spring 20094/7/09Cache Simulation Addr Line# Miss? 100 0 M 1000 1 M 101 2 M 102 3 M 100 0 1001 1 M 101 2 102 3 100 0 1002 1 M 101 2 102 3 100 0 1003 1 M 101 2 102 3 4-line Fully-associative/LRU 1/4 miss Addr Line# Miss? 100 0 M 1000 0 M 101 1 M 102 2 M 100 0 M 1001 1 M 101 1 M 102 2 100 0 1002 2 M 101 1 102 2 M 100 0 1003 3 M 101 1 102 2 4-line Direct-mapped 7/16 miss Compulsory MissesCollision Collision missL16 – Cache Issues 14 6.004 – Spring 20094/7/09Associativity: Full vs 2-way Addr Line# Miss? 100 0 M 1000 1 M 101 2 M 102 3 M 100 0 1001 4 M 101 2 102 3 100 0 1002 5 M 101 2 102 3 100 0 1003 6 M 101 2 102 3 8-line Fully-associative, LRU 1/4 miss Addr Set#/N Miss? 100 0,0 M 1000 0,1 M 101 1,0 M 102 2,0 M 100 0,0 1001 1,1 M 101 1,0 102 2,0 100 0,0 1002 2,1 M 101 1,0 102 2,0 100 0,0 1003 3,1 M 101 1,0 102 2,0 2-way, 8-line total, LRU 1/4 miss L16 – Cache Issues 15 6.004 – Spring 20094/7/09024681012141k2k4k8k16k32k64k128k1-way2-way4-way8-wayfullyassoc.Associativity vs. miss rate Miss rate (%)Cache size (bytes) Associativity H&P: Figure 5.9 •8-way is (almost) as effective as fully-associative •rule of thumb: N-line direct-mapped == N/2-line 2-way set assoc. (direct-mapped) L16 – Cache Issues 16 6.004 – Spring 20094/7/09Devil’s Advocacy Games Your company uses the cheaper FIFO cache, the competition uses LRU. Can you devise a benchmark to make your cache look better?Assume 0x100 sets, 2-way… 2-way, LRU 2-way, FIFO Set 0 tags: Adr Set, # H/MSet, # H/M1000, 0 M0, 0 M10000, 1 M0, 1 M1000, 0 H0, 0 H20000, 1 M0, 0 M10000, 0 M0, 0 H#0#1#0#1A carefully-designed benchmark can make either look better…Pessimal case: next adr referenced is the one just replaced! Random replacement makes this game harder… 100100100010001000200020001000100200020001000BINGO!L16 – Cache Issues 17 6.004 – Spring 20094/7/09Increasing Block SizeMore Data/Tag A31:4Mem[A]Mem[A+4]Mem[A+8]Mem[A+12]= ? [3:2][31:4]32ADDRDATA HIT•blocks of 2B words, on 2B word boundaries •always read/write 2B word block from/to memory •locality: access on word in block, others likely •cost: some fetches of unaccessed words BIG WIN if there is a wide path to memory Enlarge each line in cache: TAG D0D1D2D34 x 32 = 128 bits 28 bits Overhead < ¼ bit of Tag per bit of data L16 – Cache Issues 18 6.004 – Spring 20094/7/094-word block, DM Cache = ? [3:2][31:8]32ADDRDATA HITTAG D0D1D2D3[7:4]Use ordinary (fast) static RAM for tag and data storage Only one comparator for entire cache! 01516 cache lines 4 bit index 0x12M[0x1230] M[0x1234] M[0x1238] M[0x123C] 0x12M[0x1240] M[0x1244] M[0x1248] M[0x124C] 24-bit Tag! L16 – Cache Issues 19 6.004 – Spring 20094/7/09Valid bits V1100000MAINMEMORYCPUAMem[A]BMem[B]TAGDATAProblem: •Ignoring cache lines that don't contain anything of value... e.g., on start-up “Back door” changes to memory (eg loading program from disk) Solution: •Extend each TAG with VALID bit.•Valid bit must be set for cache line to HIT. •At power-up /reset : clear all valid bits •Set valid bit when cache line is first replaced.•Cache Control Feature: Flush cache by clearing all valid bits, Under program/external control. L16 – Cache Issues 20 6.004 – Spring 20094/7/09Handling of WRITES Observation: Most (90+%) of memory accesses are READs. How should we handle writes? Issues: Write-through: CPU writes are cached, but also written to main memory (stalling the CPU until write is completed). Memory always holds “the truth”.Write-behind: CPU writes are cached; writes to main memory may be buffered, perhaps pipelined. CPU keeps executing while writes are completed (in order) in the background. Write-back: CPU writes are cached, but not immediately written to main memory. Memory contents can be “stale”. Our cache thus far uses write-through. Can we improve write performance? L16 – Cache Issues 21 6.004 – Spring 20094/7/09Write-throughON REFERENCE TO Mem[X]: Look for X among tags... HIT: X == TAG(i) , for some cache line i •READ:return DATA[I] •WRITE:change DATA[I]; Start Write to Mem[X] MISS:X not found in TAG of any cache line •REPLACEMENT SELECTION: Select some line k to hold Mem[X] •READ:Read Mem[X] Set TAG[k] = X, DATA[k] = Mem[X] •WRITE:Start Write to Mem[X] Set TAG[k] = X, DATA[k] = new Mem[X] L16 – Cache Issues 22 6.004 – Spring 20094/7/09Write-backON REFERENCE TO Mem[X]: Look for X among tags... HIT: X = TAG(i) , for some cache line I •READ:return DATA(i) •WRITE:change DATA(i); Start Write to Mem[X] MISS:X not found in TAG of any cache line •REPLACEMENT SELECTION: Select some line k to hold Mem[X] Write Back: Write Data(k) to Mem[Tag[k]]•READ:Read Mem[X] Set TAG[k] = X, DATA[k] = Mem[X] •WRITE:Start Write to Mem[X] Set TAG[k] = X, DATA[k] = new Mem[X] Is write-back worth the trouble? Depends on (1) cost of write; (2) consistency issues. L16 – Cache Issues 23 6.004 – Spring 20094/7/09Write-back w/“Dirty” bits ON REFERENCE TO Mem[X]: Look for X among tags... HIT: X = TAG(i) , for some cache line I •READ:return DATA(i) •WRITE:change DATA(i); Start Write to Mem[X] D[i]=1MISS:X not found in TAG of any cache line •REPLACEMENT SELECTION: Select some line k to hold Mem[X] If D[k] == 1 (Write Back) Write Data(k) to Mem[Tag[k]]•READ:Read Mem[X]; Set TAG[k] = X, DATA[k] = Mem[X], D[k]=0•WRITE:Start Write to Mem[X] D[k]=1Set TAG[k] = X, DATA[k] = new Mem[X] MAINMEMORYCPUAMem[A]BMem[B]TAGDATAV1100000D10L16 – Cache Issues 24 6.004 – Spring 20094/7/09Caches: Summary Associativity:•Less important as size increases •2-way or 4-way usually plenty for typical program clustering; BUT additional associativity •Smooths performance curve •Reduces number of select bits (we’ll see shortly how this helps) •TREND: Invest in RAM, not comparators. Replacement Strategy:•BIG caches: any sane approach works well •REAL randomness assuages paranoia! Performance analysis:•Tedious hand synthesis may build intuition from simple examples, BUT •Computer simulation of cache behavior on REAL programs (or using REAL trace data) is the basis for most real-world cache design decisions.

Description
Cache memory is a memory which is placed in between RAM and processor (CPU) in order to increase the speed of processor. There are different types of Cache design methodoligies such as associtivity, direct mapped cache and they are explained here very nicely with the help of diagrams.

“Prof. Steve Ward, 6.004-16, Cache design issues , 6.004 Computation Structures, Electrical Engineering and Computer Science , Engineering, Massachusetts Institute of Technology: MIT Open Course Ware,http://ocw.mit.edu (08-08-2011).License: Creative Commons BY-NC-SA: http://ocw.mit.edu/terms/#cc".

Comments

Want to learn?

Sign up and browse through relevant courses.

Name:
Your Email:
Password:
Country:
Contact no:


Area code Number
Subjects you are interested in:
Word verification: (Enter the text as in image)


Sign Up Already a member? Sign In
I agree to WizIQ's User Agreement & Privacy Policy
LearnOnline Through OCW
OpenCourseWare
User
102 Followers

Your Facebook Friends on WizIQ

Give live classes, create & sell online courses

Try it free Plans & Pricing

Connect