6.004-22, Pipelined Beta implementation, bypassing

Add to Favourites
Post to:

MIT OpenCourseWarehttp://ocw.mit.edu For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 6.004 Computation Structures Spring 2009 L22 – Pipelining the Beta 1 6.004 – Spring 20094/30/09Pipelining the Beta betta ('be-t&) n. Any of various species of small, brightly colored, long-finned freshwater fishes of the genus Betta,found in southeast Asia. beta (‘bA-t&, ‘bE-) n. 1. The second letter of the Greek alphabet. 2. The exemplary computer system used in 6.004. I don’t think they mean the fish... maybe they’ll give me partial credit... modified 4/27/09 11:17 Lab #7 due Tonight! L22 – Pipelining the Beta 2 6.004 – Spring 20094/30/09CPU Performance We’ve got a working Beta… can we make it fast?MIPS = Millions of Instructions/Second Freq = Clock Frequency, MHz CPI = Clocks per Instruction MIPS = FreqCPITo Increase MIPS: 1. DECREASE CPI. -RISC simplicity reduces CPI to 1.0. -CPI below 1.0? Tough... you’ll see multiple instruction issue machines in 6.823. 2. INCREASE Freq. -Freq limited by delay along longest combinational path; hence -PIPELININGis the key to improved performance through fast clocks. L22 – Pipelining the Beta 3 6.004 – Spring 20094/30/09Beta Timing New PC 4Fetch Inst. Control Logic Read Regs RA2SEL mux ASEL mux BSEL mux ALU Fetch data OFFSETWDSEL mux RF setup PC setup Mem setup PCSEL mux =0?CLKCLKWanted: longest paths Complications: •some apparent paths aren’t “possible” •operations have variable execution times (eg, ALU) •time axis is not to scale (eg, tPD,MEM is very big!) 4Control Fetch “precedence graph” PC+4ALU +OFFSETLD(R1,10,R0) LDR(X,R3)L22 – Pipelining the Beta 4 6.004 – Spring 20094/30/09Why isn’t this a 20-minute lecture? 1.The Beta isn’t combinational… Explicit state in register file, memory; Hidden state in PC. 2.Consecutive operations – instruction executions – interact: •Jumps, branches dynamically change instruction sequence •Communication through registers, memory Our goals: •Move slow components into separate pipeline stages, running clock faster •Maintain instruction semantics of unpipelined Beta as far as possibleWe’ve learned how to pipeline combinational circuits. What’s the big deal? L22 – Pipelining the Beta 5 6.004 – Spring 20094/30/09Ultimate Goal: 5-Stage Pipeline GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU) APPROACH: structure processor as 5-stage pipeline: IFInstruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to WBWrite-Back stage: writes result back into register file. RFRegister File stage: Reads source operands from register file, passes them to ALU ALU stage: Performs indicated operation, passes result to MEM Memory stage: If it’s a LD, use ALU result as an address, pass mem data (or ALU result if not LD) to L22 – Pipelining the Beta 6 6.004 – Spring 20094/30/09First Steps: A Simple 2-Stage Pipeline ASEL01Data Memory RDWDAdrR/WWDSEL012WA<25:21>01XPPCIFJT+4InstructionMemoryAD<15:11><20:16>RA2SEL<25:21>+RegisterFileRA1RA2RD1RD2BSEL01ZALUABJTWAWDWEALUFNPC+401Wr01234XAdrILLOPWASELWERF00PCSEL<15:0>PCEXE00IREXEIFEXE L22 – Pipelining the Beta 7 6.004 – Spring 20094/30/092-Stage Pipelined Beta Operation ..ADDC(r1, 1, r2) SUBC(r1, 1, r3) XOR(r1, r5, r1) MUL(r2, r6, r0) ...Consider a sequence of instructions: Executed on our 2-stage pipeline: IFEXEii+1i+2i+3i+4i+5i+6...SUBCADDCMULXOR...SUBCADDCMULXORTIME (cycles) Pipeline L22 – Pipelining the Beta 8 6.004 – Spring 20094/30/09Pipeline Control Hazards BUT consider instead: IFEXEii+1i+2i+3i+4i+5i+6...CMPADDXORBT ...CMPADD?BT LOOP:ADD(r1, r3, r3) CMPLEC(r3, 100, r0) BT(r0, LOOP) XOR(r3, -1, r3) MUL(r1, r2, r2) ...This is the cycle where the branch decision is made… but we’ve already fetched the following instruction which should be executed only if branch is not taken! L22 – Pipelining the Beta 9 6.004 – Spring 20094/30/09Branch Delay Slots PROBLEM: One (or more) following instructions have been pre-fetched by the time a branch is taken. 2. “Program around it”. Either a)Follow each BR with a NOP instruction; or b)Make compiler clever enough to move USEFUL instructions into the branch delay slots i.Always execute instructions in delay slots ii.Conditionally execute instructions in delay slots NOP = ‘no-operation’, e.g. ADD(R31, R31, R31) POSSIBLE SOLUTIONS: 1.Make hardware “annul” instructions following branches which are taken, e.g., by disabling WERF and WR.L22 – Pipelining the Beta 10 6.004 – Spring 20094/30/09Branch Alternative 1 Make the hardware annul instructions in the branch delay slots of a takenbranch. IFEXEii+1i+2i+3i+4i+5i+6CMPADDXORBT CMPADDBT CMPADDBT CMPADDXORPros: same program runs on both unpipelined and pipelined hardware Cons: in SPEC benchmarks 14% of instructions are taken branches 12% of total cycles are annulled LOOP:ADD(r1, r3, r3) CMPLEC(r3, 100, r0) BT(r0, LOOP) XOR(r3, -1, r3) MUL(r1, r2, r2) ...Branch taken NOPL22 – Pipelining the Beta 11 6.004 – Spring 20094/30/09Branch Annulment Hardware ASEL01Data Memory RDWDAdrR/WWDSEL012WA<25:21>01XPPCIFJT+4InstructionMemoryAD<15:11><20:16>RA2SEL<25:21>+RegisterFileRA1RA2RD1RD2BSEL01ZALUABJTWAWDWEALUFNPC+401Wr01234XAdrILLOPWASELWERF00PCSEL<15:0>PCEXE00IREXEANNULIF01NOPL22 – Pipelining the Beta 12 6.004 – Spring 20094/30/09Branch Alternative 2a Fill branch delay slots with NOP instructions (i.e., the software equivalent of alternative 1) IFEXEii+1i+2i+3i+4i+5i+6CMPADDNOPBT CMPADDBT CMPADDBT CMPADDNOPBranch taken Pros: same as alternative 1 Cons: NOPs make code longer; 12% of cycles spent executing NOPs LOOP:ADD(r1, r3, r3) CMPLEC(r3, 100, r0) BT(r0, LOOP) NOP()XOR(r3, -1, r3) MUL(r1, r2, r2) ... L22 – Pipelining the Beta 13 6.004 – Spring 20094/30/09Branch Alternative 2b(i) Put USEFUL instructions in the branch delay slots; remember they will be executed whether the branch is taken or not IFEXEii+1i+2i+3i+4i+5i+6CMPADDADDBT CMPADDBT BT CMPADDBT CMPADDBranch taken Pros: only two “extra” instructions are executed (on last iteration) Cons: finding “useful” instructions that are always executed is difficult; clever rewrite may be required. Program executes differently on naïve unpipelined implementation. LOOP:ADD(r1,r3,r3)LOOPx: CMPLEC(r3,100,r0) BT(r0,LOOPx)ADD(r1,r3,r3)SUB(r3,r1,r3)XOR(r3,-1,r3)MUL(r1,r2,r2)...We need to add this silly instruction to UNDO the effects of that last ADDL22 – Pipelining the Beta 14 6.004 – Spring 20094/30/09Branch Alternative 2b(ii) Put USEFUL instructions in the branch delay slots; annul them if branch doesn’t behave as predicted IFEXEii+1i+2i+3i+4i+5i+6CMPADDADDBT CMPADDBT BT CMPADDBT CMPADDBranch taken Pros: only one instruction is annulled (on last iteration); about 70% of branch delay slots can be filled with useful instructions Cons: Program executes differently on naïve unpipelined implementation; not really useful with more than one delay slot. LOOP:ADD(r1, r3, r3) LOOPx: CMPLEC(r3, 100, r0) BT.taken(r0,LOOPx)ADD(r1, r3, r3) XOR(r3, -1, r3) MUL(r1, r2, r2) ...L22 – Pipelining the Beta 15 6.004 – Spring 20094/30/09Architectural Issue: Branch Decision Timing BETA approach: • SIMPLE branch condition logic ... Test for Reg[Ra] = 0! • ADVANTAGE: early decision, single delay slot ALTERNATIVES: • Compare-and-branch... (eg, if Reg[Ra] > Reg[Rb]) • MORE powerful, but • LATER decision (hence more delay slots) IFinstruction Instruction Fetch ALU ALU CLABinstruction Register File CLRF (read) instruction instruction YWrite Back CLRF (write) instruction YMemory CLSuppose decision were made in the ALU stage ... then there would be 2 branch delay slots (and instructions to annul!) Wow! I guess those guys really were thinking when they made up all those instructions L22 – Pipelining the Beta 16 6.004 – Spring 20094/30/09(NB:SAMERF AS ABOVE!) +4PCRFPCMEMPCALURb: <15:11>Ra <20:16>RA2SELRc <25:21>InstructionMemoryADIRRFInstructionFetch+C: <15:0> << 2sign-extendedRDRegisterFileWAWDWEWDSEL0 1 2IRMEMDMEMYMEMDALUBIRALUA01BSELZALUABYData MemoryWDAdrR/WRegisterFileALUWriteBackASEL01Rc <25:21>+CPCSELJT01234XAdrILLOPPCIF0001XPWASELWARegisterFileRA1RA2RD1RD2JTC: <15:0>sign-extendedALUFNWERF4-Stage Beta Pipeline Treat register file as two separate devices: combinational READ, clocked WRITE at end of pipe. What other information do we have to pass down pipeline? PC instruction fields What sort of improvement should expect in cycle time? (return addresses) (decoding) L22 – Pipelining the Beta 17 6.004 – Spring 20094/30/094-Stage Beta Operation ...ADDC(r1, 1, r2) SUBC(r1, 1, r3) XOR(r1, r5, r1) MUL(r2, r6, r0) ...Consider a sequence of instructions: Executed on our 4-stage pipeline: ...SUBCADDCMULXOR...SUBCADDCMULXOR...SUBCADDCMULXORSUBCADDCMULXORTIME (cycles) Pipeline IFRFALUWBii+1i+2i+3i+4i+5i+6L22 – Pipelining the Beta 18 6.004 – Spring 20094/30/09r3fetched r3available Pipeline “Data Hazard” BUT consider instead: ADD(r1, r2, r3) CMPLEC(r3, 100, r0) MULC(r1, 100, r4) SUB(r1, r2, r5) ADDCMPMULADDADDADDCMPCMPCMPSUBMULMULMULSUBSUBSUBOops! CMP is trying to read Reg[R3] during cycle i+2 but ADD doesn’t write its result into Reg[R3] until the end of cycle i+3! IFRFALUWBii+1i+2i+3i+4i+5i+6L22 – Pipelining the Beta 19 6.004 – Spring 20094/30/09Data Hazard Solution 1 “Program around it” ... document weirdo semantics, declare it a software problem. -Breaks sequential semantics! -Costs code efficiency. ADD(r1, r2, r3) CMPLEC(r3, 100, r0) MULC(r1, 100, r4) SUB(r1, r2, r5) ADD(r1, r2, r3) MULC(r1, 100, r4) SUB(r1, r2, r5) CMPLEC(r3, 100, r0) EXAMPLE: Rewrite asHow often can we do this? Programmer’s fallback: Insert NOPs (sigh!) L22 – Pipelining the Beta 20 6.004 – Spring 20094/30/09Data Hazard Solution 2 Stall the pipeline: Freeze IF, RF stages for 2 cycles, inserting NOPs into ALU-stage instruction register IFRFALUWBii+1i+2i+3i+4i+5i+6ADDCMPMULSUBADDCMPMULSUBADDADDCMPCMPMULNOPNOPNOPNOPMULCMPMULCMPDrawback: NOPs mean “wasted” cycles L22 – Pipelining the Beta 21 6.004 – Spring 20094/30/09Data Hazard Solution 3 Bypass (aka forwarding) Paths: Add extra data paths & control logic to re-route data in problem cases. ADDCMPMULSUBADDCMPMULSUBr3available ADDCMPMULSUBADDCMPMULSUBIdea: the result from the ADD which will be written into the register file at the end of cycle I+3 is actually available at output of ALU during cycle I+2 – just in time for it to be used by CMP in the RF stage! IFRFALUWBii+1i+2i+3i+4i+5i+6L22 – Pipelining the Beta 22 6.004 – Spring 20094/30/09Bypass Paths (I) RegisterFileWAWDWEALUABYIRWBIRALURegisterFileRA1RA2RD1RD2IRRFYWBBABypass muxes SELECT this BYPASS path if OpCodeRF = reads Ra and OpCodeALU = OP, OPC and RaRF = RcALU i.e., instructions which use ALU to compute result and RaRF != R31 ADD(r1,r2,r3)CMPLEC(r3,100,r0) L22 – Pipelining the Beta 23 6.004 – Spring 20094/30/09Bypass Paths (II) RegisterFileWAWDWEALUABYIRWBIRALURegisterFileRA1RA2RD1RD2IRRFYWBBABypass muxes MULC(r4,17,r5) ADD(r1,r2,r3)XOR(r2,r6,r1)SELECT this BYPASS path if OpCodeRF = reads Ra and RaRF != R31 and not using ALU bypass and WERF = 1 and RaRF = WA But why can’t we get It from the register file? It’s being written this cycle! L22 – Pipelining the Beta 24 6.004 – Spring 20094/30/09Next Time More Beta Bypasses Ahead

Description
Here the implimentation of pipelined Beta is done through several stages. Here it is explained about CPU performance, pipeline control hazards ,Branch delay slots, Branch decision making and also about Data hazard solution.

“Prof. Steve Ward, 6.004-22, Pipelined Beta implementation, bypassing,6.004 Computation Structures, Electrical Engineering and Computer Science , Engineering, Massachusetts Institute of Technology: MIT Open Course Ware,http://ocw.mit.edu (08-08-2011).License: Creative Commons BY-NC-SA: http://ocw.mit.edu/terms/#cc".

Comments

Want to learn?

Sign up and browse through relevant courses.

Name:
Your Email:
Password:
Country:
Contact no:


Area code Number
Subjects you are interested in:
Word verification: (Enter the text as in image)


Sign Up Already a member? Sign In
I agree to WizIQ's User Agreement & Privacy Policy
LearnOnline Through OCW
OpenCourseWare
User
102 Followers

Your Facebook Friends on WizIQ

Give live classes, create & sell online courses

Try it free Plans & Pricing

Connect