MIT OpenCourseWarehttp://ocw.mit.edu For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 6.004 Computation Structures Spring 2009 L23 – Pipeline Issues 1 6.004 – Spring 20095/5/09Pipeline Issues This pipeline stuff makes my head hurt! Maybe it’s that dumb hat Home Stretch: Lab #8 due Thursday 5/7; Quiz 5 FRIDAY 5/8!modified 5/4/09 10:06 L23 – Pipeline Issues 2 6.004 – Spring 20095/5/09Recalling Data Hazards IFRFALUWBii+1i+2i+3i+4i+5i+6ADDCMPMULSUBADDCMPMULSUBADDCMPMULSUBADDCMPMULSUBADD(r1, r2, r3) CMPLEC(r3, 100, r0) MULC(r3, 100, r4) SUB(r0, r4, r5) PROBLEM: Subsequent instructions can reference the contents of a register well before the pipeline stage where the register is written.SOLUTION #2: Add special hardware to maintain the sequential execution semantics of the ISA. SOLUTION #1: Deal with it in SOFTWARE; expose the pipeline for all to see. L23 – Pipeline Issues 3 6.004 – Spring 20095/5/09Bypass Paths RegisterFileWAWDWEALUABYIRWBIRALURegisterFileRA1RA2RD1RD2IRRFYWBBABypass muxes MULC(r3,100,r4) SUB(r0,r4,r5) CMPLEC(r3,100,r0) Add special data paths, called BYPASSES, that route the results of the ALU and WB stages to the RF stage, thus substituting the register’s old contents with a value that will be written to that register at some point in the future. Detection of these cases has to be incorporated into the decoding logic of the RF stage, which basically looks at the instructions in the ALU and WB stage to see if their destination register matches a source register reference. IFRFALU WBBut there are some problems that BYPASSING CAN’T FIX! L23 – Pipeline Issues 4 6.004 – Spring 20095/5/09Load Hazards Consider LOADS: Can we fix all these problems using bypass paths? LDADDXORLDADDXORLDADDXORLDADDXORLD(r1, 0, r4) ADD(r4, r1, r5) XOR(r3, r4, r6) The hazard between the XOR and the LD can be addressed by our established bypass paths… L23 – Pipeline Issues 5 6.004 – Spring 20095/5/09Load Hazard (easy) (NB:SAMERF AS ABOVE!) +4PCRFPCMEMPCALURb: <15:11>Ra <20:16>RA2SELRc <25:21>InstructionMemoryADIRRFInstructionFetch+C: <15:0> << 2sign-extendedRegisterFileWAWDWEWDSEL0 1 2IRMEMDMEMYMEMDALUBIRALUA01BSELZALUABYRDData MemoryWDAdrR/WRegisterFileALUWriteBackASEL01Rc <25:21>+CPCSELJT01234XAdrILLOPPCIF0001XPWASELWARegisterFileRA1RA2RD1RD2JTC: <15:0>sign-extendedALUFNWERFXOR(r3, r4, r6) LD(r1, 0, r4) The XOR operand r4 can simply be bypassed from the output of the memory in the WB stage to the RF stage… by our normal bypass path. L23 – Pipeline Issues 6 6.004 – Spring 20095/5/09Structural Data Hazard The XOR hazard is pretty easy, but… LDADDXORLDADDXORLDADDXORLDADDXORLD(r1, 0, r4) ADD(r1, r4, r5) XOR(r3, r4, r6) ???How do we fix thisone? In a 4-stage pipeline, for a LD instruction fetched during clock i, the data from memory isn’t returned from memory until late into cycle i+3. Bypassing can fix the XOR but not ADD! L23 – Pipeline Issues 7 6.004 – Spring 20095/5/09Load Hazard (hard) (NB:SAMERF AS ABOVE!) +4PCRFPCMEMPCALURb: <15:11>Ra <20:16>RA2SELRc <25:21>InstructionMemoryADIRRFInstructionFetch+C: <15:0> << 2sign-extendedRegisterFileWAWDWEWDSEL0 1 2IRMEMDMEMYMEMDALUBIRALUA01BSELZALUABYRDData MemoryWDAdrR/WRegisterFileALUWriteBackASEL01Rc <25:21>+CPCSELJT01234XAdrILLOPPCIF0001XPWASELWARegisterFileRA1RA2RD1RD2JTC: <15:0>sign-extendedALUFNWERFADD(r4, r1, r5) LD(r1, 0, r4) The r4 operand to the ADD instruction hasn’t yet been fetched from memory. It exists NOWHERE on our data paths – we can’t solve this problem by bypassing! L23 – Pipeline Issues 8 6.004 – Spring 20095/5/09Load Delay Bypassing can’t fix the problem with ADD since the data simply isn’t available! We have to add some pipeline interlock hardware to stall ADD’s execution. LDADDXORLDADDXORLDADDXORLDADDXORLD(r1, 0, r4) ADD(r1, r4, r5) XOR(r3, r4, r6) XORADDNOPNOPIf the compiler knows about a machine’s load delay, it can often rearrange code sequences to eliminate such hazards. Many compilers provide machine-specific instruction scheduling.L23 – Pipeline Issues 9 6.004 – Spring 20095/5/09Stall Logic (1)Freeze IF, RF stages(2)Introduce NOP into ALU stage(3)Wait until operand is available (NB:SAMERF AS ABOVE!) +4PCRFPCMEMPCALURb: <15:11>Ra <20:16>RA2SELRc <25:21>InstructionMemoryADIRRFInstructionFetch+C: <15:0> << 2sign-extendedRegisterFileWAWDWEWDSEL0 1 2IRMEMDMEMYMEMDALUBIRALUA01BSELZALUABYRDData MemoryWDAdrR/WRegisterFileALUWriteBackASEL01Rc <25:21>+CPCSELJT01234XAdrILLOPPCIF0001XPWASELWARegisterFileRA1RA2RD1RD2JTC: <15:0>sign-extendedALUFNWERFADD(r4, r1, r5) LD(r1, 0, r4) STALL LELELE01AnnulRFNOPLD(r1, 0, r4) NOPL23 – Pipeline Issues 10 6.004 – Spring 20095/5/09Memory Timing & Pipelining But, but, what about FASTER processors? FACT: Processors have become very fast relative to memories! And this gap continues to grow… Do we just increase the clock period to accommodate this bottleneck component? ALTERNATIVE: Longer pipelines. 1. Add “MEMORY WAIT” stages between START of read operation & return of data. 2. Build pipelined memories, so that multiple (say, N) memory transactions can be in progress at once. These steps add load delay slots; hence 3. Stall pipeline on unbypassable load delays. A 4-Stage pipeline requires READ access in less than one clock. A 5-Stage pipeline would allow nearly two clocks... L23 – Pipeline Issues 11 6.004 – Spring 20095/5/095-stagePipeline+4PCRFPCWBPCMEMPCALURb: <15:11>Ra <20:16>RA2SELRc <25:21>InstructionMemoryADIRRF• Omits some detail• NO bypass or interlock logicInstructionFetch+C: <15:0> << 2sign-extendedData MemoryRDRegisterFileWAWDWEWDSEL0 1 2IRWBYWBIRMEMDMEMYMEMDALUBIRALUABSEL01ZALUABYWDAdrR/WRegisterFileALUWriteBackMemoryASEL01Rc <25:21>+4+C*4PCSELJT01234XAdrILLOPPCIF0001XPWASELWARegisterFileRA1RA2RD1RD2JTC: <15:0>sign-extendedALUFNWERFAddress available right after instruction enters Memory pipe stage Data needed right before rising clock edge at end of Write Back pipe stage almost 2 clock cycles L23 – Pipeline Issues 12 6.004 – Spring 20095/5/095-stagepipeline+4PCRFPCWBPCMEMPCALURb: <15:11>Ra <20:16>RA2SELRc <25:21>InstructionMemoryADIRRF• Omits some detail• NO bypass or interlock logicInstructionFetch+C: <15:0> << 2sign-extendedData MemoryRDRegisterFileWAWDWEWDSEL0 1 2IRWBYWBIRMEMDMEMYMEMDALUBIRALUABSEL01ZALUABYWDAdrR/WRegisterFileALUWriteBackMemoryASEL01Rc <25:21>+4+C*4PCSELJT01234XAdrILLOPPCIF0001XPWASELWARegisterFileRA1RA2RD1RD2JTC: <15:0>sign-extendedALUFNWERFWe wanted a simple, clean pipeline but… • added IRIF mux to annul branch-slot instructions NOP• added LE/muxes to freeze IF/RF stage so we can wait for LD to reach WB stage NOP• added A/B bypass muxes to get data before it’s written to regfile L23 – Pipeline Issues 13 6.004 – Spring 20095/5/09RF-stage Bypass Details RD1 RD2 Register File A Bypass B Bypass ASEL01ABSEL01BPC+4+4*SXT(C)SXT(C)To ALU To ALU DTo Mem JTZfrom ALU/MEM/WB from ALU/MEM/WB Note: can use “distributed” mux built from tristate drivers. We’ve been a littlesloppy about this detail1, 2, 3, 4… Hum, it looks like we have a few extra inputs on that muxL23 – Pipeline Issues 14 6.004 – Spring 20095/5/09Bypass Implementation ABYALU YMEMto alu from regs to regs Bypass From ALU Bypass From MEM Bypass From WB Select “0” ••••••••••••••from mem aluPCALU To reduce the amount of bypass logic, the WDSEL mux has been split: choice between ALU and PC+4 is made in ALU stage, choice between ALU/PC and MEMDATA is made is WB stage. Part 1 of the WDSEL mux Part 2 of the WDSEL mux L23 – Pipeline Issues 15 6.004 – Spring 20095/5/09Bypass Logic Ra or Rb/Rc Rc (WB)* Rc (MEM)* Rc (ALU)* “31” Select “0” ALU bypass MEM bypass WB bypass Regfile (no bypass) Beta Bypass logic (need two copies for A/B data): 5-bitcompare * If instruction is a ST (doesn’t write into regfile), set RC for ALU/MEM/WB to R31 L23 – Pipeline Issues 16 6.004 – Spring 20095/5/09Linkage Register Write Timing | The code: Assume Reg[LP] = 100... ADD(r31, r31, LP) BR(f, LP) | Reg[LP] = .+4 x:SUB(LP, 1, LP) ...f:XOR(LP, r31, r0) OR(r31, LP, r1) ADD(r31, LP, r2) IFRFALU MEMWBii+1i+2i+3i+4i+5i+6BR Decision Time ADDBRSUBXORORADDADDBRNOPXORORADDADDBRNOPXORORADDBRNOPXORADDBRADD writes BR writes Can we make XOR’s regfile access work by bypassing? NOPL23 – Pipeline Issues 17 6.004 – Spring 20095/5/09BR/JMPPC bypass BR(…,r28)+4InstructionMemoryADPCSELJTXAdrILLOPRb:<15:11>Ra:<20:16>Rc:<25:21>Instruction FetchRegister FileWAWDWEYMEMALUABRegister FileALUWrite BackMemoryRc:<25:21>Register FileRA1RA2RD1RD2PCRF+4+4*SXT(C)Data MemoryRDYPCSEL00RA2SELIRRF+WDSEL0 1 201BSELA, B BYPASSZJTALUFNWERF01AnnulIFNOP01234WDAdrR/WDALUBIRALUA01A,B Bypass A,B Bypass NOPXOR(r28…)DMEMIRMEMPCRFPCALUPCMEMYWBA, B BYPASSA, B BYPASSIRWBPCWBBYPASSESBYPASSESSXT(C)01ASELFor BR/JMP, Rc value is taken from PC, not ALU. So we have to add bypass paths for PCALU and PCMEM.PCWB is already taken care of if we bypass WB stage from output of WDSEL mux.01AnnulRFNOPPCRF+4+4*SXT(C)STALLSTALLSTALLL23 – Pipeline Issues 18 6.004 – Spring 20095/5/09Unused Opcode Traps IDEA: TRAP illegal instructions to a special routine in the Operating System, which can • Interpret them in software; or • Print humane error report. | User program: ...BAD(...)| Illegal instr. r:...| Operating System: UUO Handler IllOp:ST(r0,...)| Save a reg, LD(xp,-4,r0)| Fetch bad instr ...LD(...,r0)| Restore regs, JMP(xp)| Return to pgm. IMPLEMENTATION: On Bad Opcode (discovered in RF Stage): • Select IllOp adr as next PC • Annul instruction in IF stage • Substitute BNE(r31,0,XP) for bad instruction -will (eventually) store PC+4 into XP ... need bypass paths to make XP usable immediately by code at IllOp! L23 – Pipeline Issues 19 6.004 – Spring 20095/5/09Illegal Opcode Traps +4InstructionMemoryADPCSELJTXAdrILLOPRb:<15:11>Ra:<20:16>Rc:<25:21>Instruction FetchRegister FileWAWDWEYMEMALUABRegister FileALUWrite BackMemoryRc:<25:21>Register FileRA1RA2RD1RD2PCRF+4+4*SXT(C)Data MemoryRDYPCSEL00RA2SELIRRF+WDSEL0 1 201BSELA, B BYPASSZJTALUFNWERF01AnnulIFNOPBNE(R31,0,XP)01234WDAdrR/WDALUBIRALUA01BADBNE(…,XP)DMEMIRMEMPCRFPCALUPCMEMYWBA, B BYPASSA, B BYPASSIRWBPCWBBYPASSESBYPASSESSXT(C)01ASEL01AnnulRFNOPPCRF+4+4*SXT(C)STALLSTALLSTALL2A, B BYPASSA, B BYPASS???NOPBad opcode decoded in RF stage: • PC address of IllOp handler • Annul instruction in IF • Force BNE(R31,0,XP) in RF stage will save PC+4 in XP when it reaches WB stage L23 – Pipeline Issues 20 6.004 – Spring 20095/5/09Taking Exception… In general, we’d like to annul ALL instructions following one that causes a trap or fault: FREEZE state at time of exception, for inspection by handler code. ILLEGAL INSTRUCTIONS are recognizable in RF stage of pipe; are ALL faults & traps? CONSIDER: ARITHMETIC EXCEPTIONS: divide by zero, etc. • Caught by ALU subsystem, during processing of data in ALU stageMEMORY FAULTS: Program reference to illegal memory location... • Caught by MEMORY subsystem, during processing of address input in MEM stage L23 – Pipeline Issues 21 6.004 – Spring 20095/5/09Annulment Logic +4InstructionMemoryADPCSELJTXAdrILLOPRb:<15:11>Ra:<20:16>Rc:<25:21>Instruction FetchRegister FileWAWDWEYMEMALUABRegister FileALUWrite BackMemoryRc:<25:21>Register FileRA1RA2RD1RD2PCRF+4+4*SXT(C)Data MemoryRDYPCSEL00RA2SELIRRF+WDSEL0 1 201BSELA, B BYPASSZJTALUFNWERF01AnnulIFNOPBNE(R31,0,XP)01234WDAdrR/WDALUBIRALUA01DMEMIRMEMPCRFPCALUPCMEMYWBA, B BYPASSA, B BYPASSIRWBPCWBBYPASSESBYPASSESSXT(C)01ASEL01AnnulRFNOPPCRF+4+4*SXT(C)STALLSTALLSTALL2A, B BYPASSA, B BYPASSBNE(R31,0,XP)01AnnulALUNOP2BNE(R31,0,XP)01AnnulMEMNOP2Fault in ??? stage: • PC address of fault handler • Force BNE(R31,0,XP) in ??? stage will save PC+4 in XP when it reaches WB stage • Annul all followinginstructions (those earlier in the pipeline): called “flushing the pipe” MEMFAULTL23 – Pipeline Issues 22 6.004 – Spring 20095/5/09Asynchronous I/O Interrupts This should be easy. Take, for example, | The interrupted code: ...ADD(...)SUB(...)MUL(...)XOR(...)...| The interrupt handler: xh:OR(...)...JMP(xp)Interrupt Taken HERESuppose key struck, interrupt requested (via IRQ) during the fetch of ADD. Then let’s • Select XAdr (handler) as next PC• Leave ADD in pipeline; NO annulment! • Code handler to return to SUB instruction. Can this work??? Let’s find out… L23 – Pipeline Issues 23 6.004 – Spring 20095/5/09Asynchronous Interrupt Timing ...ADD(...)SUB(...)MUL(...)XOR(...)...xh:OR(...) | interrupt handler ...JMP(xp)Interrupt Taken HEREIFRFALU MEMWBii+1i+2i+3i+4i+5i+6...ADD...ORADD...ORADD...ORADD...ORADD...ORPROBLEM: When does old PC+4 get written to XP??? Interrupt Occurs PC = Xadr?? L23 – Pipeline Issues 24 6.004 – Spring 20095/5/09Making Interrupts Work Alternative: When taking interrupt, • ANNUL instruction in IF stage... BUT instead of changing it to a NOP, change it to BNE(r31,0,XP) This will cause PC+4 of annulled instruction to be written to XP! • CODE HANDLER to return to Reg[XP]-4 (since the annulled instruction is never executed) IFRFALU MEMWBii+1i+2i+3i+4i+5i+6ADDBNEBNEBNEBNE......OR...OR...OR...ORORInterrupt taken L23 – Pipeline Issues 25 6.004 – Spring 20095/5/09“Smart” Interrupt Handler | The interrupted code: ...ADD(...)SUB(...)MUL(...)XOR(...)...| The interrupt handler: xh:OR(...)...SUBC(xp,4,xp) | Adjust XP so weJMP(xp) | return to annulled | instruction (ADD) Interrupt taken HERE, ADD instruction annulled L23 – Pipeline Issues 26 6.004 – Spring 20095/5/095-stage Pipeline: Final Version +4InstructionMemoryADPCSELJTXAdrILLOPRb:<15:11>Ra:<20:16>Rc:<25:21>Instruction FetchRegister FileWAWDWEYMEMALUABRegister FileALUWrite BackMemoryRc:<25:21>Register FileRA1RA2RD1RD2PCRF+4+4*SXT(C)Data MemoryRDYPCSEL00RA2SELIRRF+WDSEL0 1 201BSELA, B BYPASSZJTALUFNWERF01AnnulIFNOPBNE(R31,0,XP)01234WDAdrR/WDALUBIRALUA01DMEMIRMEMPCRFPCALUPCMEMYWBA, B BYPASSA, B BYPASSIRWBPCWBBYPASSESBYPASSESSXT(C)01ASEL01AnnulRFNOPPCRF+4+4*SXT(C)STALLSTALLSTALL2A, B BYPASSA, B BYPASSBNE(R31,0,XP)01AnnulALUNOP2BNE(R31,0,XP)01AnnulMEMNOP22BNE(R31,0,XP)• Can annul instruction at each stage • Can force instruction to BNE(R31,0,XP) in each stagewill save PC+4 in XP when it reaches WB stage• Can stall IF and RF stages while waiting for LD result to reach WB • Can bypass results from ALU, MEM and WB back to RF L23 – Pipeline Issues 27 6.004 – Spring 20095/5/09Pipeline Review Simple unpipelined Beta: •1 cycle/instruction •long cycle time: mem+regs+alu+mem2-Stage pipeline: •increased throughput (<2x) •introduced branch delay slots Choice of executing or annulling inst. after branch 5-stage pipeline: •increased throughput (3x???) •branch delay slots •delayed register writeback (3 cycles) Add bypass paths (10) to access correct value•memory data available only in WB stage Introduce NOPs at IRALU, stall IF and RF stages until LD result ready •handle RF/ALU/MEM stage exceptions Save PC+4 in XP (fake a BR) annul following insts. (those earlier in pipeline) •implement interrupts Throw away IF inst., save PC+4 in XP, fix return •extra HW due to pipelining Registers to hold values between stagesData bypass muxes in RF stage Inst. muxes for “rewriting” code to annul or save PCL23 – Pipeline Issues 28 6.004 – Spring 20095/5/09RISC = Simplicity??? Generalization of registers and operand coding Complex instructions, addressing modes Addressing features, eg index registers RISCsPrimitive Machines: directimplementations VLIWs, Super-Scalars ?Pipelines, Bypasses, Annulment, …, ... “The P.T. Barnum World’s Tallest Dwarf Competition” World’s Most Complex RISC?