MIT OpenCourseWarehttp://ocw.mit.edu For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 6.004 Computation Structures Spring 2009 L24– Parallel Processing 1 6.004 Spring 20095/7/09Parallel Processing modified 5/4/09 10:08 L24– Parallel Processing 2 6.004 Spring 20095/7/09The Home Stretch TODAY 5/7:Lab 8 (LAST!) due Friday 5/8 section: LAST QUIZ (#5)!Tu 5/12: Wrapup (LAST!) Lecture! Wednesday 5/13: NO SECTION MEETINGS!Optional DESIGN PROJECT due ALL (late) Assignments due Immense Satisfaction/Rejoicing/Relief/Celebration/Wild Partying.L24– Parallel Processing 3 6.004 Spring 20095/7/09Taking a step back DynamicExecution Path Static Code Path Length = number of instructions along path aka “Thread of Execution” loop: LD(n, r1) CMPLT(r31, r1, r2) BF(r2, done) LD(r, r3) LD(n,r1) MUL(r1, r3, r3) ST(r3, r) LD(n,r1) SUBC(r1, 1, r1) ST(r1, n) BR(loop) done: L24– Parallel Processing 4 6.004 Spring 20095/7/09We have been building machines to execute one thread (quickly) Beta Processor Memory Execution Thread Path Length x Clocks-per-Instruction Time = Clocks-per-second L24– Parallel Processing 5 6.004 Spring 20095/7/09Can we make CPI < 1 ? Two Places to Find Parallelism Instruction Level (ILP) – Fetch and issue groups of independent instructions within a thread of execution Thread Level (TLP) – Simultaneously execute multiple execution streams…Implies we can complete more than one instruction each clock cycle! L24– Parallel Processing 6 6.004 Spring 20095/7/09Instruction-Level Parallelism Sequential Code This is okay, but smarter coding does better in this example! loop: LD(n, r1) CMPLT(r31, r1, r2) BF(r2, done) LD(r, r3) LD(n,r1) MUL(r1, r3, r3) ST(r3, r) LD(n,r4) SUBC(r4, 1, r4) ST(r4, n) BR(loop) done: What if I tried to do multiple iterations at once? loop: LD(n, r1) CMPLT(r31, r1, r2) BF(r2, done) LD(r, r3) LD(n,r1) LD(n,r4) MUL(r1, r3, r3) SUBC(r4, 1, r4) ST(r3, r) ST(r4, n) BR(loop) done: “Safe” Parallel Code L24– Parallel Processing 7 6.004 Spring 20095/7/09Superscalar Parallelism -Popular now, but the limits are near (8-issue) -Multiple instruction dispatch -Speculative execution L24– Parallel Processing 8 6.004 Spring 20095/7/09SIMD Processing (Single Intruction Multiple Data) Each datapath has its own local data (Register File) All data paths execute the same instruction Conditional branching is difficult…(What if only one CPU has R1 = 0?) Conditional operations are common in SIMD machines if (flag1) Rc = Ra Rb Global ANDing or ORing of flag registers are used for high-level controlReg File ALUPC+1 or Branch Reg File ALUReg File ALUReg File ALUData Memory Instruction Memory addraddrdatadataAddressing UnitControl Model: “hide” parallelism in primitives (eg, vector operations) This sort of construct is also becoming popular on modern uniprocessors L24– Parallel Processing 9 6.004 Spring 20095/7/09SIMD Coprocessing Units SIMD data path added to a traditional CPU core Register-only operands Core CPU handles memory trafficPartitionable Datapaths for variable-sized “PACKED OPERANDS” Reg File 64-bit ALU646464“Intel MMX, SSE” “Sparc VIS” L24– Parallel Processing 10 6.004 Spring 20095/7/09SIMD Coprocessing Units SIMD data path added to a traditional CPU core Register-only operandsCore CPU handles memory trafficPartitionable Datapaths for variable-sized “PACKED OPERANDS” Reg File 32-bit ALU64646432-bit ALUTwo 32-bit ALUs FA a b sco ciFA a b sco ciA32 B32 A31 B31 S32 S31 ......L24– Parallel Processing 11 6.004 Spring 20095/7/09SIMD Coprocessing Units SIMD data path added to a traditional CPU core Register-only operands Core CPU manages memory trafficPartitionable Datapaths for variable-sized “PACKED OPERANDS” Reg File 16-bit ALU64646416-bit ALU16-bit ALU16-bit ALUFour 16-bit ALUs Nice data size for: Graphics, Signal Processing, Multimedia Apps, etc. L24– Parallel Processing 12 6.004 Spring 20095/7/09SIMD Coprocessing Units SIMD data path added to a traditional CPU core Register-only operands Core CPU manages memory trafficPartitionable Datapaths for variable-sized “PACKED OPERANDS” Reg File 646464Eight8-bit ALUs 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU MMX instructions: PADDB -add bytes PADDW -add 16-bit words PADDD -add 32-bit words (unsigned & w/saturation) PSUB{B,W,D} – subtract PMULTLW – multiply low PMULTHW – multiply high PMADDW – multiply & add PACK – UNPACK – PAND – POR -L24– Parallel Processing 13 6.004 Spring 20095/7/09VLIW Variant of SIMD Parallelism (Very Long Instruction Word) A single-WIDE instruction controls multiple heterogeneous datapaths.Exposes parallelism to compiler (S/W vs. H/W) Register File Integer ALU #1 Floating Point Adder Integer ALU #2 Floating Point MultiplierFP Regs Instr. Fetch & Branch Prediction Load Store UnitInstr$Data $Memory Interface IOP1RC1RA1RB1IOP2RC2RA2RB2FOPFD1FA1FB1FD2FA2FB2MemOPL24– Parallel Processing 14 6.004 Spring 20095/7/09Multiple Instruction Streams: MIMD Exploiting Thread Level Parallelism All processors share a common main memory Leverages existing CPU designs Easy to map “Processes (threads)” to “Processors” Share data and program Communicate through shared memory UpgradeableProblems:ScalabilitySynchronization$$$$$Main Memory SMP – Symmetric Multi-Processor One thread per processor L24– Parallel Processing 15 6.004 Spring 20095/7/09Hmmm….does it even work? P1$1:x = 1 y = 2 Shared Memory, x = 1, y = 2 P2$2:x = 1 y = 2 Process Ax = 3; print(y);Process By = 4; print(x);Consider the following trivial processes running on P1 and P2:L24– Parallel Processing 16 6.004 Spring 20095/7/09What are the Possible Outcomes? SEQUENCEA printsB prints x=3; print(y); y=4; print(x);21x=3; y=4; print(y); print(x);21x=3; y=4; print(x); print(y);21y=4; x=3; print(x); print(y);21y=4; x=3; print(y); print(x);21y=4; print(x); x=3; print(y);21Plausible execution sequences: Process Ax = 3; print(y);Process By = 4; print(x);$1:x = 1 y = 2 $2:x = 1 y = 2 Hey, we get the same answer every time… Let’s go build it! X 3 X 4 L24– Parallel Processing 17 6.004 Spring 20095/7/09Uniprocessor Outcome But, what are the possible outcomes if we ran Process A and Process B on a single timed-shared processor?SEQUENCEA printsB prints x=3; print(y); y=4; print(x);23x=3; y=4; print(y); print(x);43x=3; y=4; print(x); print(y);43y=4; x=3; print(x); print(y);43y=4; x=3; print(y); print(x);43y=4; print(x); x=3; print(y);41Plausible Uniprocessor execution sequences: Process Ax = 3; print(y);Process By = 4; print(x);Notice that the outcome 2, 1 does not appear in this list! L24– Parallel Processing 18 6.004 Spring 20095/7/09Sequential Consistency Semantic constraint: Result of executing N parallel programs should correspond to someinterleaved execution on a single processor. Possible printed values: 2, 3; 4, 3; 4, 1. (each corresponds to at least one interleaved execution) IMPOSSIBLE printed values: 2, 1 (corresponds to NO valid interleaved execution). Process Ax = 3; print(y);Process By = 4; print(x);Shared Memoryint x=1, y=2; Weren’t caches supposed to be invisible to programs? L24– Parallel Processing 19 6.004 Spring 20095/7/09Cache Incoherence PROBLEM: “stale” values in cache ... Process By = 4; print(x);Process Ax = 3; print(y);Q: How does B know that A has changed the value of x?P1$1:x=3 y=2 Shared Memory P2$2:x=1 y=4 x=3, y=4 Does WRITE-THRU help? _______! NOThe problem is not that memory has stale values, but that other caches may! L24– Parallel Processing 20 6.004 Spring 20095/7/09“Snoopy” Caches P1$1:x=1 y=2 Shared Memory P2$2:x=1 y=2 x=1, y=2 IDEA:• P1 writes 3 into x; write-thru cache causes bus transaction. • P2, snooping, sees transaction on bus. INVALIDATES or UPDATES its cached x value. Presume WRITE-THRU caches! X 3 X 3 X 3 L24– Parallel Processing 21 6.004 Spring 20095/7/09Snoopy Cache Design Two-bit STATE in cache line encodes one of M, E, S, I states (“MESI” cache): INVALID: cache line unused. SHARED ACCESS: read-only, valid, not dirty. Shared with other read-only copies elsewhere. Must invalidate other copies before writing. EXCLUSIVE: exclusive copy, not dirty. On write becomes modified. MODIFIED: exclusive access; read-write, valid, dirty. Must be written back to memory eventually; meanwhile, can be written or read by local processor. 4-state FSM for each cache line! (FREE!!:Can redefine VALID and DIRTY bits)CurrentstateRead HitRead Miss,Snoop HitRead Miss,Snoop MissWrite HitWrite MissSnoop forReadSnoop forWriteModifiedModifiedInvalid(Wr-Back)Invalid(Wr-Back)ModifiedInvalid(Wr-Back)Shared(Push)Invalid(Push)ExclusiveExclusiveInvalidInvalidModifiedInvalidSharedInvalidSharedSharedInvalidInvalidModified(Invalidate)InvalidSharedInvalidInvalidXShared(Fill)Exclusive(Fill)XModified(Fill-Inv)XXL24– Parallel Processing 22 6.004 Spring 20095/7/09Who needs Sequential Consistency, anyway? ALTERNATIVE MEMORY SEMANTICS: “WEAK” consistency EASIER GOAL: Memory operations from each processor appear to be performed in order issued by that processor; Memory operations from different processors may overlap in arbitrary ways (not necessarily consistent with any interleaving). ALTERNATIVE APPROACH:• Weak consistency, by default; • MEMORY BARRIER instruction: stalls processor until all previous memory operations have completed. L24– Parallel Processing 23 6.004 Spring 20095/7/09MIMD Multicore Arrays •Can Leverage existing CPU designs /development tools •H/W focuses on communication 2-D Mesh /cache hierarchy/…) •S/W focuses on partitioning, extracting parallelism •“speculative execution” hacks http://www.tilera.com/•16 cores/32 thr •250W @2.3GHz •Transactional Mem •Thread Speculation •“scout” threads L24– Parallel Processing 24 6.004 Spring 20095/7/09Parallel Processing Summary Prospects for future CPU architectures:Pipelining -Well understood, but mined-out Superscalar -At its practical limits SIMD -Limited use for special applications VLIW -Returns controls to S/W… but inflexible Prospects for future Computer System architectures:Single-thread limits: forcing multicores, parallelism Brains work well, with dismal clock rates … parallelism? Needed: NEW models, NEW ideas, NEW approaches FINAL ANSWER: Its up to YOUR generation! Figure by MITOpenCourseWare. Block diagram of a multicore processor.•64 cores •22 W @700MHz •2D Mesh
Description
Here we discussed about parallel processing, Shared memory abd cache coherence topics such as Paraller Processing , Instruction level processing, Super scalar parallelism,SIMD processing, SIMD coprocessing units,Sequential coherence,Cache incoherence and snoopy caches.
“Prof. Steve Ward, 6.004-24, Parallel processing, shared memory, cache coherence, consistency criteria,6.004 Computation Structures, Electrical Engineering and Computer Science , Engineering, Massachusetts Institute of Technology: MIT Open Course Ware,http://ocw.mit.edu (11-08-2011).License: Creative Commons BY-NC-SA: http://ocw.mit.edu/terms/#cc".