Network Properties, Scalability and Requirements For
48 Slides1.69 MB
Network Properties, Scalability and Requirements For Parallel Processing Scalable Parallel Performance: Continue to achieve good parallel performance "speedup"as the sizes of the system/problem are increased. Scalability/characteristics of the parallel system network play an important role in determining performance scalability of the parallel architecture. Scalable Network N Size of Network Number of Nodes Communication assist (CA) Mem Compute Nodes Generic “Scalable” Multiprocessor Architecture P Node: processor(s), memory system, plus communication assist: Network interface and communication controller. Scalable network. 1 2 Two Aspects of Network Scalability: Performance and Cost/Complexity Function of a parallel machine network is to efficiently transfer information from source node to destination node in support of network transactions that realize the programming model. 1 Network performance should scale up as its size is increased. i.e network performance scalability Latency grows slowly with network size N. e.g O(log2 N) vs. O(N2) Total available bandwidth scales up with network size. e.g O(N) 2 Network cost/complexity should grow slowly in terms of network size. i.e network cost/complexity scalability O(Nlog (PP Chapter 1.3, PCAe.g. Chapter 10) 2 N) as Nopposed to O(N2) Size of Network (Number of Nodes) CMPE655 - Shaaban #1 lec # 8 Fall 2015 11-10-2015
Network Requirements For Parallel Computing } 1. Low network latency even when approaching network capacity. 2. High sustained bandwidth that matches or exceeds the communication requirements for given computational rate. For A given 3. High network throughput: Network should support as many concurrent transfers as possible. network Size 4. Low Protocol overhead. 5. Cost/complexity and performance Scalable: ToMinimum reduce communication overheads, O – Cost/Complexity Scalability: network cost/complexity increase as network size increases. – In terms of number of links/switches, node degree etc. Performance Scalability: Network performance should scale up with network size. Latency grows slowly with network size. - Total available bandwidth scales up with network size. As network Size Increases - Scalable Interconnection Network Scalable network Two Aspects of Network Scalability: Performance and Complexity network interface CA M CA P Nodes M P CMPE655 - Shaaban #2 lec # 8 Fall 2015 11-10-2015
Cost of Communication Given amount of comm (inherent or artifactual), goal is to reduce cost Cost of communication as seen by process: Cost of a message C f*(o l i.e total number of messages n B t - overlap) c Communication Cost: Actual time added to parallel execution time as a result of communication Latency of a message f frequency of messages o overhead per message (at both ends) l network delay per message n data sent for per message B bandwidth along path (determined by network, NI, assist) tc cost induced by contention per message overlap amount of latency hidden by overlap with comp. or comm. – Portion in parentheses is cost of a message (as seen by processor) – That portion, ignoring overlap, is latency of a message – Goal: reduce terms in latency and increase overlap CMPE655 - Shaaban From lecture 6 #3 lec # 8 Fall 2015 11-10-2015
Network Representation & Characteristics A parallel machine interconnection network is a graph V {switches or processing nodes} Routers connected by communication channels or links C V V Each channel has width w bits and signaling rate f 1/ is clock cycle time) – Channel bandwidth b wf bits/sec frequency – Phit (physical unit) data transferred per cycle (usually channel width w). – Flit - basic unit of flow-control (minimum data unit transferred across a link). Flow Unit Number of channels per node or switch is switch or node degree. (frame) Sequence of switches and links Unit or frame message the network a Flow route. Phit followed W W in W Flit isi.e Wby a W or data link layer unit – Routing Distance: number of links or hops h on route from source to destination. A network is generally characterized by: – Type of interconnection. – Topology. 1 2 3 S Source D Destination – Routing Algorithm. h 3 hops in route from S to D – Switching Strategy. Static (point-to-point) or Dynamic – Flow Control Mechanism. Network node connectivity/ interconnection structure of the network graph Deterministic (static) or Adaptive (dynamic) Packet or Circuit Switching Store & Forward (SF) or Cut-Through (CT) CMPE655 - Shaaban #4 lec # 8 Fall 2015 11-10-2015
Network Characteristics Type of interconnection: 1 – Static, Direct Dedicated (or point-to-point) Interconnects: Nodes connected directly using static point-to-point links. Such networks include: or channels – Fully connected networks , Rings, Meshes, Hypercubes etc. – Dynamic or Indirect Interconnects: 2 Switches are usually used to realize dynamic links (paths or virtual circuits ) between nodes instead of fixed point-to-point connections. Each node is connected to specific subset of switches. Dynamic connections are usually established by configuring switches based on communication demands. Such networks include: – Shared-, broadcast-, or bus-based connections. (e.g. Ethernet-based). – Single-stage Crossbar switch networks. Wireless Networks ? – Multi-stage Interconnection Networks (MINs) including: Omega Network, Baseline Network, Butterfly Network, etc. One large switch, Size NxN CMPE655 - Shaaban #5 lec # 8 Fall 2015 11-10-2015
Network Characteristics Network Topology: Or Network Graph Connectivity Physical interconnection structure of the network graph: – Node connectivity: Which nodes are directly connected nodes or switches Related: Bisection Bandwidth Bisection width x link bandwidth – Symmetry: The property that the network looks the same from every node. – Homogeneity: Whether all the nodes and links are identical or not. { Simplify Mapping Hop link channel in route CMPE655 - Shaaban #6 lec # 8 Fall 2015 11-10-2015 Network Topology Properties – Total number of links needed: Impacts network cost/total bandwidth Network Complexity – Node Degree: Number of channels per node. – Network diameter: Minimum routing distance in links or hops between the the farthest two nodes . – Average Distance in hops between all pairs of nodes . – Bisection width: Minimum number of links whose removal disconnects the network graph and cuts it into approximately two equal halves.
Network Topology and Requirements for Parallel Processing 1 2 3 4 5 For Cost/Complexity Scalability: The total number of links, node degree and size/number of switches used should grow slowly as the size of the network is increased. For Low network latency: Small network diameter, average distance are desirable (for a given network size). For Latency Scalability: The network diameter, average distance should grow slowly as the size of the network is increased. For Bandwidth Scalability: The total number of links should increase in proportion to network size. To support as many concurrent transfers as possible (High network throughput): A high bisection width is desirable and should increase proportional to network size. – Needed to reduce network contention and hot spots. More on this later in the lecture CMPE655 - Shaaban #7 lec # 8 Fall 2015 11-10-2015
Network Characteristics End-to-End Routing Algorithm and Functions: – The set of paths that messages may follow. 12- Deterministic Routing: The route taken by a message determined by source and destination Deterministic regardless(static) of other Routing: traffic in the network. Adaptive Routing: One of multiple routes from source to destination selected to account for traffic to Routing: reduce node/link contention. Adaptiveother (dynamic) Switching Strategy: – Circuit switching vs. packet switching. Flow Control Mechanism: – When a message or portions of it moves along its route: Store & Forward (SF)Routing, Done at/by Data Link Layer? Cut-Through (CT) or Worm-Hole Routing. (usually uses circuit switching) – What happens when traffic is encountered at a node: 1 Link/Node Contention handling. Deadlock prevention. AKA pipelined routing 2 Broadcast and multicast capabilities. Switch routing delay. Link bandwidth. e.g use buffering b CMPE655 - Shaaban #8 lec # 8 Fall 2015 11-10-2015
Network Characteristics Hardware/software implementation complexity/cost. Network throughput: Total number of messages handled by network per unit time. Aggregate Network bandwidth: Similar to network throughput but given in total bytes/sec. Network hot spots: Form in a network when a small number of network nodes/links handle a very large percentage of total network traffic and become saturated. Network scalability: Large – The feasibility of increasing network size, determined by: Contention Delay tc Performance scalability: Relationship between network size in terms of number of nodes and the resulting network performance (average latency, aggregate network bandwidth). Cost scalability: Relationship between network size in terms of number of nodes/links and network cost/complexity. Also number/size of switches for dynamic networks CMPE655 - Shaaban #9 lec # 8 Fall 2015 11-10-2015
Communication Network Performance : Network Latency Time to transfer n bytes from source to destination: Time(n)s-d overhead routing delay channel occupancy contention delay S Source D Destination i.e. Network Latency Unloaded Network Latency routing delay channel occupancy O i.e. no contention delay tc channel occupancy (n n e) / b b channel bandwidth, bytes/sec n payload size ne packet envelope: header, trailer. Effective link bandwidth bn / (n n e) i.e. transmission time The term for unloaded network latency is refined next by examining the impact of flow control mechanism used in the network Added to payload Next channel occupancy transmission time CMPE655 - Shaaban #10 lec # 8 Fall 2015 11-10-2015
Flow Control Mechanisms: Usually Done by Data Link Layer Store&Forward (SF) Vs. Cut-Through (CT) Routing AKA Worm-Hole or pipelined routing Cut-Through Routing Store & Forward Routing Source Dest 32 1 0 3 2 1 0 3 2 1 3 2 3 Dest 0 3 2 1 0 3 2 1 0 3 2 1 0 3 2 1 0 3 2 1 0 1 0 2 1 0 3 2 1 0 3 2 1 3 2 3 i.e. no contention delay tc 0 3 2 1 0 1 0 2 1 0 3 2 1 0 3 2 1 Time 0 Unloaded network latency for n byte packet (message): h(n/b ) vs n/b h h distance in hops switch delay Channel occupancy (number of links in route) b link bandwidth n size of message in bytes Routing delay CMPE655 - Shaaban #11 lec # 8 Fall 2015 11-10-2015
Store &Forward (SF) Vs. Cut-Through (CT) Routing Example Example: For a route with h 3 hops or links, unloaded S 1 Source n/b i.e No contention delay tc Source 1 2 n/b Store & Forward (SF) n/b Tsf (n, h) h( n/b ) 3( n/b ) Source n/b n/b b link bandwidth h distance in hops 2 Destination Time n size of message in bytes switch delay Cut-Through (CT) AKA Worm-Hole or pipelined routing 3 n/b D 3 1 Destination Route with h 3 hops from S to D 2 3 Destination Time Tct (n, h) n/b h n/b 3 Channel occupancy Routing delay CMPE655 - Shaaban #12 lec # 8 Fall 2015 11-10-2015
Communication Network Performance : Refined Unloaded Network Latency Accounting For Flow Control (i.e no contention, Tc 0) For an unloaded network (no contention delay) the network latency to transfer an n byte packet (including packet envelope) across the network: Unloaded Network Latency channel occupancy routing delay For store-and-forward (sf) routing: Unloaded Network Latency Tsf (n, h) h( n/b ) For cut-through (ct) routing: Unloaded Network Latency Tct (n, h) n/b h b channel bandwidth h distance in hops n bytes transmitted switch delay (number of links in route) channel occupancy transmission time CMPE655 - Shaaban #13 lec # 8 Fall 2015 11-10-2015
* Reducing Unloaded Network Latency (i.e no contention, Tc 0) 1 Use cut-through routing: Channel occupancy Routing delay – Unloaded Network Latency Tct (n, h) n/b h 2 Reduce number of links or hops h in route: how? – Map communication patterns to network topology e.g. nearest-neighbor on mesh and ring; all-to-all Applicable to networks with static or direct point-to-point interconnects: Ideally network topology matches problem communication patterns. Increase link bandwidth b. 4 Reduce switch routing delay 3 * Unloaded implies no contention delay tc CMPE655 - Shaaban #14 lec # 8 Fall 2015 11-10-2015
Mapping of Task Communication Patterns to Topology Example Task Graph: T1 Parallel System Topology: P6 110 3D Binary Hypercube T2 T3 T4 P7 P4 111 P5 100 101 T5 P2 Poor Mapping: h 2 or 3 T1 runs on P0 T2 runs on P5 T3 runs on P6 T4 runs on P7 T5 runs on P0 010 P0 From lecture 6 011 P1 000 001 Better Mapping: Communication from T1 to T2 requires 2 hops Route: P0-P1-P5 Communication from T1 to T3 requires 2 hops Route: P0-P2-P6 Communication from T1 to T4 requires 3 hops Route: P0-P1-P3-P7 Communication from T2, T3, T4 to T5 similar routes to above reversed (2-3 hops) P3 h 1 T1 runs on P0 T2 runs on P1 T3 runs on P2 T4 runs on P4 T5 runs on P0 Communication between any two communicating (dependant) tasks requires just 1 hop h number of hops h in route from source to destination CMPE655 - Shaaban #15 lec # 8 Fall 2015 11-10-2015
Available Effective Bandwidth Factors affecting effective local link bandwidth available to a single node: – Accounting for Packet density b x n/(n ne) n Message Envelope 1 – Also Accounting for Routing delay b x n / (n n e w ) (headers/trailers) 2 – Contention: At endpoints. tc 3 Routing delay Within the network. At Communication Assists (CAs) tc Factors affecting throughput or Aggregate bandwidth: – Network bisection bandwidth: Sum of bandwidth of smallest set of links when removed partition the network into two 1 unconnected networks of equal size. – Total bandwidth of all the C channels: Cb bytes/sec, Cw bits per cycle or C phits per cycle. – Suppose N hosts each issue a message every M cycles with average routing distance h and 2 average distribution: Each message occupies h channels for l n/w cycles of size n bytes Total network load Nhl / M phits per cycle. Example Average Link utilization Total network load / Total bandwidth i.e uniform distribution over all channels Average Link utilization: Nhl /MC 1 e C phits Should be less than 1 Phit w channel width in bits b channel bandwidth n message size Note: equation 10.6 page 762 in the textbook is incorrect CMPE655 - Shaaban #16 lec # 8 Fall 2015 11-10-2015
Network Saturation 0.8 70 0.7 Latency 60 Delivered Bandwidth 80 High queuing Delays 50 40 Saturation 30 20 10 Link utilization 1 0.6 0.5 1 0.4 Saturation 0.3 1 0.2 0.1 0 0 0 0.2 0.4 0.6 0.8 Delivered Bandwidth 1 0 0.2 0.4 0.6 0.8 1 1.2 Potential or Offered Bandwidth Indications of Network Saturation Large Contention Delay tc CMPE655 - Shaaban #17 lec # 8 Fall 2015 11-10-2015
Network Performance Factors: Contention tc Network Hot Spots Network hot spots: Form in a network when a small number of network nodes/links handle a very large percentage of total network traffic and become saturated. Caused by communication load imbalance creating a high level of contention at these few nodes/links. Or messages Contention: Several packets trying to use the same link/node at same time. – May be caused by limited available buffering. – Possible resolutions/prevention: Drop one or more packets (once contention occurs). Increased buffer space. Use an alternative route (requires an adaptive routing i.e. algorithm Dynamic or routing to distribute load more evenly). To Prevent: Use a network with better bisection width (more routes). { i.e to resolve contention a better static Most networks used in parallel machines block in place: – Link-level flow control. Reduces hot spots and contention – Back pressure to the source to slow down flow of data. Example Next Causes contention delay tc CMPE655 - Shaaban #18 lec # 8 Fall 2015 11-10-2015
Reducing node/link contention: AKA Dynamic Deterministic Routing vs. Adaptive Routing AKA Static Example: Routing in 2D Mesh 1 Deterministic (static) Dimension Order Routing in 2D mesh: Each packet carries signed distance to travel in each dimension [ x, y]. First move message along x then along y. e.g. x then y (always) 2 Adaptive (dynamic) Routing in 2D mesh: Choose route along x, y dimensions according to link/node traffic to reduce node/link contention. – More complex to implement. Y then X ? x X then Y (always) y 1 Deterministic Dimension Routing along x then along y (node/link contention) 2 Adaptive (dynamic) Routing (reduced node/link contention) CMPE655 - Shaaban #19 lec # 8 Fall 2015 11-10-2015
Sample Static Network Topologies (Static or point-to-point) 3D 2D Linear 4D 2D Mesh Ring Hybercube Higher link bandwidth Closer to root Binary Tree Fat Binary Tree Fully Connected CMPE655 - Shaaban #20 lec # 8 Fall 2015 11-10-2015
Static Point-to-point Connection Network Topologies Match network graph (topology) to task graph Direct point-to-point links are used. Suitable for predictable communication patterns matching topology. Fully Connected Network: Every node is connected to all other nodes using N- 1 direct links N(N-1)/2 Links - O(N2) complexity Node Degree: N -1 Diameter 1 Average Distance 1 Bisection Width (N/2)2 Linear Array: AKA 1D Mesh N-1 Links - O(N) complexity Node Degree: 1-2 Diameter N -1 Average Distance 2/3N Bisection Width 1 Ring: Route A - B given by relative address R B-A N Links - O(N) complexity Node Degree: 2 Diameter N/2 Average Distance 1/3N Bisection Width 2 AKA 1D Torus Or Cube Examples: Token-Ring, FDDI, SCI (Dolphin interconnects SAN), FiberChannel Arbitrated Loop, KSR1 N Number of nodes CMPE655 - Shaaban #21 lec # 8 Fall 2015 11-10-2015
Static Network Topologies Examples: Multidimensional Meshes and Tori Toruses? K0 Nodes K0 1D mesh 1D torus K1 4x4 2D mesh d-dimensional array or mesh: – N kd-1 X .X k0 nodes kj – Described by d-vector of coordinates (id-1, ., i0) – Where 0 ij kj -1 for 0 j d-1 4x4 2D torus 3D binary cube (AKA 2-ary cube or Torus) nodes in each of d dimensions kj may not be equal in each dimension A node is connected to nodes that differ by one in every dimension d-dimensional k-ary mesh: N kd – k d N or N kd N Number of – Described by d-vector of radix k coordinate. – Diameter d(k-1) k nodes in each of d dimensions d-dimensional k-ary torus (or k-ary d-cube): – Edges wrap around, every node has degree 2d and connected to nodes that differ by one (mod k) in every dimension. nodes Mesh N Total number of nodes CMPE655 - Shaaban #22 lec # 8 Fall 2015 11-10-2015
Properties of d-dimensional k-ary Meshes and Tori (k-ary d-cubes) Routing: Deterministic or static – Dimension-order routing (both). k nodes in each of d dimensions Relative distance: R (b d-1 - a d-1, . , b0 - a0 ) Traverse ri b i - a i hops in each dimension. a Source Node b Destination Node Diameter: – d(k-1) for mesh For k 2 Diameter d (for both) – d k/2 for cube or torus Average Distance: – d x 2k/3 for mesh. Number of Nodes: – dk/3 for cube or torus. – N kd for both Node Degree: Number of Links: – d to 2d for mesh. – dN - dk for mesh – 2d for cube or torus. – dN d kd for cube or torus Bisection width: (More links due to wrap-around links) – k d-1 links for mesh. – 2k d-1 links for cube or torus. N Number of nodes CMPE655 - Shaaban #23 lec # 8 Fall 2015 11-10-2015
Static (point-to-point) Connection K 4 nodes in each dimension Networks Examples: 2D Mesh k 4 Node (2-dimensional k-ary mesh) For an k x k 2D Mesh: k 4 Number of nodes N k2 Node Degree: 2-4 Network diameter: 2(k-1) No of links: 2N - 2k Bisection Width: k Where k N How to transform 2D mesh into a 2D torus? Here k 4 N 16 Diameter 2(4-1) 6 Number of links 32 -8 24 Bisection width 4 CMPE655 - Shaaban #24 lec # 8 Fall 2015 11-10-2015
Static Connection Networks Examples Hypercubes Also called binary d-cubes ( k-ary d-cubes or tori with k 2 2-ary d-cube) Dimension d log2N Number of nodes N 2d Diameter: O(log2N) hops d Dimension Good bisection width: N/2 O( N Log N) Complexity: Or: Binary d-cube 2-ary d-torus Binary d-torus Binary d-mesh 2-ary d-mesh? 2 – Number of links: N(log2N)/2 – Node degree is d log2N 5-D 0-D 1-D 2-D 3-D A node is directly connected to d nodes with addresses that differ from its address in only one bit 4-D CMPE655 - Shaaban #25 lec # 8 Fall 2015 11-10-2015
3-D Hypercube Static Routing Example 111 110 Message Routing Functions Example Dimension-order (E-Cube) Routing 010 3-D Hypercube 011 Network Topology: 3-dimensional static-link hypercube Nodes denoted by C2C1C0 100 101 000 Routing by least significant bit C0 1st Dimension 000 001 001 010 011 100 101 110 111 010 011 100 101 110 111 011 100 101 110 111 Routing by middle bit C1 2nd Dimension 000 001 Routing by most significant bit C2 3rd Dimension 000 001 010 For Hypercubes: Diameter max hops d here d 3 CMPE655 - Shaaban #26 lec # 8 Fall 2015 11-10-2015
Static Connection Networks Examples: Trees Binary Tree k 2 Height/diameter/ average distance: O(log2 N) Diameter and average distance are logarithmic. – k-ary tree, height d logk N – Address specified d-vector of radix k coordinates describing path down from root. Fixed degree k. Route up to common ancestor and down: – R B XOR A (Not for leaves, for leaves degree 1) – Let i be position of most significant 1 in R, route up i 1 levels – Down in direction given by low i 1 bits of B H-tree space is O(N) with O( N) long wires. Low Bisection Width 1 Good? Or Bad? CMPE655 - Shaaban #27 lec # 8 Fall 2015 11-10-2015
Static Connection Networks Examples: Fat-Trees Higher Bisection Width Than Normal Tree Higher link bandwidth/more links closer to root node Root Node Fat Tree “Fatter” higher bandwidth links (more connections in reality) as you go up, so bisection bandwidth scales with number of nodes N. Example: Network Why? To fix topology low bisection width problem used inin normal tree topology Thinking Machine CM-5 CMPE655 - Shaaban #28 lec # 8 Fall 2015 11-10-2015
Embedding A Binary Tree Onto A 2D Mesh Embedding: In static networks refers to mapping nodes of one network (or task graph?) onto another network while attempting to minimize Graph Matching? extra hops. 8 H-Tree Configuration to embed binary tree onto a 2D mesh 1 8 3 6 5 9 10 11 9 2 12 1 6 13 3 Root 2 4 4 12 10 7 13 14 5 11 14 7 15 15 A Additional nodes added to form the tree (PP, Chapter 1.3.2) i.e Extra hops CMPE655 - Shaaban #29 lec # 8 Fall 2015 11-10-2015
Embedding A Ring Onto A 2D Torus The 2D Torus has a richer topology/connectivity than a ring, thus it can embed it easily without any extra hops needed Ring: 2D Torus: Node Degree 2 Diameter N/2 Links N Bisection 2 Node Degree 4 Diameter 2 k/2 Links 2N 2 k2 Bisection 2k Here N 16 Diameter 8 Links 16 Here k 4 Diameter 4 Links 32 Bisection 8 Extra Hops Needed? Also: Embedding a binary tree onto a Hypercube is done without any extra hops CMPE655 - Shaaban #30 lec # 8 Fall 2015 11-10-2015
Dynamic Connection Networks Switches are usually used to dynamically implement connection paths or virtual circuits between nodes instead of fixed point-topoint connections. Dynamic connections are established by configuring switches based on communication demands. e.g Such networks include: – Bus systems. e.g. Wireless Networks? Shared links/interconnects 1– Multi-stage Interconnection Networks (MINs): 2 Omega Network. Baseline Network Butterfly Network, etc. Switch Control – Single-stage Crossbar switch networks. 1 (one N x N large switch) O(N2) Complexity? 1 2 2 1 1 2 2 1 Inputs Outputs 2 3 1 2 2x2 Switch A possible MINS Building Block CMPE655 - Shaaban #31 lec # 8 Fall 2015 11-10-2015
Dynamic Networks Definitions Permutation networks: Can provide any one-to-one mapping between sources and destinations. Strictly non-blocking: Any attempt to create a valid connection succeeds. These include Clos networks and the crossbar. Wide Sense non-blocking: In these networks any connection succeeds if a careful routing algorithm is followed. The Benes network is the prime example of this class. Rearrangeably non-blocking: Any attempt to create a valid connection eventually succeeds, but some existing links may need to be rerouted to accommodate the new connection. Batcher's bitonic sorting network is one example. Blocking: Once certain connections are established it may be impossible to create other specific connections. The Banyan and Omega networks are examples of this class. Single-Stage networks: Crossbar switches are single-stage, strictly nonblocking, and can implement not only the N! permutations, but also the NN combinations of non-overlapping broadcast. CMPE655 - Shaaban #32 lec # 8 Fall 2015 11-10-2015
Dynamic Network Building Blocks: Crossbar-Based NxN Switches Input Ports Receiver Input Buffer Switch Fabric Complexity O(N2) N Cross-bar Output Buffer Transmiter Output Ports N Or implement in stages then complexity O(NLogN) Total Switch Routing Delay Control Routing, Scheduling Implemented using one large N x N switch or by using multiple stages of smaller switches CMPE655 - Shaaban #33 lec # 8 Fall 2015 11-10-2015
Switch Components Output ports: – Transmitter (typically drives clock and data). Input ports: – Synchronizer aligns data signal with local clock domain. – FIFO buffer. Crossbar: i.e switch fabric – Switch fabric connecting each input to any output. – Feasible degree limited by area or pinout, O(n2) complexity. Buffering (input and/or output). for n x n crossbar Control logic: – Complexity depends on routing logic and scheduling algorithm. – Determine output port for each incoming packet. – Arbitrate among inputs directed at same output. – May support quality of service constraints/priority routing. CMPE655 - Shaaban #34 lec # 8 Fall 2015 11-10-2015
Switch Size And Legitimate States Switch Size Connections Input size All Legitimate States (includes broadcasts) 22 2X2 4X4 8X8 nXn 4 4 256 16,777,216 nn 4 88 Output size Example: Four states for 2x2 switch Permutation (i.e only one-to-one mappings no broadcast connections) 2! 4! 8! 2 24 40,320 n! (2 broadcast connections) 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 (2 permutation connections) For n x n switch: Complexity O(n2) n number of input or outputs CMPE655 - Shaaban #35 lec # 8 Fall 2015 11-10-2015
Permutations AKA Bijections (one to one mappings) For n objects there are n! permutations by which the n objects can be reordered. The set of all permutations form a permutation group with respect to a composition operation. One can use cycle notation to specify a permutation function. For Example: Cycle The permutation ( a, b, c)( One d, e) a a stands for the bijection (one to one) mapping: b b a b, b c , c a , d e , e d c c in a circular fashion. d d The cycle ( a, b, c) has a period of 3 and the cycle (d, e) e e has a period of 2. Combining the two cycles, the permutation has a cycle period of 2 x 3 6. If one applies the permutation six times, the identity mapping I ( a) ( b) ( c) ( d) ( e) is obtained. CMPE655 - Shaaban #36 lec # 8 Fall 2015 11-10-2015
e.g. For N 8 Perfect Shuffle Perfect shuffle is a special permutation function suggested by Harold Stone (1971) for parallel processing applications. Inverse Perfect Shuffle: rotate Obtained by rotating the binary address one position left. binary address one position right The perfect shuffle and its inverse for 8 objects are shown here: 000 000 000 000 001 001 001 001 010 010 010 010 011 011 011 011 100 100 100 100 101 101 101 101 110 110 110 110 111 111 111 111 Inverse Perfect Shuffle Perfect Shuffle (circular shift left one position) CMPE655 - Shaaban #37 lec # 8 Fall 2015 11-10-2015
Generalized Structure of Multistage Interconnection Networks (MINS) Fig 2.23 page 91 Kai Hwang ref. See handout CMPE655 - Shaaban #38 lec # 8 Fall 2015 11-10-2015
Multi-Stage Networks (MINS) Example: ISC The Omega Network In the Omega network, perfect shuffle is used as an inter-stage connection (ISC) pattern for all log2N stages. N size of network Routing is simply a matter of using the destination's address bits to set switches at each stage. 2x2 switches used Log2 N stages The Omega network is a single-path network: There is just one path between an input and an output. It is equivalent to the Banyan, Staran Flip Network, Shuffle Exchange Network, and many others that have been proposed. The Omega can only implement NN/2 of the N! permutations between inputs and outputs in one pass, so it is possible to have permutations that cannot be provided in one pass (i.e. paths that can be blocked). – For N 8, there are 84/8! 4096/40320 0.1016 10.16% of the permutations that can be implemented in one pass. It can take log2N passes of reconfiguration to provide all links. Because there are log2 N stages, the worst case time to provide all desired connections can be (log2N)2. ISC patterns used define MIN topology/connectivity Here, ISC used for Omega network is perfect shuffle CMPE655 - Shaaban #39 lec # 8 Fall 2015 11-10-2015
Multi-Stage Networks: The Omega Network ISC Perfect Shuffle a b 2 (i.e 2x2 switches used) Node Degree 1 bi-directional link or 2 uni-directional links Diameter log2 N (i.e number of stages) Bisection width N/2 N/2 switches per stage, log2 N stages, thus: Complexity O(N log2 N) Fig 2.24 page 92 Kai Hwang ref. See handout (for figure) CMPE655 - Shaaban #40 lec # 8 Fall 2015 11-10-2015
MINs Example: Baseline Network Fig 2.25 page 93 Kai Hwang ref. See handout CMPE655 - Shaaban #41 lec # 8 Fall 2015 11-10-2015
MINs Example: Butterfly Network Constructed by connecting 2x2 switches doubling the connection distance at each stage Can be viewed as a tree with multiple roots 2 x 2 switch 4 0 3 Distance Doubles 0 0 1 0 1 0 1 1 1 2 1 0 Building block Example: N 16 Complexity: N/2 x log2N (# of switches in each stage x # of stages) i.e O(N log N) Exactly one route from any source to any destination node. R A XOR B, at level i use ‘straight’ edge if ri 0, otherwise cross edge Bisection width N/2 Complexity O(N log2 N) Diameter log2N Number of stages 2 N Number of nodes CMPE655 - Shaaban #42 lec # 8 Fall 2015 11-10-2015
Relationship Between Butterfly Network & Hypercubes Relationship: The connection patterns in the two networks are isomorphic (identical). – Except that Butterfly always takes log2n steps. CMPE655 - Shaaban #43 lec # 8 Fall 2015 11-10-2015
MIN Network Latency Scaling Example O(log2 N) Stage N-node MIN using 2x2 switches: i.e. # of stages Max distance: log2 N (good latency scaling) Cost or Complexity O(N log2 N) Number of switches: N log N (good complexity scaling) overhead o 1 us, BW 64 MB/s, 200 ns per hop Switching/routing delay per hop Using pipelined or cut-through routing: T64(128) 1.0 us 2.0 us 6 hops * 0.2 us/hop 4.2 us T1024(128) 1.0 us 2.0 us 10 hops * 0.2 us/hop 5.0 us N 64 nodes N 1024 nodes Only 20% increase in latency for 16x network size increase Message size n 128 bytes Store and Forward h n/B Good latency scaling T64sf(128) 1.0 us 6 hops * (2.0 0.2) us/hop 14.2 us T1024sf(128) 1.0 us 10 hops * (2.0 0.2) us/hop 23 us o N 64 nodes N 1024 nodes 60% increase in latency for 16x network size increase Latency when sending n 128 bytes for N 64 and N 1024 nodes CMPE655 - Shaaban #44 lec # 8 Fall 2015 11-10-2015
Summary of Static Network Characteristics Table 2.2 page 88 Kai Hwang ref. See handout CMPE655 - Shaaban #45 lec # 8 Fall 2015 11-10-2015
Summary of Dynamic Network Characteristics Table 2.4 page 95 Kai Hwang ref. See handout CMPE655 - Shaaban #46 lec # 8 Fall 2015 11-10-2015
Example Networks: Cray MPPs Both networks used in T3D and T3E are: Point-to-point (static) using the 3D Torus topology Distributed Memory SAS T3D: Short, Wide, Synchronous (300 MB/s). – 3D bidirectional torus up to 1024 nodes, dimension order, virtual cut-through, packet switched routing. – 24 bits: 16 data, 4 control, 4 reverse direction flow control – Single 150 MHz clock (including processor). – flit phit 16 bits. – Two control bits identify flit type (idle and framing). No-info, routing tag, packet, end-of-packet. T3E: long, wide, asynchronous (500 MB/s) – 14 bits, 375 MHz – flit 5 phits 70 bits 64 bits data 6 control – Switches operate at 75 MHz. – Framed into 1-word and 8-word read/write request packets. CMPE655 - Shaaban #47 lec # 8 Fall 2015 11-10-2015
Parallel Machine Network Examples 1/f W or Phit i.e basic unit of flow-control (frame size) CMPE655 - Shaaban #48 lec # 8 Fall 2015 11-10-2015