Comparing The Performance Of Distributed Shared Memory And Message Passing Programs Using The Hyperion Java Virtual Machine On Clusters. Mathew Reno. M.S. Thesis Defense.

Overview For this thesis we wanted to evaluate the performance of the Hyperion distributed virtual machine, designed at UNH, against a preexisting parallel computing API. The results would indicate where Hyperion's strengths and weaknesses lie and possibly validate Hyperion as a high-performance computing alternative.

What Is A Cluster? A cluster is a group of low-cost computers connected with an "off the shelf" network. The cluster's network is isolated from WAN data traffic, and the computers in the cluster are presented to the user as a single resource.

Why Use Clusters? Clusters are cost effective when compared to traditional parallel systems. Clusters can be grown as needed. Software components are based on standards allowing portable software to be designed for the cluster.

Cluster Computing Cluster computing distributes the computational workload among the nodes of the cluster, thereby reducing total computation time. There are many programming models for distributing data throughout the cluster.

Distributed Shared Memory Distributed Shared Memory (DSM) allows the user to view the whole cluster as one resource. Memory is shared among the nodes. Each node has access to every other node's memory as if it were its own. Data coordination among nodes is generally hidden from the user.

Message Passing Message Passing (MP) requires explicit messages to be employed to distribute data throughout the cluster. The programmer must coordinate all data exchanges when designing the application through a language-level MP API.

Related: TreadMarks Vs. PVM TreadMarks (Rice, 1995) implements a DSM model while PVM implements an MP model. The two approaches were compared with benchmarks. On average, PVM was found to perform two times better than TreadMarks. TreadMarks suffered from the excessive messages required by the request/response communication model DSM employs. TreadMarks was found to be more natural to program with, saving development time.

Hyperion Hyperion is a distributed Java Virtual Machine (JVM), designed at UNH. The Java language provides parallelism through its threading model. Hyperion extends this model by distributing the threads among the cluster. Hyperion implements the DSM model via DSM-PM2, which allows for lightweight thread creation and data distribution.

Hyperion, Continued Hyperion has a fixed memory size that it shares with all threads executing across the cluster. Hyperion uses page-based data distribution: if a thread accesses memory it does not have locally, a page fault occurs and the memory is transmitted from the node that owns it to the requesting node, one page at a time.

Hyperion, Continued Hyperion translates Java bytecode into C code, and a native executable is then generated by a native C compiler. The belief is that native executables are optimized by the C compiler and will benefit the application by executing faster than interpreted code.

Hyperion's Threads Threads are created in a round-robin fashion among the nodes of the cluster. Data is transmitted between threads via a request/response mechanism. This approach requires two messages: in order to respond to a request message, a response thread must be scheduled. This thread handles the request by sending back the requested data in a response message.

mpiJava mpiJava is a Java wrapper for the Message Passing Interface (MPI). The Java Native Interface (JNI) is used to translate between Java and native code. We used MPICH for the native MPI implementation.
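To make the mpiJava layer concrete, here is a minimal sketch of a program written against the mpiJava bindings (assuming the mpiJava 1.2-style API with capitalized method names; the class name is illustrative, not part of the benchmarks). MPI.Init and MPI.Finalize bracket the run, and each process learns its rank and the job size from MPI.COMM_WORLD:

    import mpi.*;

    public class HelloMpiJava {
        public static void main(String[] args) throws Exception {
            // Initialize the native MPI library (MPICH here) through the JNI wrapper.
            MPI.Init(args);

            int rank = MPI.COMM_WORLD.Rank();   // this process's id
            int size = MPI.COMM_WORLD.Size();   // number of processes in the job

            System.out.println("Process " + rank + " of " + size + " is up");

            // Shut down MPI before the JVM exits.
            MPI.Finalize();
        }
    }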

Clusters The "Star" cluster (UNH) consists of 16 PIII 667MHz Linux PCs on a 100Mb Fast Ethernet network; TCP is the communication protocol. The "Paraski" cluster (France) consists of 16 PIII 1GHz Linux PCs on a 2Gb Myrinet network; BIP (DSM) and GM (MPI) are the communication protocols.

Clusters, Continued The implementation of MPICH on BIP was not stable in time for this thesis. GM had to be used in place of BIP for MPICH. GM has not been ported to Hyperion and a port would be unreasonable at this time. BIP performs better than GM as the message size increases.

BIP vs. GM Latency (Paraski) [Figure: roundtrip latency (μs, 0 to 600) vs. message size (KB, 0 to 8) for MPI (GM) and PM2 (BIP).]

DSM & MPI In Hyperion For consistency, mpiJava was ported into Hyperion. Both DSM and MPI versions of the benchmarks could be compiled by Hyperion. The executables produced by Hyperion are then executed by the respective native launchers (PM2 and MPICH).

Benchmarks The Java Grande Forum (JGF) developed a suite of benchmarks to test Java implementations. We used two of the JGF benchmark suites, multithreaded and javaMPI.

Benchmarks, Continued Benchmarks used: Fourier coefficient analysis, lower/upper matrix factorization, successive over-relaxation, IDEA encryption, sparse matrix multiplication, molecular dynamics simulation, 3D ray tracer, and Monte Carlo simulation (MPI only).

Benchmarks And Hyperion The multithreaded JGF benchmarks had unacceptable performance when run "out of the box". Each benchmark creates all of its data objects on the root node, causing all remote object access to occur through this one node. This access pattern creates a performance bottleneck on the root node, which must service all the requests while computing its own part of the algorithm. The solution was to modify the benchmarks to be cluster aware.

Hyperion Extensions Hyperion makes up for Java's limited thread data management by providing efficient reduce and broadcast mechanisms. Hyperion also provides a cluster-aware implementation of arraycopy.

Hyperion Extension: Reduce Reduce blocks all enrolled threads until each thread has the final result of the reduction. Neighbor threads exchange and combine their data, then their neighbors do, and so on until every thread holds the same answer. This operation is faster and scales better than performing the calculation serially: the number of communication steps grows logarithmically with the number of threads (O(log P)).
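Hyperion's actual reduce API is not shown in these slides; the sketch below only illustrates the pairwise, logarithmic combining pattern described above, as a recursive-doubling sum over an array of per-thread partial sums (assuming a power-of-two thread count; all names are illustrative, not Hyperion's interface):

    public class TreeReduceSketch {
        // Combine per-thread partial sums in O(log p) rounds. After the loop,
        // every slot holds the same total, mirroring the behavior where all
        // enrolled threads end up with the final result.
        static double[] allReduceSum(double[] partial) {
            int p = partial.length;              // assumed to be a power of two
            double[] value = partial.clone();
            for (int step = 1; step < p; step <<= 1) {
                double[] next = value.clone();
                for (int i = 0; i < p; i++) {
                    int partner = i ^ step;      // partner thread in this round
                    next[i] = value[i] + value[partner];
                }
                value = next;
            }
            return value;
        }

        public static void main(String[] args) {
            double[] partial = {1.0, 2.0, 3.0, 4.0};
            double[] result = allReduceSum(partial);
            System.out.println("Every thread sees: " + result[0]);   // 10.0
        }
    }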

Hyperion Extension: Broadcast The broadcast mechanism transmits the same data to all enrolled threads. Like reduce, data is distributed to the threads in O(log P) steps, which scales better than serial distribution of data.

Hyperion Extension: Arraycopy The arraycopy method is part of the Java System class. The Hyperion version was extended to be cluster aware. If data is copied between threads on different nodes, this version sends all of the data as one message instead of relying on the paging mechanism to access remote array data.
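The call itself has the standard Java signature, System.arraycopy(src, srcPos, dest, destPos, length); under Hyperion the source and destination arrays may live on different nodes. A minimal, self-contained sketch of copying one thread's block into a result array (the array names, sizes, and offset are illustrative):

    public class ArraycopySketch {
        public static void main(String[] args) {
            int blockSize = 4;
            int myOffset = 4;                            // where this thread's block goes
            double[] result = new double[8];             // result array owned by the root thread
            double[] localBlock = {1.0, 2.0, 3.0, 4.0};  // computed locally by this thread

            // Standard signature: arraycopy(src, srcPos, dest, destPos, length).
            // Under Hyperion's cluster-aware version, a cross-node copy like this
            // is sent as one message instead of faulting in remote pages.
            System.arraycopy(localBlock, 0, result, myOffset, blockSize);

            System.out.println(java.util.Arrays.toString(result));
        }
    }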

Benchmark Modifications The multithreaded benchmarks had unacceptable performance, so they were modified to reduce remote object access and root-node bottlenecks. Techniques such as arraycopy, broadcast, and reduce were employed to improve performance.

Experiment Each benchmark was executed 50 times at each node size to provide a sample mean. Node sizes were 1, 2, 4, 8, and 16. Confidence intervals (95% level) were used to determine which version, MPI or DSM, performed better.
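The slides do not show the interval formula; as a reminder, here is a sketch of a 95% confidence interval for a sample mean using the normal approximation (z = 1.96), which is reasonable for 50 runs. Whether the thesis used the normal or the t distribution is not stated here, and the run times below are toy values:

    public class ConfidenceInterval {
        // 95% confidence interval for the mean of a sample of run times.
        static double[] ci95(double[] runs) {
            int n = runs.length;
            double sum = 0.0;
            for (double r : runs) sum += r;
            double mean = sum / n;

            double sq = 0.0;
            for (double r : runs) sq += (r - mean) * (r - mean);
            double stdErr = Math.sqrt(sq / (n - 1)) / Math.sqrt(n);

            double half = 1.96 * stdErr;                 // normal approximation
            return new double[] { mean - half, mean + half };
        }

        public static void main(String[] args) {
            double[] runs = { 10.1, 9.8, 10.3, 10.0, 9.9 };   // toy data
            double[] ci = ci95(runs);
            System.out.printf("mean CI: [%.3f, %.3f]%n", ci[0], ci[1]);
        }
    }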

Results On The Star Cluster [Figure: performance ratio (DSM/MPI, 0 to 4) vs. number of nodes (1, 2, 4, 8, 16) for Crypt, LUFact, MolDyn, Series, SOR, Sparse, and RayTracer.]

Results On The Paraski Cluster [Figure: performance ratio (DSM/MPI, 0 to 1.8) vs. number of nodes (1, 2, 4, 8, 16) for Crypt, LUFact, MolDyn, Series, SOR, Sparse, and RayTracer.]

Fourier Coefficient Analysis Calculates the first 10,000 pairs of Fourier coefficients. Each node is responsible for calculating its portion of the coefficient array. Each node sends back its array portion to the root node, which accumulates the final array.

Fourier: DSM Modifications The original multithreaded version required all threads to update arrays located on the root node, flooding the root node with requests. The modified version used arraycopy to copy each thread's local array portion back into the root thread's arrays.

Fourier: mpiJava The mpiJava version is similar to the DSM version. Each process is responsible for its portion of the arrays. MPI_Ssend and MPI_Recv were called to distribute the array portions to the root process.
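A hedged sketch of this gather pattern in mpiJava terms (assuming the mpiJava 1.2-style bindings; the chunk layout and message tag are illustrative, not the benchmark's exact code):

    import mpi.*;

    public class FourierGatherSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();

            int n = 10000;                    // total coefficient pairs
            int chunk = n / size;             // assume size divides n evenly
            double[] coeffs = new double[n];  // full array, assembled on the root

            // ... each process fills coeffs[rank*chunk .. rank*chunk + chunk - 1] ...

            int TAG = 1;
            if (rank != 0) {
                // Workers send their slice back to the root.
                MPI.COMM_WORLD.Ssend(coeffs, rank * chunk, chunk, MPI.DOUBLE, 0, TAG);
            } else {
                // Root receives each worker's slice into the right offset.
                for (int src = 1; src < size; src++) {
                    MPI.COMM_WORLD.Recv(coeffs, src * chunk, chunk, MPI.DOUBLE, src, TAG);
                }
            }
            MPI.Finalize();
        }
    }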

Fourier: Results [Figure: runtime in seconds vs. number of nodes (1, 2, 4, 8, 16) on the Star and Paraski clusters for the MPI and DSM versions.]

Fourier: Conclusions Most of the time in this benchmark is spent in the computation; network communication does not play a significant role in the overall time. MPI and DSM perform similarly on each cluster, scaling well as more nodes are added.

Lower/Upper Factorization Solves a 500 x 500 linear system with LU factorization followed by a triangular solve. The factorization is parallelized while the triangular solve is computed serially.

LU: DSM Modifications The original version created the matrix on the root thread and all access went through this thread, causing performance bottlenecks. The benchmark was modified to use Hyperion's broadcast facility to distribute the pivot information, and arraycopy was used to assemble the final data for the solve.

LU: mpiJava MPI_Bcast is used to distribute the pivot information. MPI_Send and MPI_Recv are used so the root process can acquire the final matrix.
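A sketch of the pivot broadcast in mpiJava terms (again assuming the mpiJava 1.2-style bindings; the owner-computes layout and buffer sizes are illustrative):

    import mpi.*;

    public class LuBroadcastSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();

            // Pivot row index and the pivot row itself for the current step;
            // the 500-column size matches the benchmark, the rest is illustrative.
            int n = 500;
            int[] pivotIndex = new int[1];
            double[] pivotRow = new double[n];

            int owner = 0;                     // process that owns the pivot row this step
            if (rank == owner) {
                // ... select the pivot and fill pivotIndex / pivotRow ...
            }

            // Everyone receives the pivot information before eliminating its own rows.
            MPI.COMM_WORLD.Bcast(pivotIndex, 0, 1, MPI.INT, owner);
            MPI.COMM_WORLD.Bcast(pivotRow, 0, n, MPI.DOUBLE, owner);

            MPI.Finalize();
        }
    }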

LU: Results [Figure: runtime in seconds vs. number of nodes (1, 2, 4, 8, 16) on the Star and Paraski clusters for the MPI and DSM versions.]

LU: Conclusions While the DSM version uses a similar data distribution mechanism to the MPI version, significant overhead is exposed when executing these methods in large loops. This overhead is minimized on the Paraski cluster due to the nature of Myrinet and BIP.

Successive Over-Relaxation Performs 100 iterations of SOR on a 1000 x 1000 grid. A "red-black" ordering mechanism allows array rows to be distributed to nodes in blocks. After the initial data distribution, only neighbor rows need to be communicated during the SOR.

SOR: DSM Modifications Excessive remote thread object access made it necessary to modify the benchmark. The modified version uses arraycopy to update neighbor rows during the SOR. When the SOR completes, arraycopy is used to assemble the final matrix on the root thread.

SOR: mpiJava MPI_Sendrecv is used to exchange neighbor rows. MPI_Ssend and MPI_Recv are used to build the final matrix on the root process.
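A sketch of the neighbor-row exchange in mpiJava terms (assuming the mpiJava 1.2-style bindings with MPI.PROC_NULL at the boundaries; the ghost-row layout and row counts are illustrative):

    import mpi.*;

    public class SorHaloExchangeSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();

            int cols = 1000;
            int localRows = 250;                      // owned rows per process (illustrative)
            // Local block plus one ghost row above and one below.
            double[][] grid = new double[localRows + 2][cols];
            int top = 1, bottom = localRows;          // first and last owned rows

            int up   = (rank == 0)        ? MPI.PROC_NULL : rank - 1;
            int down = (rank == size - 1) ? MPI.PROC_NULL : rank + 1;

            // Send my top owned row up and receive the lower neighbor's top row
            // into my bottom ghost row, then do the symmetric exchange downward.
            MPI.COMM_WORLD.Sendrecv(grid[top], 0, cols, MPI.DOUBLE, up, 0,
                                    grid[bottom + 1], 0, cols, MPI.DOUBLE, down, 0);
            MPI.COMM_WORLD.Sendrecv(grid[bottom], 0, cols, MPI.DOUBLE, down, 1,
                                    grid[top - 1], 0, cols, MPI.DOUBLE, up, 1);

            MPI.Finalize();
        }
    }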

SOR: Results [Figure: runtime in seconds vs. number of nodes (1, 2, 4, 8, 16) on the Star and Paraski clusters for the MPI and DSM versions.]

SOR: Conclusions The DSM version requires an extra barrier after neighbor rows are exchanged due to the "network reactivity" problem: a thread must be able to service all requests in a timely fashion, but if it is busy computing, it cannot react quickly enough to schedule the thread that services the request. The barrier blocks all threads until each reaches it, which guarantees that all nodes have their requested data and it is safe to continue with the computation.

IDEA Crypt Performs IDEA encryption and decryption on a 3,000,000 byte array. The array is divided among nodes in a block manner. Each node encrypts and decrypts its portion. When complete, the root node collects the decrypted array for validation.

Crypt: DSM Modifications The original version created the whole array on the root thread and required each remote thread to page in its portion. The modified version used arraycopy to distribute each thread's portion from the root thread. When decryption finishes, arraycopy copies the decrypted portion back to the root thread.

Crypt: mpiJava The mpiJava version uses MPI_Ssend to send the array portions to the remote processes and MPI_Recv to receive them. When complete, MPI_Ssend is used to send back each process's portion and MPI_Recv receives each portion on the root.

Crypt: Results [Figure: runtime in seconds vs. number of nodes (1, 2, 4, 8, 16) on the Star and Paraski clusters for the MPI and DSM versions.]

Crypt: Conclusions Results are similar on both clusters. There is a slight performance problem with 4 and 8 nodes in the DSM version. This can be attributed to a barrier in the DSM version that causes all threads to block before computing, while the MPI version does not block.

Sparse Matrix Multiplication A 50,000 x 50,000 unstructured matrix stored in compressed-row format is multiplied over 200 iterations. Only the final result is communicated as each node has its own portion of data and initial distribution is not timed.

Sparse: DSM Modifications This benchmark originally produced excessive network traffic through remote object access. The modifications involved removing the object access during the multiplication loop and using arraycopy to distribute the final result to the root thread.

Sparse: mpiJava This benchmark's only communication is an MPI_Allreduce, which performs an array reduction, leaving the result on all nodes. This is employed to obtain the final result of the multiplication.
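A sketch of that single collective in mpiJava terms (assuming the mpiJava 1.2-style bindings; the vector layout is illustrative):

    import mpi.*;

    public class SparseAllreduceSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);

            int n = 50000;
            double[] localY = new double[n];   // this process's partial result vector
            double[] y      = new double[n];   // full result, identical on every process

            // ... local sparse matrix-vector products fill localY ...

            // Element-wise sum across all processes; every process gets the full vector.
            MPI.COMM_WORLD.Allreduce(localY, 0, y, 0, n, MPI.DOUBLE, MPI.SUM);

            MPI.Finalize();
        }
    }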

Sparse: Results [Figure: runtime in seconds vs. number of nodes (1, 2, 4, 8, 16) on the Star and Paraski clusters for the MPI and DSM versions.]

Sparse: Conclusions The results are similar on both clusters, with the DSM version performing better on both. The MPI version uses the MPI_Allreduce method that places the results on all nodes. This method adds extra overhead that is not present in the DSM version, where the results are simply collected on the root node.

Molecular Dynamics This benchmark is a 2048-body code that models particles interacting under a Lennard-Jones potential in a cubic spatial volume with periodic boundary conditions. Parallelization is provided by dividing the range of iterations over the particles among nodes.

MolDyn: DSM Modifications A significant amount of remote thread object access necessitated modifications. Remote threads use arraycopy to send their particle forces to the root thread; the root thread updates the forces serially and sends the new force array back to the remote threads via arraycopy.

MolDyn: mpiJava This version uses six MPI_Allreduce calls to update various force and movement arrays. This occurs at every time step.

MolDyn: Results [Figure: runtime in seconds vs. number of nodes (1, 2, 4, 8, 16) on the Star and Paraski clusters for the MPI and DSM versions.]

MolDyn: Conclusions The DSM version must update particle forces serially on the root thread. This causes all threads to block while sending their local forces to the root thread and waiting for the updated forces to be sent back. The MPI version uses MPI_Allreduce to compute the forces across all processes in parallel, so each node blocks only for its neighbors' force updates.

Ray Tracer A scene with 64 spheres is rendered at 150 x 150 pixels. Parallelization is provided by a cyclic distribution for load balancing when looping over the rows of pixels.

Ray Tracer: DSM Modifications This benchmark was poorly designed in that it created far too many temporary objects. The DSM version was heavily modified to eliminate temporary object creation. Broadcast is used to transmit the array row reference to each thread. Once rendering is complete, arraycopy is used to copy the row data to the root thread. Reduce is used to compute the pixel checksum for validation purposes.

Ray Tracer: mpiJava The mpiJava version was also modified to remove temporary object creation. MPI_Reduce is used to generate the pixel checksum. MPI_Send and MPI_Recv are used to transmit the row data to the root process.
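A sketch of the checksum reduction in mpiJava terms (assuming the mpiJava 1.2-style bindings; the checksum value is a placeholder):

    import mpi.*;

    public class RayTracerChecksumSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();

            // Checksum over the pixels this process rendered (value is illustrative).
            long[] localSum = { 12345L };
            long[] totalSum = new long[1];

            // The root ends up with the global checksum used for validation.
            MPI.COMM_WORLD.Reduce(localSum, 0, totalSum, 0, 1, MPI.LONG, MPI.SUM, 0);
            if (rank == 0) {
                System.out.println("pixel checksum = " + totalSum[0]);
            }

            MPI.Finalize();
        }
    }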

Ray Tracer: Results [Figure: runtime in seconds vs. number of nodes (1, 2, 4, 8, 16) on the Star and Paraski clusters for the MPI and DSM versions.]

Ray Tracer: Conclusions The results on both clusters are almost identical. Very little network communication is required; most of the time is spent rendering the scene.

Monte Carlo Uses Monte Carlo techniques to price products derived from the worth of an underlying asset. Generates 2,000 samples. Parallelization is provided by dividing the work in the principal loop of the Monte Carlo runs in block fashion and distributing the blocks to remote nodes.

Monte Carlo: Lack Of DSM This benchmark required an unacceptable amount of memory for Hyperion to handle. Because it is embarrassingly parallel and other embarrassingly parallel benchmarks were already covered, we felt it was unnecessary to develop a working DSM version from this large body of code.

Monte Carlo: mpiJava The mpiJava version provided some insight into the mpiJava-to-Hyperion port, specifically in the object serialization portion. Monte Carlo relies heavily on object serialization, as Java objects are distributed via send and receive calls instead of primitive types. By creating a working version of Monte Carlo, we were able to eliminate many subtle bugs in the mpiJava-to-Hyperion port.

Monte Carlo: Results [Figure: runtime in seconds vs. number of nodes (1, 2, 4, 8, 16) on the Star and Paraski clusters for the MPI version only.]

Monte Carlo: Conclusions The MPI version scales well on both clusters. Without a DSM version, a comparison is not possible.

Conclusions The Hyperion system can compete with traditional parallel programming models. However, to compete, a Hyperion user cannot simply write multi-threaded Java code and expect it to perform well on a cluster. Users must be aware of how thread creation works in Hyperion and the effects of remote object access in order to tune their applications.

Conclusions, Continued With an application developed using Hyperion's thread management and data exchange techniques (reduce, broadcast, arraycopy), a Hyperion user can achieve competitive performance. We feel that methods for performing operations on groups of threads, such as the above, should be part of the Java threading API, as they could be useful even outside a parallel environment.
