Locality-Aware PMI Usage for Efficient MPI Startup

Kenneth Raffenetti [email protected]*, Neelima Bayyapu [email protected]*, Dimitry Durnov [email protected]#, Masamichi Takagi [email protected]~, Pavan Balaji [email protected]*

* Mathematics and Computer Science Division, Argonne National Laboratory
# Intel Corporation
~ RIKEN Center for Computational Science

Agenda
  • Introduction
  • Background
      - Process Manager
      - Process Management Interface
      - Motivation
  • Related Work
  • Methodology
      - Simple Address Exchange
      - Shared-Memory Optimization for Address Exchange
      - MPI Collective Optimization for Address Exchange
  • Evaluation Results and Analysis
  • Conclusions

Kenneth Raffenetti, ICCC 2018, Chengdu, 12/08/2018

Introduction
  • MPI is the de facto standard programming model for distributed-memory systems
  • MPI needs to be initialized by the application
  • The initialization process needs to be more efficient
  • Initialization tasks include
      - Gathering information about the parallel job
      - Setting up internal library state
      - Preparing resources
  • For performance, external information is exchanged up front during MPI_Init rather than during subsequent communication calls
  • Processes use the Process Management Interface (PMI) for fabric address exchange
  • PMI usage needs to be improved to reduce the initialization time of large-scale jobs

Background
  • Process Manager
      - Handles the start and stop of processes
      - Acts as a central coordination point for parallel processes
  • Process Management Interface
      - First introduced in MPICH (an MPI implementation)
      - Decouples the process management functionality from the underlying process
      - Provides a key-value store
  • Motivation
      - Increased node and core counts
      - Need for quick and efficient coordination across ranks

[Figure: PMI key-value store accessed through Put and Get by Rank #0 to Rank #3 on Node #0 and Node #1]

Related Work
  • Defining an API standard
      - PMI features and capabilities in MPICH [1]
  • Optimizations to the PMI functionality
      - PMI data exchange over the HPC fabric [5]
      - Nonblocking APIs [6]
      - PMI proxy communication over shared memory [7]
  • Scalability extensions for PMI
      - PMI for Exascale [4]

Methodology (1/3): Simple Address Exchange

  • Each process writes its address, or business card, into the KVS
  • After a barrier, each process retrieves the business cards of all other processes

    /* All ranks perform the following (simplified pseudocode) */
    PMI_KVS_Put(rank, myaddr);
    PMI_KVS_Barrier();
    for (i = 0; i < size; i++)
        PMI_KVS_Get(i, &addrs[i]);

  • O(P^2) algorithm
  • At scale, the cost is noticeable

[Figure: Simple Address Exchange Performance. x-axis: nodes, 1 to 256 (ppn=64); y-axis: seconds, 0 to 250]

Methodology (2/3): Shared-Memory Optimization for Address Exchange
  • Redundant work on each node is removed by using shared memory
  • Within MPI, processes learn about on-node and off-node processes
  • The overheads of shared-memory communication and the size of the address data need to be addressed
  • The global maximum across the nodes is used for the address data
  • The amount of data fetched is reduced from O(P^2) to O(N*P)

[Figure: PMI key-value store diagram]

Methodology (3/3): MPI Collective Optimization for Address Exchange
  • Passing the address data through the PMI database needs to be optimized
  • Locality-aware MPI collective communication (MPI_Allgather) is used
  • Node roots are used to reduce the amount of traffic
  • Peers communicate directly

[Figure: PMI key-value store diagram]
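To make the O(P^2) versus O(N*P) comparison concrete, the following minimal C sketch counts key-value store gets for the simple exchange and for a node-shared exchange. It is a counting model only and makes no PMI calls. The function names and the block partitioning of keys among the ranks of a node are assumptions for illustration; the slides do not say how the per-node work is divided.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simple exchange: each of the P ranks issues one get per rank in the job. */
uint64_t simple_exchange_gets(uint64_t nodes, uint64_t ppn)
{
    uint64_t nprocs = nodes * ppn;
    return nprocs * nprocs;
}

/*
 * Node-shared exchange, simulated for a single node (hypothetical scheme):
 * the ppn local ranks split the nprocs keys into contiguous blocks, and each
 * rank fetches only its block into a table shared by the whole node.
 * Returns the number of gets issued by the node.
 */
uint64_t node_shared_gets(uint64_t nprocs, uint64_t ppn)
{
    unsigned char *filled;
    uint64_t local, key, gets = 0;

    assert(nprocs > 0 && ppn > 0);
    filled = calloc(nprocs, 1);
    if (filled == NULL)
        abort();

    for (local = 0; local < ppn; local++) {
        uint64_t begin = local * nprocs / ppn;
        uint64_t end = (local + 1) * nprocs / ppn;
        for (key = begin; key < end; key++) {
            assert(!filled[key]);   /* no entry is fetched twice on a node */
            filled[key] = 1;
            gets++;
        }
    }
    for (key = 0; key < nprocs; key++)
        assert(filled[key]);        /* the shared table ends up complete */

    free(filled);
    return gets;
}

/* Whole job: every node performs the node-shared fetch once. */
uint64_t shared_exchange_gets(uint64_t nodes, uint64_t ppn)
{
    return nodes * node_shared_gets(nodes * ppn, ppn);
}
```

For the 256-node, ppn=64 configuration used in the evaluation, `simple_exchange_gets(256, 64)` returns 268,435,456 and `shared_exchange_gets(256, 64)` returns 4,194,304, a reduction by a factor of ppn = 64.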

The result is a significant reduction in cost.

Evaluation Results and Analysis (1/4): Experimental Setup
  • Theta supercomputer
      - 11.69-petaflop system
      - Based on Intel Xeon Phi 7230 processors coupled with a Cray Aries interconnect in a Dragonfly topology
      - Equipped with 4,392 nodes, each with 64 cores
  • Bebop supercomputer
      - 1024 nodes
      - 64 cores (Intel Knights Landing) per compute node, with Intel Omni-Path fabric
  • Experiments
      - Performance evaluation of address exchange on Bebop
      - Performance evaluation of address exchange on Theta

Evaluation Results and Analysis (2/4): Address Exchange by Node Count on Bebop
  • The original address exchange takes on the order of minutes at scale
  • At 256 nodes (ppn=64), the node-roots method takes less than 2 seconds

[Figure: time in seconds (y-axis, 0 to 10) by node count (x-axis, ppn=64); legend: allgather, bc exchange, bc max, shm setup; off-scale data labels of approximately 13.28, 26.45, 52.16, 105.33, and 211.25 seconds]

Evaluation Results and Analysis (3/4): Address Exchange by Process Count on Bebop
  • At ppn <= 2, shared memory adds a little overhead
  • At ppn > 2, the proposed optimizations outperform the traditional method

[Figure: time in seconds (y-axis, 0 to 10) by processes per node (x-axis, 256 nodes); legend: allgather, bc exchange, bc max, shm; off-scale data labels of approximately 12.45, 50.91, and 211.25 seconds]
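The off-scale data labels on the Bebop node-count chart can be checked for a scaling trend with a few lines of C. The sketch below computes the ratio between successive labels. The values are the labels as they appear in the slide text, rounded to two decimals; that they belong to the original exchange at successively doubled node counts is an assumption, since the extracted text does not tie each label to a node count. The names `successive_ratios` and `BEBOP_LABELS_SEC` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Off-scale labels from the Bebop node-count chart, in seconds (rounded). */
const double BEBOP_LABELS_SEC[5] = {13.28, 26.45, 52.16, 105.33, 211.25};

/*
 * Writes t[i + 1] / t[i] into ratios[i] for each adjacent pair and returns
 * the number of ratios written (n - 1, or 0 when n < 2).
 */
size_t successive_ratios(const double *t, size_t n, double *ratios)
{
    size_t i;

    if (n < 2)
        return 0;
    for (i = 0; i + 1 < n; i++) {
        assert(t[i] > 0.0);         /* timings must be positive */
        ratios[i] = t[i + 1] / t[i];
    }
    return n - 1;
}
```

Every ratio comes out close to 2. If the labels do correspond to doubled node counts, the time of the original exchange grows roughly linearly with the number of nodes at a fixed ppn.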

Evaluation Results and Analysis (4/4): Address Exchange by Node Count on Theta

[Figure: time in seconds (y-axis, 0 to 25) by node count (x-axis); legend: shm setup phase, max phase, PMI bc exchange phase, allgather phase; off-scale data labels of approximately 102.99 and 421.85 seconds]

Conclusions
  • Efficient startup becomes important as HPC systems grow
  • This work looked at the most expensive part of MPI initialization
      - Address exchange using the Process Management Interface (PMI)
  • Address exchange performance is improved with locality information
      - By using shared memory, redundant work is eliminated
      - By using MPI collective communications, we enabled use of the high-speed fabric

References

1. P. Balaji, D. Buntinas, D. Goodell, W. D. Gropp, J. Krishna, E. L. Lusk, and R. Thakur, "PMI: A scalable parallel process-management interface for extreme-scale systems," in 17th EuroMPI Conference, Lecture Notes in Computer Science, Springer, 11/2009.
2. MPICH, https://www.mpich.org/, 2018.
3. Top500, https://www.top500.org/, 2018.
4. R. H. Castain, D. Solt, J. Hursey, and A. Bouteiller, "PMIx: Process management for exascale environments," in Proceedings of the 24th European MPI Users' Group Meeting, ser. EuroMPI '17, 2017, pp. 14:1-14:10.
5. S. Chakraborty, H. Subramoni, J. Perkins, A. Moody, M. Arnold, and D. Panda, "PMI extensions for scalable MPI startup," in 21st European MPI Users' Group Meeting, EuroMPI/ASIA '14, Kyoto, Japan, September 09-12, 2014.
6. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. Panda, "Non-blocking PMI extensions for fast MPI startup," in Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on, May 2015.
7. S. Chakraborty, H. Subramoni, J. L. Perkins, and D. K. Panda, "SHMEMPMI: Shared memory based PMI for improved performance and scalability," 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 60-69, 2016.

Thank you!

Please email any questions to the authors: [email protected] or [email protected] or [email protected] or [email protected] or [email protected]
