ALGORITHM IMPROVEMENTS AND HPC 1 Basic MD algorithm 1. Initialize atoms status and the LJ potential table; set parameters controlling the simulation; O(N) 2. For all time steps do 3. Update positions of all atoms; O(N) 4. If there are atoms who have moved too much, do Update the neighbour list, including all atom pairs that are within a distance range (half distance matrix computation); O(N2)

End if 5. Make use of the neighbour list to compute forces acted on all atoms; O(N) 6. Update velocities of all atoms; O(N) 7. Update the displace list, which contains the displacements of atoms; O(N) 8. Accumulate and output target statistics of each time step; O(N) 9. End for 2 Some general ideas

MD basic algorithm has not changed the last 25 years Most major MD codes use similar approaches for optimization All trivial tricks have been already applied!! No major breakthrough except perhaps GPUs and Anton MD improvement is largely ligated to the improvement of hardware resources Can we use efficiently a present supercomputer for MD? 3 Improvement directions Floating-point calculations Precision, Custom Maths

Non-Bonded interactions PBC, Common interactions Stream processing & GPUs Integration steps Multiple time steps, Constrains, Virtual interactions sites Parallelization. Spatial decomposition Asynchronous parallelization Replica exchange, Markov state models P. Larsson, B. Hess & E. Lindahl. WIRE Comp. Mol. Sci. 2011, 1, 93-108 Floating point calculations MD limiting step is the calculation of Non-bonded interactions

Floating-point operations Random memory access Move to simple precision Problem for accumulated numerical error or for too small position displacements However, force calculations require only a relative error < 10-3 Small effect in the calculation itself in modern processors Improves use cache, SIMD register Math routines in single precision are more efficient Custom Math routines MD does not require maths to be IEEE complain!! Some internal checks can be avoided (ex. positive parameter in sqrt )

5 Non-bonded interactions Avoid O(N2) Maintain list of NB pairs within a distance cutoff Updated at larger time steps Caution, too short cutoff distance give wrong behaviour Domain decomposition (more later) PBC requires to check for the closest image Simple approach with 6 conditionals kill modern processors Cut-off sphere partitioning method

Maintains information about all 26 copies of the particle, updated with the NB list. Grid based NB pair list 6 Specialized interactions Some interactions are more abundant than others WAT-WAT dominate NB interactions list Water models SPC, TIP3P, TIP4P Rigid, only VdW in O, reduced set of charges The number of interactions to check is largely reduced

Only oxygens should be considered in the NB Lists A recent Charmm implementation uses a combined Coulomb + Lennard-Jones potential for Wat-Wat interactions 7 Streaming & GPUs GPUs have become popular in simulation field.

One order of magnitude faster that comparable CPUs Requires reformulate algorithms Usually the approach is not compatible with others SIMD approach, No random access to memory, no persistent data CPU is responsible of the main algorithm, GPU performs selected parts Update the NB list, Evaluate interactions, Update coordinates & velocities GPU could be viewed as a multicore device executing a high number of threads in parallel Threads are grouped in blocks that share memory. Memory distribution could be limitation Atom based decomposition is preferred

8 Use of GPUs 9 Integration steps Multiple time steps Numerical integrations require that the time step is kept below the fastest movement in the system Normally 1 femtosecond (109 iterations to reach biologically meaningful time scales) Most parts of the calculation do not change significatively in 1 fs.

GROMACS style: 1fs bond vibrations, 2fs NB interactions, 4fs PME (long range interactions) Tenths fs (NB list update, temperature and pressure update) Increasing the granularity of the minimal calculation (ex. in coarse-grained) will allow longer time steps 10 Integration steps. Constrains Constrains Some movements are not relevant for global flexibility Algorithms like SHAKE, RATTLE, M-SHAKE, LINCS, update bond distances by direct minimization

Constrains are very efective but limit parallelization Usually restricted to bonds with hydrogen atoms Virtual interaction sites Hydrogen atoms are replaced by virtual atoms Possible to reach 4-5 fs 11 MD Parallelization MD parallelization uses MPI Heavily limited by communication bandwith Modern multicore processors will allow to combine with openMP Simplest strategy is Atoms/Force decomposition

Atoms are distributed among processing units Bad and unpredictable load balance Requires all-to-all update of coordinates Partially improved by distribution NB list instead of atoms Scales only 8-16 processors Distribution should be dynamic: Spatial decomposition Space regions (not atoms) are distributed among processing units Assignment of atoms is dynamic, depending on their position in space

12 Spatial decomposition Model to use depends on hardware communication model Eight shell for single network (classical Linux cluster) Midpoint for full-duplex connections (Cray, BlueGene) Pulses from on processor to the next in each geometrical direction diminish communication calls for long range interactions 13 Spatial decomposition

Load balancing is the major efficiency problem Cells need not to be equal, and possibly updated dynamically 14 Asyncronous parallelization Any parallelization method become useless with less of ca. 100 atoms per processing unit Present computers can have thousands of cores but most usual simulations have around 100,000 atoms !!! A single long simulation can be replaced by a series of shorter ones.

Same statistics Sampling is improved as initial structures could be different In most cases simulations are completely independent (embarrisingly parallel) A single supercomputer is not really needed Distributed computing Grid [email protected] 15 Replica Exchange

P S S ' = min1, exp m n Em En Markov State Model Kinetic clustering of states Trajectory defines different ensembles separated by energy barriers Markov models used to predict transitions Trained with a large number of short simulations 17 Conclusions

Major MD codes uses a combination of optimization methods Current routine tends to be GPU based calculations, but combinations MPI + openMP are being considered. GPU + MPI + approaches are being considered but are highly dependent on hardware evolution. No parallelization schema will survive next generation of supercomputers 18 MD & HPC future? An ideal approach would for instance be to run a simulation of an ion channel system over 1000 cores in parallel, employing,

e.g., 1000 independent such simulations for kinetic clustering, and finally perhaps use this to screen thousands of ligands or mutations in parallel (each setup using 1,000,000 cores) P. Larsson, B. Hess & E. Lindahl. WIRE Comp. Mol. Sci. 2011, 1, 93-108 19