CoMD is a reference molecular dynamics simulation code as used in materials science.
The reference problem is solid Copper starting from a face-centered cubic (FCC) lattice. The initial thermodynamic conditions (temperature, and volume via the lattice spacing, lat) can be specified on the command line. The default is 600 K and standard volume (lat = 3.615 Angstroms). Different temperatures (e.g. T = 3000 K) and volumes can be specified to melt the system and enhance the interchange of atoms between domains.
The dynamics is micro-canonical (NVE = constant Number of atoms, constant total system Volume, and constant total system Energy). As a result, the temperature is not fixed. Rather, the temperature will adjust from the initial temperature (as specified on the command line) to a final temperature as the total system kinetic energy comes into equilibrium with the total system potential energy.
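To make the link between kinetic energy and temperature concrete, here is a minimal sketch of the standard equipartition relation KE = (3/2) N kB T. It assumes energies in eV and three degrees of freedom per atom; the function name kineticTemperature is hypothetical and is not taken from the CoMD source.

```c
/* Boltzmann constant in eV/K (assumes energies are expressed in eV). */
#define KB_EV_PER_K 8.617333262e-5

/* Instantaneous temperature in K from the total kinetic energy (eV)
 * of nAtoms atoms, using KE = (3/2) * N * kB * T.
 * Hypothetical helper, not part of CoMD. */
double kineticTemperature(double kineticEnergy, long nAtoms)
{
    return 2.0 * kineticEnergy / (3.0 * (double)nAtoms * KB_EV_PER_K);
}
```

In an NVE run, evaluating a relation like this at each output step shows the temperature drifting away from its initial value as kinetic and potential energy exchange.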
The total size of the problem (number of atoms) is specified by the number (nx, ny, nz) of FCC unit cells in the x, y, z directions: nAtoms = 4 * nx * ny * nz. The default size is nx = ny = nz = 20 or 32,000 atoms.
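The size formula above can be written as a short sketch. The helper names fccAtomCount and boxLength are hypothetical and exist only for illustration; the box-length helper assumes the box edge is simply the number of unit cells times the lattice spacing.

```c
/* Number of atoms in an FCC lattice of nx * ny * nz unit cells
 * (4 atoms per FCC unit cell). Hypothetical helper, not part of CoMD. */
long fccAtomCount(int nx, int ny, int nz)
{
    return 4L * nx * ny * nz;
}

/* Edge length of the simulation box along one axis, in the same
 * units as the lattice spacing lat (assumed: cells * lat). */
double boxLength(int nCells, double lat)
{
    return nCells * lat;
}
```

With the defaults, fccAtomCount(20, 20, 20) gives 32,000 atoms, and boxLength(20, 3.615) gives a box edge of 72.3 Angstroms.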
The simulation models bulk copper by replicating itself in every direction using periodic boundary conditions.
Two interatomic force models are available: the Lennard-Jones (LJ) two-body potential (ljForce.c) and the many-body Embedded-Atom Model (EAM) potential (eam.c). The LJ potential is included for comparison and is a valid approximation for constant volume and uniform density. The EAM potential is a more accurate model of cohesion in simple metals like Copper and includes the energetics necessary to model non-uniform density and free surfaces.
CoMD implements a simple geometric domain decomposition to divide the total problem space into domains, which are owned by MPI ranks. Each domain is a single-program, multiple-data (SPMD) partition of the larger problem.
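The idea of a geometric decomposition can be sketched as a mapping from an atom position to the rank that owns it. The sketch assumes a box starting at the origin, split into equal slabs along each axis, and an x-fastest rank ordering; these choices and the function names are assumptions for illustration, not CoMD's actual layout.

```c
/* Index of the domain owning coordinate x along one axis, for a box
 * [0, boxLength) divided into nProc equal slabs. Out-of-range values
 * are clamped to the nearest valid domain. */
int domainIndex(double x, double boxLength, int nProc)
{
    int i = (int)(x / boxLength * nProc);
    if (i < 0)
        i = 0;
    if (i >= nProc)
        i = nProc - 1;
    return i;
}

/* Linear rank of the domain owning position pos, assuming x varies
 * fastest in the rank numbering (an assumption, not CoMD's scheme). */
int ownerRank(const double pos[3], const double box[3], const int procGrid[3])
{
    int ix = domainIndex(pos[0], box[0], procGrid[0]);
    int iy = domainIndex(pos[1], box[1], procGrid[1]);
    int iz = domainIndex(pos[2], box[2], procGrid[2]);
    return ix + procGrid[0] * (iy + procGrid[1] * iz);
}
```

For a 10 x 10 x 10 box on a 2 x 2 x 2 processor grid, an atom at (7.5, 2.5, 7.5) falls in domain (1, 0, 1), which is rank 5 under this ordering.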
Caution: When doing scaling studies, it is important to distinguish between the problem setup phase and the problem execution phase. Both are important to the workflow of doing molecular dynamics, but it is the execution phase we want to quantify in the scaling studies described below, because it dominates the execution time for long runs (millions of time steps). The problem setup can be an appreciable fraction of the execution time for short runs (the default is 100 time steps), and including it can lead to erroneous conclusions.
This code is configured with timers. The times are reported per particle, and the timers for the force calculation, timestep, etc. start after the initialization phase is done.
A weak scaling test fixes the amount of work per processor and compares the execution time across different numbers of processors. Weak scaling keeps the ratio of inter-processor communication (surface) to intra-processor work (volume) fixed. The amount of intra-processor work scales with the number of atoms in the domain, and O(1000) atoms per domain are needed for reasonable performance.
Examples,
In general, it is wise to keep the ratio of processor count to system size in each direction fixed (i.e. cubic domains): xproc_0 / nx_0 = xproc_1 / nx_1, since this minimizes the surface area to volume ratio. Feel free to experiment; you might learn something about algorithms to optimize communication relative to work.
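The preference for cubic domains can be checked with a little arithmetic. The sketch below, with hypothetical helper names, computes the surface-to-volume ratio of a box-shaped domain measured in unit cells, and the global cell count needed along one axis to keep the cells per domain fixed as the processor count in that direction grows.

```c
/* Surface-to-volume ratio of a box-shaped domain with edge lengths
 * lx, ly, lz (any consistent unit, e.g. unit cells). */
double surfaceToVolume(double lx, double ly, double lz)
{
    return 2.0 * (lx * ly + ly * lz + lz * lx) / (lx * ly * lz);
}

/* Global number of unit cells along one axis for a weak scaling run
 * that keeps cellsPerDomain fixed on nProc processors in that axis. */
int weakScaledCells(int cellsPerDomain, int nProc)
{
    return cellsPerDomain * nProc;
}
```

A 20 x 20 x 20 domain has a ratio of 0.3, while an 80 x 10 x 10 domain with the same volume has a ratio of 0.425, so the elongated domain exposes more surface for the same amount of work.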
A strong scaling test fixes the total problem size and compares the execution time for different numbers of processors. Strong scaling increases the ratio of inter-processor communication (surface) to intra-processor work (volume).
Examples,
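As a rough illustration of how strong scaling shrinks the work per domain, the following sketch computes the average number of atoms per domain for a fixed problem size and a given processor grid. The function name is hypothetical, and the sketch assumes the atoms are spread evenly over the domains.

```c
/* Average atoms per domain for an FCC problem of nx * ny * nz unit
 * cells (4 atoms per cell) on an xproc * yproc * zproc processor
 * grid. Uses integer division; assumes an even distribution. */
long atomsPerDomain(int nx, int ny, int nz, int xproc, int yproc, int zproc)
{
    long nAtoms = 4L * nx * ny * nz;
    long nDomains = (long)xproc * yproc * zproc;
    return nAtoms / nDomains;
}
```

For the default 20 x 20 x 20 problem this gives 32,000 atoms per domain on 1 processor, 4,000 on a 2 x 2 x 2 grid, and 500 on a 4 x 4 x 4 grid, which is already below the O(1000) atoms per domain suggested for reasonable performance.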
The domain decomposition requires O(1000) atoms per domain and begins to scale poorly for small numbers of atoms per domain. Again, feel free to experiment; you might learn something here as well. For example, when molecular dynamics codes were written for vector supercomputers, large lists of force pairs were created for the vector processor. These force lists provided a natural force decomposition for early parallel computers (S. J. Plimpton, Fast Parallel Algorithms for Short-Range Molecular Dynamics, J Comp Phys, 117, 1-19 (1995)). Using replicated data, force decomposition can scale to fewer than one atom per processor and is a natural mechanism to exploit intra-processor parallelism.
For further details see for example: https://support.scinet.utoronto.ca/wiki/index.php/Introduction_To_Performance