CoMD implements a simple and extensible system of internal timers to measure the performance profile of the code.

As explained in performanceTimers.c, it is easy to create additional timers and associate them with code regions of specific interest. In addition, the getTime() and getTick() functions can easily be reimplemented to take advantage of platform-specific timing resources.
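As an illustration of such a reimplementation, the following is a minimal sketch based on the POSIX monotonic clock. The actual signatures of getTime() and getTick() are defined in performanceTimers.c and are not reproduced here. The names getTimeNs() and getTickNs() are hypothetical. The sketch assumes that the time function returns an integer tick count and that the tick function returns the number of seconds per tick.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* expose clock_gettime() on strict compilers */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for getTime(): returns the current reading of the
 * POSIX monotonic clock as an integer number of nanoseconds. */
static uint64_t getTimeNs(void)
{
   struct timespec ts;
   int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
   assert(rc == 0);   /* CLOCK_MONOTONIC is assumed to be available */
   (void)rc;          /* silence unused-variable warnings under NDEBUG */
   return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

/* Hypothetical stand-in for getTick(): seconds represented by one unit of
 * the value returned by getTimeNs(). */
static double getTickNs(void)
{
   return 1.0e-9;
}
```

Under these assumptions, an elapsed time in seconds is the difference of two getTimeNs() readings multiplied by getTickNs(). A monotonic clock is used in this sketch because it is not affected by adjustments to the system time.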

A timing report is printed at the end of each simulation.

Timings for Rank 0
        Timer        # Calls    Avg/Call (s)   Total (s)    % Loop
___________________________________________________________________
total                      1      50.6701       50.6701      100.04
loop                       1      50.6505       50.6505      100.00
timestep                   1      50.6505       50.6505      100.00
  position             10000       0.0000        0.0441        0.09
  velocity             20000       0.0000        0.0388        0.08
  redistribute         10001       0.0003        3.4842        6.88
    atomHalo           10001       0.0002        2.4577        4.85
  force                10001       0.0047       47.0856       92.96
    eamHalo            10001       0.0001        1.0592        2.09
commHalo               60006       0.0000        1.7550        3.46
commReduce                12       0.0000        0.0003        0.00
Timing Statistics Across 8 Ranks:
        Timer        Rank: Min(s)       Rank: Max(s)      Avg(s)    Stdev(s)
_____________________________________________________________________________
total                3:   50.6697       0:   50.6701     50.6699      0.0001
loop                 0:   50.6505       4:   50.6505     50.6505      0.0000
timestep             0:   50.6505       4:   50.6505     50.6505      0.0000
  position           2:    0.0437       0:    0.0441      0.0439      0.0001
  velocity           2:    0.0380       4:    0.0392      0.0385      0.0004
  redistribute       0:    3.4842       1:    3.7085      3.6015      0.0622
    atomHalo         0:    2.4577       7:    2.6441      2.5780      0.0549
  force              1:   46.8624       0:   47.0856     46.9689      0.0619
    eamHalo          3:    0.2269       6:    1.2936      1.0951      0.3344
commHalo             3:    1.0803       6:    2.1856      1.9363      0.3462
commReduce           6:    0.0002       2:    0.0003      0.0003      0.0000
---------------------------------------------------
 Average atom update rate:   9.39 us/atom/task
---------------------------------------------------

This report consists of two blocks. The upper block lists the absolute wall-clock time spent in each timer on rank 0 of the job. The lower block reports the minimum, maximum, average, and standard deviation of the times across all tasks. The ranks where the minimum and maximum values occurred are also reported to aid in identifying hotspots or load imbalances.

The last line of the report gives the atom update rate in microseconds/atom/task. Since this quantity is normalized by both the number of atoms and the number of tasks, it provides a simple figure of merit for comparing performance between runs with different numbers of atoms and different numbers of tasks. Any increase in this number relative to the value obtained for a large number of atoms on a single task represents a loss of parallel efficiency.
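The normalization described above can be sketched as follows. The exact expression CoMD evaluates is not given in this text, so the formula below is an assumption: loop time, converted to microseconds, multiplied by the number of tasks and divided by the number of time steps and the total number of atoms. The function names atomUpdateRate() and parallelEfficiency() and the input values in the usage note are hypothetical.

```c
#include <assert.h>

/* Assumed figure of merit: microseconds per atom per task per time step.
 *   loopSeconds   wall-clock time of the main loop (s)
 *   nSteps        number of time steps executed
 *   nGlobalAtoms  total number of atoms in the simulation
 *   nTasks        number of tasks (ranks) in the job */
static double atomUpdateRate(double loopSeconds, int nSteps,
                             long nGlobalAtoms, int nTasks)
{
   assert(loopSeconds >= 0.0 && nSteps > 0 && nGlobalAtoms > 0 && nTasks > 0);
   return loopSeconds * 1.0e6 * (double)nTasks
          / ((double)nSteps * (double)nGlobalAtoms);
}

/* Parallel efficiency relative to a reference rate, for example one measured
 * for a large number of atoms on a single task. A value below 1.0 means the
 * measured run updates atoms more slowly per task than the reference. */
static double parallelEfficiency(double referenceRate, double measuredRate)
{
   assert(referenceRate > 0.0 && measuredRate > 0.0);
   return referenceRate / measuredRate;
}
```

For illustration only, a hypothetical run of 32000 atoms on 8 tasks that spends 50 s in 10000 steps gives atomUpdateRate(50.0, 10000, 32000, 8) = 1.25 us/atom/task. Against a hypothetical single-task reference of 1.0 us/atom/task, the parallel efficiency under this definition would be 0.8.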

Choosing the problem size correctly has important implications for the reported performance. Small problem sizes may run entirely in the cache of some architectures, leading to very good performance results. For general characterization of performance, it is probably best to choose problem sizes that force the code to access main memory, even though there may be strong-scaling scenarios where the code is indeed running mainly in cache.