NAMD, a widely used parallel molecular dynamics simulation engine, was one of the first CUDA-accelerated applications, and NVIDIA GPUs and CUDA grew substantially in performance and capability during the early stages of CUDA support in NAMD. For more details, read Using Graphics Processors to Accelerate Molecular Modeling Applications and Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters.
On the Oak Ridge National Laboratory (ORNL) Summit system, the debut of the Volta GPU generation enabled NAMD to tackle record-breaking simulation sizes. Read Scalable Molecular Dynamics with NAMD on the Summit System and Early Experiences Porting the NAMD and VMD Molecular Simulation and Analysis Software to GPU-Accelerated OpenPOWER Platforms for further information.
However, the initial NAMD software design became a barrier to further performance gains for small and moderate-sized molecular dynamics simulations, because host CPUs and PCIe did not match the performance improvements made by GPUs over the same period. The capability of Volta GPUs effectively rendered NAMD v2 CPU-bound, necessitating a different strategy to overcome this performance barrier.
With expected GPU performance gains in the future and the growing availability of a variety of dense-GPU HPC platforms, the NAMD team decided to make a significant shift in strategy for NAMD v3, moving away from the traditional “GPU-accelerated” scheme and toward a completely “GPU-resident” mode of operation.
NAMD v3’s new GPU-resident mode is designed for single-node single-GPU simulations, as well as multi-copy and replica-exchange molecular dynamics simulations on GPU clusters and dense multi-GPU systems such as the DGX-2 and DGX-A100. The GPU-resident single-node computing strategy has significantly reduced NAMD’s reliance on CPU and PCIe performance, resulting in excellent performance for small and moderate-sized simulations on cutting-edge NVIDIA Ampere GPUs and dense multi-GPU platforms like the DGX-A100.
The preliminary findings reported in this post are highly promising, but much work remains. As NAMD v3 matures, we expect performance to improve further, and we will later turn our attention to strong scaling on NVLink- and NVSwitch-connected systems.
After switching to GPU-resident mode, NAMD v3 becomes GPU-bound. On the same GPU, v3 is up to 1.9X faster than v2.13. Because v3 is GPU-bound, it benefits more from newer GPUs: on the new A100, v3 is up to 1.4X faster than on the V100. Finally, because v3 is GPU-bound, it scales significantly better on multi-GPU systems for multi-copy simulation. On 8-GPU DGX systems, the combined effect of these software and hardware advances is that v3 throughput on A100 is up to 9X higher than v2.13 throughput on V100.
To propagate molecular systems through time, NAMD uses a time-stepping scheme. Each timestep typically advances the system by 1 or 2 femtoseconds, while the phenomena to be observed occur on the nanosecond-to-microsecond timescale, so millions upon millions of timesteps are required.
The four primary computational bottlenecks in a traditional timestep are as follows:

- Short-range non-bonded forces: evaluation of force and energy terms for atoms that are within a given cutoff and are deemed to interact directly (around 90% of total FLOPS).
- Long-range forces via particle-mesh Ewald (PME): a fast approximation of long-range electrostatic interactions in reciprocal space, involving a few Fourier transforms (around 5% of total FLOPS).
- Bonded forces: terms from the molecular topology are calculated, weighted, and applied to their associated atoms (around 4% of total FLOPS).
- Numerical integration: calculated forces are applied to atom velocities, and the system is propagated in time via a velocity Verlet procedure.
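The velocity Verlet update in the last step can be sketched in a few lines. This is a generic single-particle illustration of the algorithm, not NAMD's implementation:

```python
import numpy as np

def velocity_verlet_step(x, v, force, mass, dt):
    """Advance positions and velocities by one timestep (velocity Verlet)."""
    f0 = force(x)
    x_new = x + v * dt + 0.5 * (f0 / mass) * dt**2   # position update
    f1 = force(x_new)                                # forces at the new positions
    v_new = v + 0.5 * (f0 + f1) / mass * dt          # velocity update, averaged force
    return x_new, v_new

# Example: harmonic oscillator F = -x with k = m = 1, analytic period 2*pi
x, v = np.array([1.0]), np.array([0.0])
dt = 0.01
for _ in range(int(2 * np.pi / dt)):
    x, v = velocity_verlet_step(x, v, lambda q: -q, 1.0, dt)
# After one full period the oscillator returns near its starting point
```

The averaged-force velocity update is what makes the scheme time-reversible and gives it its good long-term energy conservation, which is why it is the standard integrator in molecular dynamics.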
NAMD was one of the first scientific applications to incorporate GPU acceleration. This approach lets NAMD make use of all available compute resources, and it was enough to saturate Maxwell and earlier GPU generations. Since the debut of the Pascal GPU architecture, however, NAMD has had trouble keeping GPUs filled with work. The most recent releases of NAMD (2.13 and 2.14) offload all force terms to the GPU and retrieve the results back to the CPU host, where numerical integration occurs.
Gaps in the blue strip indicate that the GPU idles throughout the simulation while the remaining task, numerical integration, is handled by the CPU. The takeaway is that current GPUs are so fast that leaving even a small fraction of the total FLOPS on the CPU bottlenecks the entire simulation. To continue to benefit from GPU processing capability, the numerical integration step must move to the GPU as well.
NAMD spatially decomposes the molecular system into a collection of sub-domains to enable scalable parallelism, much like a three-dimensional patchwork quilt where each patch represents a subset of atoms. Patches are normally in charge of maintaining relevant atomic data, including total forces, positions, and velocities, as well as propagating their subset of atoms through time using the velocity Verlet algorithm.
Moving this code to the GPU requires dealing with the patch organization. Most numerical integration operations are data-parallel and port readily to CUDA, but because of the focus on scalability, patches are too fine-grained to fully occupy a GPU with integration work. Furthermore, NAMD's default infrastructure communicates GPU-calculated forces back to patches, including patches on remote nodes, to handle remote forces that may arrive from other compute nodes. As a result, the existing patch-wise integration scheme cannot exploit GPU power during integration, yet force data must still pass through that communication infrastructure.
To address this issue, we decided that a separate code path for single-node simulations would be beneficial, allowing us to bypass the CPU infrastructure for communicating forces. We devised a scheme to accomplish the following:
- Fetch forces from the GPU force kernels as soon as they are calculated.
- Format them into a structure-of-arrays (SoA) data structure to allow a regular memory access pattern during integration.
- Aggregate the data from all patches in the simulation and launch a single integration task (with multiple kernels) for the full molecular system.
Patches no longer contain meaningful atomic data and only describe the spatial decomposition. With this architecture in place, the CPU only issues kernels and handles I/O, while current GPUs perform all of the necessary math.
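The structure-of-arrays reformatting mentioned above can be illustrated with a small sketch. The field names and values here are hypothetical, not NAMD's actual data structures:

```python
import numpy as np

# Array-of-structures: one record per atom, as forces might arrive per patch
aos = [
    {"fx": 0.1, "fy": 0.2, "fz": 0.3},
    {"fx": 0.4, "fy": 0.5, "fz": 0.6},
    {"fx": 0.7, "fy": 0.8, "fz": 0.9},
]

# Structure-of-arrays: one contiguous array per component, so consecutive
# threads read consecutive memory locations (coalesced access on a GPU)
soa = {
    "fx": np.array([a["fx"] for a in aos]),
    "fy": np.array([a["fy"] for a in aos]),
    "fz": np.array([a["fz"] for a in aos]),
}

# An integration-style update now sweeps each component array contiguously
dt, mass = 2.0, 12.0
vx = soa["fx"] / mass * dt  # regular, vectorized access pattern
```

On the GPU, the same layout lets one kernel stride through the whole system's coordinates regardless of which patch each atom belongs to.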
As previously stated, the vast majority of integration kernels are data-parallel operations. Some of them, however, require care when being ported. If the user requests longer timesteps (2 femtoseconds or more), it is usually advisable to constrain hydrogen bond lengths throughout the simulation (traditionally called rigid-bond constraints).
This operation occurs during numerical integration and, if implemented poorly, could become a bottleneck. We created a single kernel that solves all of the system's constraints, adapting the SETTLE algorithm for water and the matrix-SHAKE variant for hydrogen bonds that do not involve water.
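As an illustration of the underlying idea (not NAMD's fused SETTLE/M-SHAKE kernel), a minimal SHAKE-style iteration that constrains a single bond length looks like this; the coordinates are hypothetical:

```python
import numpy as np

def shake_bond(x1, x2, m1, m2, target, tol=1e-10, max_iter=50):
    """Iteratively adjust two positions until |x1 - x2| equals target.
    Corrections act along the bond vector, weighted by inverse mass."""
    for _ in range(max_iter):
        d = x1 - x2
        diff = d @ d - target**2          # violation of the squared constraint
        if abs(diff) < tol:
            break
        # Lagrange-multiplier-style correction along the current bond vector
        g = diff / (2.0 * (1.0 / m1 + 1.0 / m2) * (d @ d))
        x1 = x1 - g * d / m1              # heavier atom moves less
        x2 = x2 + g * d / m2
    return x1, x2

# Constrain an O-H pair (hypothetical coordinates, Angstroms) to 0.9572 A
o = np.array([0.0, 0.0, 0.0])
h = np.array([1.1, 0.0, 0.0])
o2, h2 = shake_bond(o, h, 16.0, 1.0, 0.9572)
```

A real system couples many such constraints per molecule, which is why NAMD solves them together in a matrix formulation rather than bond by bond.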
This scheme is only available in NAMD 3.0 alpha builds and presently supports only single-GPU runs. Support for single-trajectory simulations across multiple, fully interconnected NVIDIA GPUs is under development.
Better performance with NAMD v3
To enable the fast single-GPU code path, set the CUDASOAintegrate parameter to on in your NAMD configuration file.
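For example, the single line needed in the configuration file:

```
# Enable the GPU-resident fast path (NAMD 3.0 alpha builds)
CUDASOAintegrate on
```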
It is critical to properly configure the following performance tuning parameters for optimal performance.
stepsPerCycle (NAMD 2.x default = 20)
pairlistsPerCycle (NAMD 2.x default = 2)
margin (NAMD 2.x default = 0)
When CUDASOAintegrate is enabled and these parameters are left undefined in the simulation configuration file, NAMD sets them to the recommended values.
Furthermore, frequent output has a negative impact on performance. Most crucially, outputEnergies (which defaults to 1 in NAMD 2.x) should be increased significantly. With CUDASOAintegrate enabled, if the output parameters are left undefined in the simulation configuration file, NAMD defaults them to the same value as stepsPerCycle.
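A configuration fragment might look like the following. The numeric values shown are illustrative only; leaving the tuning parameters undefined lets NAMD pick its recommended settings:

```
CUDASOAintegrate    on

# Leave these undefined to accept NAMD's recommended values,
# or set them explicitly (illustrative values shown):
# stepsPerCycle     400
# pairlistsPerCycle 40
# margin            8

# Reduce output frequency; far larger than the NAMD 2.x default of 1
outputEnergies      400
```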
Minimization is not yet supported on the fast path. For inputs that require minimization, run minimization and dynamics as two separate NAMD runs: set CUDASOAintegrate to off for the minimization run, then restart the dynamics run on the fast path.
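In practice this means two configuration files along these lines (filenames and step counts are hypothetical):

```
# min.conf -- minimization on the traditional code path
CUDASOAintegrate    off
minimize            1000

# dyn.conf -- restart dynamics from the minimized state on the fast path
CUDASOAintegrate    on
# load the coordinates/velocities written by the minimization run here
run                 500000
```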
Atom reassignment is still performed on the CPU in NAMD v3, so using multiple CPU cores still offers a slight advantage. Figure 6 depicts v3 performance for the satellite tobacco mosaic virus (STMV) and APOA1 problems using various numbers of CPU cores.
First, performance does not vary considerably with the number of CPU cores. Compared to one core, using ten cores improves STMV performance by 10%; beyond ten cores, the benefit plateaus. This is to be expected, as atom reassignment takes up a small portion of overall time, and the returns from adding more CPU cores diminish.
On the other hand, for the smaller APOA1 problem, using four cores yields a 5% improvement over one core, but using more than four cores reduces performance slightly. This is because more CPU threads introduce more overhead for the GPU computation, and for the smaller APOA1 problem those overheads become non-negligible.
In short, ten CPU cores are sufficient for large problems like STMV, while only a few CPU cores, such as four, are best for small problems like APOA1.
Single-GPU performance
Because v2.13 is CPU-bound, it uses all available CPU cores. For v3, we used 16 cores for STMV and four cores for APOA1.
On V100, the following data demonstrate that v3 is 1.5-1.9X faster than v2.13. Furthermore, v3 on A100 is 1.4X faster than on V100 for STMV and 1.3X faster for APOA1. APOA1 shows a smaller benefit on A100 because its substantially lower atom count cannot fully utilize all of the A100's compute resources.
Using MPS to improve small-problem performance
Small problems like APOA1 cannot fully utilize all A100 resources, as explained earlier in this post. In those scenarios, the Multi-Process Service (MPS) can be used to run several instances on a single GPU simultaneously, increasing overall throughput.
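A sketch of launching multiple instances under MPS; the input filenames and core counts are illustrative:

```
# Start the MPS control daemon (once per node)
nvidia-cuda-mps-control -d

# Launch several independent NAMD instances sharing the same GPU
namd3 +p4 +devices 0 apoa1_run1.conf > run1.log &
namd3 +p4 +devices 0 apoa1_run2.conf > run2.log &
wait

# Shut down MPS when done
echo quit | nvidia-cuda-mps-control
```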
To show this, we used MPS to run multiple instances of the APOA1 NVE problem on a single V100 and A100. The scaling on V100 and A100 is shown in Figure 9. Running multiple instances with MPS improves APOA1 NVE throughput by 1.1X on V100 and 1.2X on A100. The A100 benefits more because it has more streaming multiprocessors than the V100 and was therefore more underutilized: it takes about two instances to saturate the V100 but about three to saturate the A100.
It is also worth noting that the saturated APOA1 NVE performance on A100 is roughly 1.4X that on V100, matching the expectation set by the larger STMV problem. Finally, you could also use MPS to run different inputs simultaneously and achieve similar benefits.
Improving small-problem performance with MIG
Multi-Instance GPU (MIG) is a new feature of the A100. MIG partitions the A100 GPU into multiple instances, each with its own compute, memory, cache, and memory bandwidth. This enables multiple processes to run in parallel with hardware isolation for predictable latency and throughput, and is another option for improving GPU utilization on small problems.
The most significant distinction between MIG and MPS is that MPS does not partition hardware resources between application processes. MIG provides stronger isolation than MPS because it prevents one process from interfering with another. The A100 GPU can be partitioned into 1/2/3/7 instances using MIG. Figure 10 depicts the APOA1 NVE performance of 1/2/3/7 instances running on a single A100 GPU using MIG.
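MIG instances are created with nvidia-smi. A sketch for seven instances on an A100 (profile IDs vary by GPU model; list them first):

```
# Enable MIG mode on GPU 0 (may require a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# List the available GPU instance profiles for this GPU
sudo nvidia-smi mig -lgip

# Create seven GPU instances plus their compute instances (-C);
# profile 19 corresponds to the 1g.5gb slice on A100
sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C
```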
Multi-copy simulation scaling on multi-GPU
Even though the GPU improvements in v3 currently target single-GPU runs, they also benefit some multi-GPU use cases, such as multi-copy simulations. When many replica simulations run simultaneously on a multi-GPU system, the CPU cores available to each run are fewer than when a single simulation runs alone. Because NAMD v2.13 is CPU-bound, its multi-copy scalability is expected to be less than ideal. In contrast, because v3's performance depends much less on the CPU, it scales much better.
To demonstrate this, we ran multiple STMV NVE simulations on multiple GPUs simultaneously, with each run using an equal share of the available CPU cores (the total divided by the number of GPUs). Figure 11 compares the scalability of v2.13 and v3 from one to eight GPUs. As you can see, v2.13 does not scale well: eight V100s are only around 2.2X faster than one V100. In contrast, v3 achieved nearly linear speedup on both eight V100s and eight A100s. As a result, for multi-copy simulation, v3 on eight A100s delivers 9X the throughput of v2.13 on eight V100s.