Emulating DMA Engines on GPUs for Performance and Portability

Supercomputing 2011 Paper

Getting Started

Programming Model

CudaDMA Object API

Version 2.0

CudaDMA Sequential

CudaDMA Strided

CudaDMA Indirect

Best Practices

Buffering Techniques


CudaDMA Object API

Every CudaDMA object must implement the base CudaDMA object API. The API consists of a constructor and collection of methods on the CudaDMA object that enable the programmer to:

The template of the CudaDMA API for each transfer pattern object is shown below.

class cudaDMA {
  // Base Constructor
  __device__ cudaDMA (const int dmaID,
                      const int num_dma_threads,
                      const int num_compute_threads,
                      const int dma_threadIdx_start);
  __device__ bool owns_this_thread();
  // Compute thread synchronization functions
  __device__ void start_async_dma();
  __device__ void wait_for_dma_finish();
  // DMA thread synchronization functions
  __device__ void wait_for_dma_start();
  __device__ void finish_async_dma();
  // DMA thread transfer operations
  __device__ void execute_dma(void *src_ptr,
                              void *dst_ptr) const;
  __device__ void execute_dma_no_sync(void *src_ptr,
                                      void *dst_ptr) const;


The constructor for a CudaDMA object provides information to the implementation of the object that enables it to function correctly. These fields must be supplied as a part of all CudaDMA objects. The fields supplied here only matter for the case where warp specialization is being used as described by the CudaDMA programming model.

Managing DMA and Compute Warps

There are several different function calls for managing DMA threads and allowing them to synchronize with compute threads. We describe each of them in turn.

Performing Transfers

There are two different function calls for performing data transfers. They both take the same arguments and will perform the same operation. The difference is whether or not synchronization is performed by the DMA threads when executing the DMA operation.