CudaDMA

Emulating DMA Engines on GPUs for Performance and Portability

CudaDMASequential Version 2.0

CudaDMASequential Version 2.0 has the same semantics and is characterized by the same parameters as the original CudaDMASequential.

Constructors

Between three and five template parameters are supported for the new CudaDMASequential transfer pattern. All constructors support the new option of specifying BYTES_PER_THREAD, the number of bytes of outstanding LDG loads issued per thread. The value of BYTES_PER_THREAD must be a multiple of ALIGNMENT. Selecting a value of 4*ALIGNMENT causes CudaDMASequential to default to the Fermi implementation.


/*************************************************/
/* Constructors for use with warp-specialization */
CudaDMASequential<true/*specialized*/, ALIGNMENT, BYTES_PER_THREAD, 
                  XFER_SIZE, DMA_THREADS>
  (dmaID, num_compute_threads, dma_threadIdx_start);

CudaDMASequential<true, ALIGNMENT, BYTES_PER_THREAD, XFER_SIZE>
  (dmaID, num_dma_threads, num_compute_threads, dma_threadIdx_start);

CudaDMASequential<true, ALIGNMENT, BYTES_PER_THREAD>
  (dmaID, num_dma_threads, num_compute_threads, 
   dma_threadIdx_start, xfer_size);

/****************************************************/
/* Constructors for use without warp-specialization */
CudaDMASequential<false/*not specialized*/, ALIGNMENT, 
                  BYTES_PER_THREAD, XFER_SIZE, DMA_THREADS>
  (dma_threadIdx_start/*optional*/);

CudaDMASequential<false, ALIGNMENT, BYTES_PER_THREAD, XFER_SIZE>
  (num_dma_threads/*optional*/, dma_threadIdx_start/*optional*/);

CudaDMASequential<false, ALIGNMENT, BYTES_PER_THREAD>
  (xfer_size, num_dma_threads/*optional*/, 
   dma_threadIdx_start/*optional*/);

Unlike previous versions of CudaDMA, the non-warp-specialized implementations also allow you to specify that only a subset of the available warps should be used for the transfer. These parameters are optional; omitting them defaults to using all of the threads in the threadblock for the transfer.
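As an illustration, the sketch below shows how a version 2.0 constructor might be invoked inside a kernel. The kernel name, buffer size, and all numeric values are hypothetical and are only meant to show where the template and runtime arguments go; it assumes the version 2.0 header has been included.


/* Hypothetical sketch: a 16-byte aligned, 8 KB transfer into shared  */
/* memory.  Names and sizes are illustrative only.                    */
__global__ void example_kernel(const float *src)
{
  __shared__ float buffer[2048];   // 8192-byte staging buffer

  // Warp-specialized: 128 compute threads plus 32 DMA threads whose
  // threadIdx.x values start at 128; BYTES_PER_THREAD is 4*ALIGNMENT.
  CudaDMASequential<true/*specialized*/, 16/*ALIGNMENT*/,
                    4*16/*BYTES_PER_THREAD*/, 8192/*XFER_SIZE*/,
                    32/*DMA_THREADS*/>
    dma_ld(1/*dmaID*/, 128/*num_compute_threads*/,
           128/*dma_threadIdx_start*/);

  // Non-warp-specialized alternative using every thread in the block:
  // CudaDMASequential<false, 16, 4*16, 8192> dma_all;

  /* ... transfer and compute phases as shown under Transfer Functions ... */
}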

Transfer Functions

CudaDMASequential supports the following transfer functions.


class CudaDMASequential {
public:
  // One-Phase Versions
  __device__ void execute_dma(const void *src_ptr, void *dst_ptr);

  template<bool GLOBAL_LOAD>
  __device__ void execute_dma(const void *src_ptr, void *dst_ptr);

  template<bool GLOBAL_LOAD, CudaDMALoadQualifier LOAD_QUAL, 
           CudaDMAStoreQualifier STORE_QUAL>
  __device__ void execute_dma(const void *src_ptr, void *dst_ptr);

  // Two-Phase Versions
  __device__ void start_xfer_async(const void *src_ptr);

  template<bool GLOBAL_LOAD>
  __device__ void start_xfer_async(const void *src_ptr);

  template<bool GLOBAL_LOAD, CudaDMALoadQualifier LOAD_QUAL, 
           CudaDMAStoreQualifier STORE_QUAL>
  __device__ void start_xfer_async(const void *src_ptr);

  __device__ void wait_xfer_finish(void *dst_ptr);

  template<bool GLOBAL_LOAD>
  __device__ void wait_xfer_finish(void *dst_ptr);

  template<bool GLOBAL_LOAD, CudaDMALoadQualifier LOAD_QUAL, 
           CudaDMAStoreQualifier STORE_QUAL>
  __device__ void wait_xfer_finish(void *dst_ptr);
};
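As a sketch of how these functions compose, the hypothetical non-warp-specialized kernel below issues the loads with start_xfer_async, overlaps unrelated work, and then completes the stores into shared memory with wait_xfer_finish. The kernel name, sizes, and placeholder comments are illustrative only.


__global__ void two_phase_example(const float *src)
{
  __shared__ float buffer[2048];            // 8192-byte staging buffer

  // All threads in the threadblock participate in the transfer.
  CudaDMASequential<false/*not specialized*/, 16/*ALIGNMENT*/,
                    4*16/*BYTES_PER_THREAD*/, 8192/*XFER_SIZE*/> dma_ld;

  dma_ld.start_xfer_async(src);             // phase 1: issue the loads
  /* ... independent work that does not touch 'buffer' ... */
  dma_ld.wait_xfer_finish(buffer);          // phase 2: complete the stores

  __syncthreads();                          // buffer now visible to all threads
  /* ... compute using 'buffer' ... */

  // Equivalent one-phase form:
  // dma_ld.execute_dma(src, buffer);
}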

Diagnostic Functions

CudaDMASequential implements the following host-side diagnostic function. The diagnose method is a static member and itself takes no template parameters.


template<...>
class CudaDMASequential
{
public:
  __host__
  static void diagnose(int alignment, int bytes_per_thread, 
                       int bytes_per_elmt, int num_dma_threads, 
                       bool fully_templated, bool verbose = false);
};

// Example invocation
CudaDMASequential<...>::diagnose(/*arguments*/);
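For instance, a host-side check of a hypothetical configuration (16-byte alignment, 64 bytes per thread, an 8192-byte transfer, 128 DMA threads) might look like the following; the template arguments are placeholders and all values are illustrative.


// Hypothetical diagnostic call with made-up values.
CudaDMASequential<true,16,64,8192,128>::diagnose(
    16/*alignment*/, 64/*bytes_per_thread*/, 8192/*bytes_per_elmt*/,
    128/*num_dma_threads*/, true/*fully_templated*/, true/*verbose*/);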

CudaDMASequential Version 1.0

Pattern Description

The CudaDMASequential transfer pattern is used for transferring a contiguous block of memory. Only two parameters are required to characterize a sequential transfer; as the constructors below show, these are the byte alignment of the transfer (ALIGNMENT) and the size of the transfer in bytes (XFER_SIZE).

Constructors

There are three constructors for the CudaDMASequential transfer pattern. Different constructors all describe the same sequential transfer pattern, but allow for different parameters to be supplied as compile-time constants via template parameters. Below are models for invoking the three constructors for CudaDMASequential.


/* Constructors for use with warp specialization */
cudaDMASequential<true/*specialized*/, ALIGNMENT,
                  XFER_SIZE, DMA_THREADS>
  (dmaID, num_compute_threads, dma_threadIdx_start);

cudaDMASequential<true,ALIGNMENT,XFER_SIZE>
  (dmaID, num_dma_threads, num_compute_threads, dma_threadIdx_start);

cudaDMASequential<true,ALIGNMENT>
  (dmaID, num_dma_threads, num_compute_threads, 
   dma_threadIdx_start, xfer_size);

/* Constructors for use without warp specialization */
cudaDMASequential<false/*not specialized*/,ALIGNMENT,
                  XFER_SIZE,TOTAL_THREADS>
  ();

cudaDMASequential<false,ALIGNMENT,XFER_SIZE>
  ();

cudaDMASequential<false,ALIGNMENT>
  (xfer_size);

The first constructor allows the user to supply the largest number of compile-time constants as template parameters: ALIGNMENT, XFER_SIZE, and the number of DMA_THREADS can all be specified as compile-time constants. The second constructor keeps XFER_SIZE as a compile-time constant while making the number of DMA threads a dynamic parameter. The last constructor makes the transfer size a dynamic parameter as well. All other parameters are base parameters required by the CudaDMA API.

For the non-warp-specialized constructors, the total threads parameter indicates to the CudaDMA object how many threads should be used to perform the transfer. In cases where the total thread count is not specified as a compile-time constant, blockDim.x is used as the number of threads performing the transfer.
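To illustrate, the sketch below builds on the second warp-specialized constructor: 128 compute threads and 32 DMA threads stage a 4 KB block of floats into shared memory. The kernel name, buffer size, and thread counts are hypothetical; the synchronization calls (owns_this_thread, execute_dma, start_async_dma, wait_for_dma_finish) are the base CudaDMA object functions.


__global__ void v1_example(const float *src)
{
  __shared__ float buffer[1024];            // 4096-byte staging buffer

  cudaDMASequential<true/*specialized*/, 16/*ALIGNMENT*/, 4096/*XFER_SIZE*/>
    dma_ld(1/*dmaID*/, 32/*num_dma_threads*/, 128/*num_compute_threads*/,
           128/*dma_threadIdx_start*/);

  if (dma_ld.owns_this_thread()) {
    // DMA warps: wait for the request, perform the transfer, signal done.
    dma_ld.execute_dma(src, buffer);
  } else {
    // Compute warps: request the transfer, wait for it, then use the data.
    dma_ld.start_async_dma();
    dma_ld.wait_for_dma_finish();
    /* ... compute using 'buffer' ... */
  }
}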

Performance Considerations

Supplying as many parameters as possible as compile-time constants contributes the most to achieving high performance with CudaDMASequential. In addition, performance can be improved by aligning data to the largest byte alignment possible: 16-byte alignment performs better than 8-byte alignment, and 8-byte alignment performs better than 4-byte alignment.
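As a short illustration of the first point, the two declarations below (with made-up values, as they would appear inside a kernel) contrast the fully templated form, preferred when the configuration is known at compile time, with the mostly dynamic form.


// Preferred when known at compile time: ALIGNMENT, XFER_SIZE, and
// TOTAL_THREADS all supplied as template parameters (illustrative values).
cudaDMASequential<false, 16, 8192, 256> dma_fast;

// Works, but generally slower: only ALIGNMENT is a compile-time constant
// and the transfer size is supplied at runtime.
cudaDMASequential<false, 16> dma_slow(xfer_size);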