Emulating DMA Engines on GPUs for Performance and Portability
CudaDMAStrided Version 2.0 has the same semantics and is characterized by the same parameters as the original CudaDMAStrided.
Constructors take between three and six template parameters for the new CudaDMAStrided transfer pattern. All constructors support the new BYTES_PER_THREAD option, which specifies the number of bytes per thread available for outstanding LDG loads. The value of BYTES_PER_THREAD must be a multiple of ALIGNMENT. Selecting 4*ALIGNMENT causes the implementation to default to the Fermi implementation. A concrete instantiation example follows the constructor models below.
/* Constructors for use with warp specialization */
CudaDMAStrided<true/*specialized warps*/, ALIGNMENT,
BYTES_PER_THREAD, BYTES_PER_ELMT,
DMA_THREADS, NUM_ELMTS>
(dmaID, num_compute_threads, dma_threadIdx_start,
src_stride, dst_stride);
CudaDMAStrided<true, ALIGNMENT, BYTES_PER_THREAD,
BYTES_PER_ELMT, DMA_THREADS, NUM_ELMTS>
(dmaID, num_compute_threads, dma_threadIdx_start,
src_stride); // dst_stride == src_stride
CudaDMAStrided<true, ALIGNMENT, BYTES_PER_THREAD,
BYTES_PER_ELMT, DMA_THREADS>
(dmaID, num_compute_threads, dma_threadIdx_start,
num_elmts, src_stride, dst_stride);
CudaDMAStrided<true, ALIGNMENT, BYTES_PER_THREAD,
BYTES_PER_ELMT, DMA_THREADS>
(dmaID, num_compute_threads, dma_threadIdx_start,
num_elmts, src_stride); // dst_stride == src_stride
CudaDMAStrided<true, ALIGNMENT,
BYTES_PER_THREAD, BYTES_PER_ELMT>
(dmaID, num_dma_threads, num_compute_threads, dma_threadIdx_start,
num_elmts, src_stride, dst_stride);
CudaDMAStrided<true, ALIGNMENT,
BYTES_PER_THREAD, BYTES_PER_ELMT>
(dmaID, num_dma_threads, num_compute_threads, dma_threadIdx_start,
num_elmts, src_stride); // dst_stride == src_stride
CudaDMAStrided<true, ALIGNMENT, BYTES_PER_THREAD>
(dmaID, num_dma_threads, num_compute_threads, dma_threadIdx_start,
bytes_per_elmt, num_elmts, src_stride, dst_stride);
CudaDMAStrided<true, ALIGNMENT, BYTES_PER_THREAD>
(dmaID, num_dma_threads, num_compute_threads, dma_threadIdx_start,
bytes_per_elmt, num_elmts, src_stride); // dst_stride == src_stride
/* Constructors for use without warp specialization */
CudaDMAStrided<false/*not specialized*/, ALIGNMENT, BYTES_PER_THREAD,
BYTES_PER_ELMT, TOTAL_THREADS, NUM_ELMTS>
(src_stride, dst_stride, dma_threadIdx_start/*optional*/);
CudaDMAStrided<false, ALIGNMENT, BYTES_PER_THREAD,
BYTES_PER_ELMT, TOTAL_THREADS, NUM_ELMTS>
(src_stride); // dst_stride == src_stride
CudaDMAStrided<false, ALIGNMENT, BYTES_PER_THREAD,
BYTES_PER_ELMT, TOTAL_THREADS>
(num_elmts, src_stride, dst_stride, dma_threadIdx_start/*optional*/);
CudaDMAStrided<false, ALIGNMENT, BYTES_PER_THREAD,
BYTES_PER_ELMT, TOTAL_THREADS>
(num_elmts, src_stride); // dst_stride == src_stride
CudaDMAStrided<false, ALIGNMENT,
BYTES_PER_THREAD, BYTES_PER_ELMT>
(num_elmts, src_stride, dst_stride,
num_dma_threads/*optional*/, dma_threadIdx_start/*optional*/);
CudaDMAStrided<false, ALIGNMENT, BYTES_PER_THREAD, BYTES_PER_ELMT>
(num_elmts, src_stride); // dst_stride == src_stride
CudaDMAStrided<false, ALIGNMENT, BYTES_PER_THREAD>
(bytes_per_elmt, num_elmts, src_stride, dst_stride,
num_dma_threads/*optional*/, dma_threadIdx_start/*optional*/);
CudaDMAStrided<false, ALIGNMENT, BYTES_PER_THREAD>
(bytes_per_elmt, num_elmts, src_stride); // dst_stride == src_stride
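As a concrete illustration, the following sketch instantiates the fully templated, warp-specialized form; all sizes are hypothetical, and BYTES_PER_THREAD is set to 4*ALIGNMENT, which selects the default Fermi implementation described above.
/* Illustrative instantiation, not part of the API: one warp of DMA
   threads (threads 128-159) transfers 8 elements of 256 bytes each
   on behalf of 128 compute threads. */
CudaDMAStrided<true/*specialized*/, 16/*ALIGNMENT*/,
               64/*BYTES_PER_THREAD = 4*ALIGNMENT*/,
               256/*BYTES_PER_ELMT*/, 32/*DMA_THREADS*/,
               8/*NUM_ELMTS*/>
  dma_ld(1/*dmaID*/, 128/*num_compute_threads*/,
         128/*dma_threadIdx_start*/,
         512/*src_stride*/, 256/*dst_stride*/);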
Unlike previous versions of CudaDMA, the non-warp-specialized implementations also allow you to specify that only a subset of the available warps be used. These parameters are optional; omitting them defaults to using all the threads in the threadblock for the transfer. A sketch restricting a transfer to a subset of warps appears below.
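For example, a non-warp-specialized transfer could be restricted to the first two warps of the threadblock as sketched here; every value is illustrative.
/* Illustrative instantiation: only 64 threads, starting at
   threadIdx.x == 0, participate in the transfer. */
CudaDMAStrided<false, 16/*ALIGNMENT*/, 64/*BYTES_PER_THREAD*/,
               128/*BYTES_PER_ELMT*/>
  dma_ld(32/*num_elmts*/, 256/*src_stride*/, 128/*dst_stride*/,
         64/*num_dma_threads, optional*/,
         0/*dma_threadIdx_start, optional*/);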
CudaDMAStrided supports the following transfer functions.
class CudaDMAStrided {
public:
// One-Phase Versions
__device__ void execute_dma(const void *src_ptr, void *dst_ptr);
template<bool GLOBAL_LOAD>
__device__ void execute_dma(const void *src_ptr, void *dst_ptr);
template<bool GLOBAL_LOAD, CudaDMALoadQualifier LOAD_QUAL,
CudaDMAStoreQualifier STORE_QUAL>
__device__ void execute_dma(const void *src_ptr, void *dst_ptr);
// Two-Phase Versions
__device__ void start_xfer_async(const void *src_ptr);
template<bool GLOBAL_LOAD>
__device__ void start_xfer_async(const void *src_ptr);
template<bool GLOBAL_LOAD, CudaDMALoadQualifier LOAD_QUAL,
CudaDMAStoreQualifier STORE_QUAL>
__device__ void start_xfer_async(const void *src_ptr);
__device__ void wait_xfer_finish(void *dst_ptr);
template<bool GLOBAL_LOAD>
__device__ void wait_xfer_finish(void *dst_ptr);
template<bool GLOBAL_LOAD, CudaDMALoadQualifier LOAD_QUAL,
CudaDMAStoreQualifier STORE_QUAL>
__device__ void wait_xfer_finish(void *dst_ptr);
};
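The two-phase versions allow the transfer to be overlapped with independent computation. A minimal sketch of this idiom, with a hypothetical kernel and illustrative template arguments, might look like the following.
__global__ void example_kernel(const float *src)
{
  // 32 elements of 128 bytes each, stored densely: 4 KB of shared memory.
  __shared__ float buffer[1024];
  // Illustrative non-warp-specialized object: 16-byte alignment,
  // 64 bytes per thread, 128-byte elements, 256 total threads.
  CudaDMAStrided<false, 16, 64, 128, 256>
    dma(32/*num_elmts*/, 256/*src_stride*/, 128/*dst_stride*/);
  dma.start_xfer_async(src);    // phase one: issue the loads
  /* ... independent computation can overlap with the transfer ... */
  dma.wait_xfer_finish(buffer); // phase two: store into shared memory
  __syncthreads();
  /* ... consume the data in buffer ... */
}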
CudaDMAStrided implements the following host-side diagnostic function. It should be invoked with no template parameters.
template<...>
class CudaDMAStrided
{
public:
__host__ static
void diagnose(int alignment, int bytes_per_thread,
int bytes_per_elmt, int num_dma_threads,
int num_elements, bool fully_templated,
bool verbose = false);
};
// Example invocation
CudaDMAStrided<>::diagnose(/*arguments*/);
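For instance, a host program could check a planned transfer of 32 elements of 128 bytes each before launching the kernel; the values here are purely illustrative.
// Illustrative host-side check of a planned transfer.
CudaDMAStrided<>::diagnose(16/*alignment*/, 64/*bytes_per_thread*/,
                           128/*bytes_per_elmt*/, 256/*num_dma_threads*/,
                           32/*num_elements*/, true/*fully_templated*/,
                           true/*verbose*/);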
The CudaDMAStrided pattern is used for transferring multiple blocks of data that are evenly spaced apart in memory. Every block of memory must be of the same size. We refer to these blocks of memory as elements.
The CudaDMAStrided pattern can be characterized by five different parameters: ALIGNMENT (the byte alignment of the elements), BYTES_PER_ELMT (the size of each element in bytes), NUM_ELMTS (the number of elements to transfer), src_stride (the stride in bytes between the starts of consecutive source elements), and dst_stride (the stride in bytes between the starts of consecutive destination elements).
Both stride parameters measure from the start of one element to the start of the next. The destination stride must be at least as large as BYTES_PER_ELMT or the behavior of the transfer is undefined. Also, the ALIGNMENT parameter must apply to both the source and destination locations of every element, which implies that both src_stride and dst_stride must be divisible by ALIGNMENT.
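These constraints can be checked mechanically. The helper below is a sketch, not part of the CudaDMA API, that validates a strided layout before a transfer object is constructed.
// Hypothetical helper, not part of CudaDMA: returns true iff the
// layout satisfies the documented constraints.
__host__ __device__ inline
bool valid_strided_layout(int alignment, int bytes_per_elmt,
                          int src_stride, int dst_stride)
{
  return (dst_stride >= bytes_per_elmt) && // destination elements cannot overlap
         (src_stride % alignment == 0) &&  // every source element stays aligned
         (dst_stride % alignment == 0);    // every destination element stays aligned
}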
The CudaDMAStrided pattern can be instantiated by a variety of constructors that allow different parameters to be passed as compile-time constants via template parameters. In addition, there are special constructors for the common case where the destination stride is equal to the size of the elements. Models for each of the different constructors for the CudaDMAStrided pattern can be seen below.
/* Constructors for use with warp specialization */
cudaDMAStrided<true/*specialized warps*/, ALIGNMENT,
BYTES_PER_ELMT, DMA_THREADS, NUM_ELMTS>
(dmaID, num_compute_threads, dma_threadIdx_start,
src_stride, dst_stride);
cudaDMAStrided<true, ALIGNMENT, BYTES_PER_ELMT,
DMA_THREADS, NUM_ELMTS>
(dmaID, num_compute_threads, dma_threadIdx_start,
src_stride); // dst_stride == BYTES_PER_ELMT
cudaDMAStrided<true, ALIGNMENT, BYTES_PER_ELMT, DMA_THREADS>
(dmaID, num_compute_threads, dma_threadIdx_start,
num_elmts, src_stride, dst_stride);
cudaDMAStrided<true, ALIGNMENT, BYTES_PER_ELMT, DMA_THREADS>
(dmaID, num_compute_threads, dma_threadIdx_start,
num_elmts, src_stride); // dst_stride == BYTES_PER_ELMT
cudaDMAStrided<true, ALIGNMENT, BYTES_PER_ELMT>
(dmaID, num_dma_threads, num_compute_threads, dma_threadIdx_start,
num_elmts, src_stride, dst_stride);
cudaDMAStrided<true, ALIGNMENT, BYTES_PER_ELMT>
(dmaID, num_dma_threads, num_compute_threads, dma_threadIdx_start,
num_elmts, src_stride); // dst_stride == BYTES_PER_ELMT
cudaDMAStrided<true, ALIGNMENT>
(dmaID, num_dma_threads, num_compute_threads, dma_threadIdx_start,
bytes_per_elmt, num_elmts, src_stride, dst_stride);
cudaDMAStrided<true, ALIGNMENT>
(dmaID, num_dma_threads, num_compute_threads, dma_threadIdx_start,
bytes_per_elmt, num_elmts,
src_stride); // dst_stride == BYTES_PER_ELMT
/* Constructors for use without warp specialization */
cudaDMAStrided<false/*not specialized*/, ALIGNMENT,
BYTES_PER_ELMT, TOTAL_THREADS, NUM_ELMTS>
(src_stride, dst_stride);
cudaDMAStrided<false, ALIGNMENT, BYTES_PER_ELMT,
TOTAL_THREADS, NUM_ELMTS>
(src_stride); // dst_stride == BYTES_PER_ELMT
cudaDMAStrided<false, ALIGNMENT, BYTES_PER_ELMT, TOTAL_THREADS>
(num_elmts, src_stride, dst_stride);
cudaDMAStrided<false, ALIGNMENT, BYTES_PER_ELMT, TOTAL_THREADS>
(num_elmts, src_stride); // dst_stride == BYTES_PER_ELMT
cudaDMAStrided<false, ALIGNMENT, BYTES_PER_ELMT>
(num_elmts, src_stride, dst_stride);
cudaDMAStrided<false, ALIGNMENT, BYTES_PER_ELMT>
(num_elmts, src_stride); // dst_stride == BYTES_PER_ELMT
cudaDMAStrided<false, ALIGNMENT>
(bytes_per_elmt, num_elmts, src_stride, dst_stride);
cudaDMAStrided<false, ALIGNMENT>
(bytes_per_elmt, num_elmts,
src_stride); // dst_stride == BYTES_PER_ELMT
Each constructor takes the base set of CudaDMA object arguments required by the CudaDMA API. In addition, each constructor takes the set of arguments required to fully characterize the CudaDMAStrided transfer pattern.
For the non-warp-specialized constructors, the total threads parameter gives the total number of threads used to perform the transfer. When total threads is not supplied as a compile-time constant, it defaults to blockDim.x; this default applies only to the non-warp-specialized case.
The highest performance for the CudaDMAStrided pattern is achieved when as many arguments as possible are specified as compile-time constants. Note that supplying the number of DMA threads as a compile-time constant matters more than supplying the number of elements to be transferred. Higher performance can also be achieved by ensuring larger alignment of the elements: 16-byte alignment will perform better than 8-byte alignment, and 8-byte alignment will perform better than 4-byte alignment. The sketch below illustrates the compile-time trade-off.
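The two instantiations below request the same transfer through the original API (using the lowercase class name from the models above); all sizes are hypothetical. The first fixes every parameter at compile time and is the form to prefer when the sizes are known statically.
/* Fully templated: ALIGNMENT, BYTES_PER_ELMT, DMA_THREADS, and
   NUM_ELMTS are all compile-time constants. */
cudaDMAStrided<true, 16, 128, 32, 64>
  dma_fast(1/*dmaID*/, 128/*num_compute_threads*/,
           128/*dma_threadIdx_start*/,
           256/*src_stride*/, 128/*dst_stride*/);
/* Runtime parameters: the same transfer, but the element size and
   count are only known at run time, limiting compiler optimization. */
cudaDMAStrided<true, 16>
  dma_slow(1/*dmaID*/, 32/*num_dma_threads*/,
           128/*num_compute_threads*/, 128/*dma_threadIdx_start*/,
           128/*bytes_per_elmt*/, 64/*num_elmts*/,
           256/*src_stride*/, 128/*dst_stride*/);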