Usage and Examples
This library is broken up into three main parts, as well as a certain compilation and linking framework:
- Core Examples
- Array Examples
- BLAS Examples
- Compilation and Linking
- Notes
The Core.h header contains the necessary macros, flags,
and objects for interfacing with basic kernel launching and the CUDA
Runtime API. The Array.h header contains the
CudaTools::Array class, which provides a device-compatible,
array-like class with easy memory management. The
BLAS.h header provides BLAS functions through
the cuBLAS library for the GPU and Eigen for the CPU. Lastly, a
template Makefile is provided, which can be used for your own project
after following a few rules.
The usage of this library will be illustrated through examples, and further details can be found in the other sections. The examples are given in the samples folder. Throughout this documentation, a few common terms appear. First, we refer to the CPU as the host and the GPU as the device. So, a host function is a function runnable on the CPU, and a device function is a function runnable on the device. A kernel is a specific function that the host can call to be run on the device.
Core Examples
This file mainly introduces compiler macros and a few classes that are used to improve the syntax between host and device code. To define and call a kernel, there are a few macros provided. For example,
DEFINE_KERNEL(add, int x, int y) {
    printf("Kernel: %i\n", x + y);
}

int main() {
    KERNEL(add, CudaTools::Kernel::basic(1), 1, 1); // Prints 2.
    return 0;
}
The DEFINE_KERNEL(name, ...) macro takes in the function
name and its arguments. The second argument of the KERNEL()
macro is the launch parameters for the kernel. The launch parameters
have several items, but for 'embarrassingly parallel' cases, we can
simply generate the settings from the number of threads. More detail
on creating launch parameters can be found here <CudaTools::Kernel::Settings>. In the above
example, there is only one thread. The rest of the arguments are just
the kernel arguments. For more detail, see here <Macro Functions>.
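To make "generate the settings from the number of threads" concrete, here is a minimal sketch of how one-dimensional launch parameters are commonly derived from a thread count in CUDA code. It is not the implementation of CudaTools::Kernel::basic(): the LaunchSketch struct, the basicSketch() function, and the block size of 256 are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for launch settings; not the library's Settings type.
struct LaunchSketch {
    std::size_t blocks;          // Number of thread blocks in the grid.
    std::size_t threadsPerBlock; // Number of threads in each block.
};

// Picks a 1D grid large enough to cover `threads` work items.
// The default block size of 256 is an assumed, commonly used value.
inline LaunchSketch basicSketch(std::size_t threads, std::size_t blockSize = 256) {
    assert(blockSize > 0 && "block size must be positive");
    if (threads < blockSize) {
        // Small launches fit in a single block; keep at least one thread.
        return {1, threads > 0 ? threads : 1};
    }
    // Round up so that blocks * blockSize >= threads.
    return {(threads + blockSize - 1) / blockSize, blockSize};
}
```

Under these assumptions, a launch over 1000 items uses 4 blocks of 256 threads, which leaves 24 threads without work. This is why kernels normally guard against indices past the end of the data.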
Warning
These kernel definitions must be in a file that will be compiled by
nvcc. Also, for header files, there is an additional macro
DECLARE_KERNEL(name, ...) to declare it and make it
available to other files.
Since many applications use classes, a macro is provided to 'convert' a class into being device-compatible. We follow the previous example in a similar fashion.
class intPair {
    DEVICE_CLASS(intPair)
  public:
    int x, y;

    intPair(const int x_, const int y_) : x(x_), y(y_) {
        allocateDevice();      // Allocates memory for this intPair on the device.
        updateDevice().wait(); // Copies the memory on the host to the device and waits until finished.
    };

    ~intPair() { CudaTools::free(that()); };

    HD void swap() {
        int tmp = x;
        x = y;
        y = tmp;
    };
};

DEFINE_KERNEL(swap, intPair* const pair) { pair->swap(); }

int main() {
    intPair pair(1, 2);
    printf("Before: %i, %i\n", pair.x, pair.y); // Prints 1, 2.
    KERNEL(swap, CudaTools::Kernel::basic(1), pair.that()).wait();
    pair.updateHost().wait(); // Copies the memory from the device back to the host and waits until finished.
    printf("After: %i, %i\n", pair.x, pair.y); // Prints 2, 1.
    return 0;
}
In this example, we create a class called intPair, which
is then made available on the device through the
DEVICE_CLASS(name) macro. Specifically, that macro
introduces a few functions, like allocateDevice(),
updateDevice(), updateHost(), and
that(). The that() function returns a pointer
to the copy on the device. As a result, the programmer
must define a destructor that frees the pointer using
CudaTools::free(that()). For more details, see here <Device Class>.
Warning
The updateDevice() and updateHost() functions in most
cases need to be called explicitly to push the data on the host to
the device, and vice versa. It is the programmer's job to track where
the 'most recent' copy is. If these are not called, various memory
errors can occur. Note that, when passing a pointer to the kernel, it
must be the device pointer. Otherwise, an illegal memory access
will occur.
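The bookkeeping described in this warning can be illustrated with a toy model in plain C++, where the "device" copy is simulated with ordinary heap memory. The PairData, MirroredPair, and swapOnDevice names are hypothetical and do not come from the library. The sketch only shows why the host and device copies drift apart until one of the update functions is called.

```cpp
#include <cassert>
#include <cstring>

struct PairData {
    int x;
    int y;
};

// Toy model: a host copy plus a separately allocated "device" copy.
class MirroredPair {
  public:
    PairData host;

    MirroredPair(int x, int y) : host{x, y}, device_(new PairData{}) { updateDevice(); }
    ~MirroredPair() { delete device_; }
    MirroredPair(const MirroredPair&) = delete;
    MirroredPair& operator=(const MirroredPair&) = delete;

    void updateDevice() { std::memcpy(device_, &host, sizeof(PairData)); } // host -> device
    void updateHost() { std::memcpy(&host, device_, sizeof(PairData)); }   // device -> host
    PairData* that() { return device_; } // Pointer to the simulated device copy.

  private:
    PairData* device_;
};

// Stand-in for a kernel: it only ever touches the "device" copy.
inline void swapOnDevice(PairData* pair) {
    assert(pair != nullptr);
    const int tmp = pair->x;
    pair->x = pair->y;
    pair->y = tmp;
}
```

In this model, calling swapOnDevice(p.that()) leaves p.host unchanged until p.updateHost() is called, which mirrors the need for an explicit updateHost() after a kernel modifies the object.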
The kernel argument list must consist of pointers to objects or non-reference objects. Otherwise, compilation will fail. In general this is safer, as it forces the programmer to acknowledge that the device copy is being passed. For the latter case of a non-reference object, you should only do this if there is no issue in creating a copy of the original object. In the above example, we could have done this, but for more complicated classes it may result in unwanted behavior.
Lastly, since the point of classes is usually to have some member
functions, to have them available on the device, you must mark them with
the compiler macro HD in front.
We also introduce the wait() function, which waits for
the command to complete before continuing. Most calls that involve the
device are asynchronous, so without proper blocking, operations
dependent on a previous command are not guaranteed to run correctly. If
the code is compiled for CPU, then everything will run synchronously, as
per usual.
Note
Almost all functions that are asynchronous provide an optional
'stream' argument, where you can give the name of the stream you wish to
use. Different streams run asynchronously with respect to each other, but
operations on the same stream are FIFO. To define a stream to use later, you must call
CudaTools::Manager::get()->addStream("myStream") at some
point before you use it. For more details, see here <CudaTools::Manager>.
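The ordering rules for wait() and named streams can be pictured with a small single-threaded model. This is only a sketch: the StreamModel class is hypothetical, and real device streams execute in the background rather than at the moment wait() is called. It does show the two properties that matter here: a stream must exist before it is used, and commands on one stream complete in the order they were queued.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy model of named streams: commands are queued per stream and are only
// guaranteed to have finished once wait() returns.
class StreamModel {
  public:
    void addStream(const std::string& name) { queues_[name]; }

    void enqueue(const std::string& name, std::function<void()> command) {
        auto it = queues_.find(name);
        assert(it != queues_.end() && "stream must be added before use");
        it->second.push_back(std::move(command));
    }

    // Runs everything queued on `name`, oldest first (FIFO).
    void wait(const std::string& name) {
        auto it = queues_.find(name);
        assert(it != queues_.end() && "stream must be added before use");
        while (!it->second.empty()) {
            std::function<void()> command = std::move(it->second.front());
            it->second.pop_front();
            command();
        }
    }

  private:
    std::map<std::string, std::deque<std::function<void()>>> queues_;
};
```

Reading a result before wait() in this model sees the old value, which is the same class of bug as reading host data before an asynchronous copy has finished.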
Array Examples
This file introduces the Array class, which
provides automatic memory management between device and host. In
particular, it provides functionality on both the host and device while
handling proper memory destruction. It also mimics many features of the
Python package NumPy. We can demonstrate a few here.
DEFINE_KERNEL(times2, const CudaTools::Array<int> arr) {
    CudaTools::Array<int> flat = arr.flattened();
    BASIC_LOOP(arr.shape().items()) { flat[iThread] *= 2; }
}

DEFINE_KERNEL(times2double, const CudaTools::Array<double> arr) {
    CudaTools::Array<double> flat = arr.flattened();
    BASIC_LOOP(arr.shape().items()) { flat[iThread] *= 2; }
}

int main() {
    CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(0, 10);
    CudaTools::Array<int> arrConst = CudaTools::Array<int>::constant({10}, 1);
    CudaTools::Array<double> arrLinspace = CudaTools::Array<double>::linspace(0, 5, 10);
    CudaTools::Array<int> arrComma({2, 2}); // 2x2 array.
    arrComma << 1, 2, 3, 4;                 // Comma initializer if needed.

    arrRange.updateDevice();
    arrConst.updateDevice();
    arrLinspace.updateDevice();
    arrComma.updateDevice().wait();

    std::cout << "Before Kernel:\n";
    std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";

    // Call the kernel multiple times asynchronously. Note: since they share the same
    // stream, they are not run in parallel, just queued on the device.
    // NOTE: Notice that a view is passed into the kernel, not the Array itself.
    KERNEL(times2, CudaTools::Kernel::basic(arrRange.shape().items()), arrRange.view());
    KERNEL(times2, CudaTools::Kernel::basic(arrConst.shape().items()), arrConst.view());
    KERNEL(times2double, CudaTools::Kernel::basic(arrLinspace.shape().items()), arrLinspace.view());
    KERNEL(times2, CudaTools::Kernel::basic(arrComma.shape().items()), arrComma.view()).wait();

    arrRange.updateHost();
    arrConst.updateHost();
    arrLinspace.updateHost();
    arrComma.updateHost().wait(); // Same stream, so you should wait for the last call.

    std::cout << "After Kernel:\n";
    std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";
    return 0;
}
In this example, we show a few ways to initialize an
Array through some static functions. It is templated, so it
can (theoretically) support any type. Additionally, you can initialize
an empty Array by providing its Shape with an
initializer list (ex: {2, 2}). Many of these array
functions and initializers have view-returning and self-assigning
versions. For instance, .flattened() returns a flattened
view of an Array, and does not modify the original. For more details,
see here <CudaTools::Array<T>>.
We also note the use of BASIC_LOOP(N), which is a macro
for generating the loop automatically in the kernel given the number of
threads. It is intended to be used only for "embarrassingly parallel"
situations and with the CudaTools::Kernel::basic() launch
parameters. If compiling for CPU, it will mark the loop with
#pragma omp parallel for and attempt to use OpenMP for
parallelism.
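For the CPU case, the kind of loop described here can be sketched in plain C++ as follows. This is an illustration of the pattern, not the expansion of BASIC_LOOP itself: the times2Host function is hypothetical, and only the loop variable name iThread is borrowed from the example above. Each iteration touches only its own element, which is what makes the loop safe to parallelize.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-side sketch of an "embarrassingly parallel" loop: iterations are
// independent, so they may run in any order or concurrently.
inline void times2Host(std::vector<int>& data) {
    const long n = static_cast<long>(data.size());
// The pragma is only emitted when OpenMP is enabled (e.g. -fopenmp for GCC);
// otherwise the loop simply runs serially.
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long iThread = 0; iThread < n; ++iThread) {
        data[static_cast<std::size_t>(iThread)] *= 2;
    }
}
```

The result is the same with or without OpenMP, because no iteration reads or writes an element that belongs to another iteration.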
Warning
Notice that a view must be passed to the kernel, and not the original object.
The Array also supports other helpful functions, such as multi-dimensional indexing, slicing, and a few other functions.
int main() {
    CudaTools::Array<int> arr = CudaTools::Array<int>::constant({100}, 0);
    arr.reshape({4, 5, 5}); // Creates a three dimensional array.

    arr[0][0][0] = 1;     // Axis by axis indexing.
    arr[{1, 0, 0}] = 100; // Specific 'coordinate' indexing.
    std::cout << arr << "\n";

    CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(0, 18);
    auto arrSlice = arr.slice({{1, 3}, {1, 4}, {1, 4}}); // Takes a slice of the center.
    std::cout << "Before Copy:\n" << arrSlice << "\n";
    arrSlice = arrRange; // Copies arrRange into arrSlice. (Does NOT replace!)
    std::cout << "After Copy:\n" << arrSlice << "\n";
    std::cout << "Modified: \n"
              << arr << "\n"; // The original array is modified, since a slice does not copy.

    CudaTools::Array<int> newArr = arr.copy(); // Copies the original Array.
    for (auto it = newArr.begin(); it != newArr.end(); ++it) { // Iterate through the array.
        *it = 1;
    }
    std::cout << "Modified New Array:\n" << newArr << "\n";
    std::cout << "Old Array:\n" << arr << "\n"; // The original array was not modified after a copy.
    return 0;
}
In this example, we demonstrate some of the functionality of the
Array. We can do multi-dimensional indexing, take slices of the Array,
and iterate through the Array through an iterator, in C++ fashion.
Particularly, we need to introduce the concept of a "view" of an Array.
An Array either "owns" its data or is a "view" of another Array. You can
create a view manually with the .view() function.
Warning
When using the assignment operator, if a view is on the left-hand side, it will perform a copy of the internal data. However, if the Array is an owner, then it will replace the entire Array and free the old memory. This means any view of that previous Array will now point to invalid places in memory. It is the responsibility of the programmer to manage this.
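The two assignment behaviors in this warning can be modeled with a small one-dimensional toy class. The ToyArray type below is hypothetical and far simpler than CudaTools::Array; it assumes that an owner holds its buffer through a smart pointer and a view holds only a raw pointer. It is meant only to show why assigning to a view writes through to the owner, while assigning to an owner swaps out the buffer underneath any existing views.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Toy 1D array that either owns its buffer or views another array's buffer.
class ToyArray {
  public:
    explicit ToyArray(std::size_t n, int value = 0)
        : storage_(std::make_unique<int[]>(n)), data_(storage_.get()), size_(n) {
        std::fill(data_, data_ + size_, value);
    }
    ToyArray(ToyArray&&) = default;

    ToyArray view() { return ToyArray(data_, size_); } // Shares the buffer, owns nothing.
    bool isView() const { return storage_ == nullptr; }

    ToyArray& operator=(const ToyArray& other) {
        if (this == &other) return *this;
        if (isView()) {
            // View on the left: copy element values into the viewed buffer.
            assert(size_ == other.size_ && "view assignment requires equal sizes");
            std::copy(other.data_, other.data_ + size_, data_);
        } else {
            // Owner on the left: replace the buffer; existing views now dangle.
            auto fresh = std::make_unique<int[]>(other.size_);
            std::copy(other.data_, other.data_ + other.size_, fresh.get());
            storage_ = std::move(fresh);
            data_ = storage_.get();
            size_ = other.size_;
        }
        return *this;
    }

    int& operator[](std::size_t i) { return data_[i]; }
    const int* data() const { return data_; }
    std::size_t size() const { return size_; }

  private:
    ToyArray(int* data, std::size_t n) : storage_(nullptr), data_(data), size_(n) {}

    std::unique_ptr<int[]> storage_; // Null for views.
    int* data_;
    std::size_t size_;
};
```

In this model, any view created before an owner is reassigned keeps pointing at the freed buffer, so it must be recreated with view() before it is used again.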
BLAS Examples
Compilation and Linking
To compile with this library, there are only a few things necessary.
First, it is recommended you use the provided template
Makefile, which can be easily modified to suit your project
needs. By default, it already handles the compilation and linking with
nvcc, so long as you fulfill a few requirements.
- Use the compiler flag CUDA to mark where GPU-specific code is, if necessary.
- Any files that use CUDA functionality (i.e., defining a kernel) should have the file extension .cu.cpp.
- When including the Core.h header file, in only one file, you must define the macro CUDATOOLS_IMPLEMENTATION. That file must also compile for CUDA, so its extension must be .cu.cpp. It's recommended to put this with your kernel definitions.
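As a sketch of the first rule, GPU-specific code can be fenced off with a preprocessor check on the CUDA flag. The example assumes the flag is provided as a preprocessor definition (for instance through -DCUDA) only when building the GPU target. How the provided Makefile actually passes the flag is not shown in this section, and the buildTarget() function is hypothetical.

```cpp
#include <string>

// Reports which code path was compiled in. `CUDA` is assumed to be defined
// only for the GPU build; the CPU build takes the #else branch.
inline std::string buildTarget() {
#ifdef CUDA
    return "gpu"; // GPU-specific code would go here.
#else
    return "cpu"; // Host-only fallback.
#endif
}
```

Keeping both branches in one function lets the same source file serve both builds, which matches the two-target layout described in this section.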
Afterwards the Makefile will have two targets
cpu and gpu, which compile the CPU and GPU
compatible binaries respectively. As an example, we can look at the
whole file for the first example:
// main.cu.cpp
#define CUDATOOLS_IMPLEMENTATION
#include <Core.h>

DEFINE_KERNEL(add, int x, int y) {
    printf("Kernel: %i\n", x + y);
}

int main() {
    KERNEL(add, CudaTools::Kernel::basic(1), 1, 1); // Prints 2.
    return 0;
}

# Makefile
CC := g++-10
NVCC := nvcc
CFLAGS := -Wall -std=c++17 -fopenmp -MMD
NVCC_FLAGS := -MMD -w -Xcompiler
INCLUDE := ../../
LIBS_DIR :=
LIBS_DIR_GPU := /usr/local/cuda/lib64
LIBS :=
LIBS_GPU := cuda cudart cublas
TARGET = coreKernel
SRC_DIR = .
BUILD_DIR = build
The lines above are the first few lines of the Makefile,
which are the only lines you should need to modify, consisting of
libraries and flags, as well as the name of the target.
Notes
Complex Numbers
Dealing with complex numbers is slightly complicated, since we are trying to
enforce compatibility between two systems and several different
libraries which may not have the right support. We create a simple
barebones host- and device-compatible complex number class modeled on
cuComplex.h, but with proper C++ operator
overloading and class structure. However, while the underlying data
structure is identical to all other complex number structures, there is
a lot of type-casting done under the hood to get cuBLAS and Eigen
to work well together, while maintaining one 'unified' complex type.
As a result, there could be some issues and missing functionality
with this at the moment. For now, it's recommended to use the given
complex64 and complex128 types, which should
properly adapt and work.
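The idea of a barebones complex type with operator overloading and a two-value layout can be sketched as below. This is not the library's complex64 or complex128 implementation: the ComplexSketch template and the SKETCH_HD macro are hypothetical names, and SKETCH_HD stands in for the library's HD macro so that the sketch compiles with or without nvcc.

```cpp
#include <complex>
#include <type_traits>

// Expands to host/device qualifiers under nvcc, and to nothing otherwise.
#if defined(__CUDACC__)
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

template <typename T>
struct ComplexSketch {
    T re, im; // Two adjacent values: real part, then imaginary part.

    SKETCH_HD ComplexSketch(T r = T(0), T i = T(0)) : re(r), im(i) {}

    SKETCH_HD ComplexSketch operator+(const ComplexSketch& o) const {
        return {re + o.re, im + o.im};
    }
    SKETCH_HD ComplexSketch operator*(const ComplexSketch& o) const {
        // (a + bi)(c + di) = (ac - bd) + (ad + bc)i
        return {re * o.re - im * o.im, re * o.im + im * o.re};
    }
    SKETCH_HD bool operator==(const ComplexSketch& o) const {
        return re == o.re && im == o.im;
    }
};

// The layout checks below are what make pointer casts between complex
// representations plausible in the first place.
static_assert(sizeof(ComplexSketch<float>) == sizeof(std::complex<float>),
              "expected the same size as std::complex<float>");
static_assert(std::is_standard_layout<ComplexSketch<double>>::value,
              "expected a standard-layout type");
```

A shared layout is only a precondition for the casts mentioned in this section. Whether a given cast is valid still depends on the library being called.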