Merged
2 changes: 1 addition & 1 deletion cuda_bindings/docs/source/conf.py
@@ -32,7 +32,7 @@
# ones.
extensions = ["sphinx.ext.autodoc", "sphinx.ext.napoleon", "myst_nb", "enum_tools.autoenum"]

-jupyter_execute_notebooks = "force"
+nb_execution_mode = "off"
numfig = True

# Add any paths that contain templates here, relative to this directory.
24 changes: 12 additions & 12 deletions cuda_bindings/docs/source/overview.md
@@ -48,7 +48,7 @@
API](https://docs.nvidia.com/cuda/cuda-driver-api/index.html) and
Python package. In this example, you copy data from the host to device. You need
[NumPy](https://numpy.org/doc/stable/contents.html) to store data on the host.

-```{code-cell} python
+```python
from cuda.bindings import driver, nvrtc
import numpy as np
```
@@ -58,7 +58,7 @@
example is provided.
In a future release, this may automatically raise exceptions using a Python
object model.

-```{code-cell} python
+```python
def _cudaGetErrorEnum(error):
if isinstance(error, driver.CUresult):
err, name = driver.cuGetErrorName(error)
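
# A checkCudaErrors wrapper is typically built on top of _cudaGetErrorEnum.
# The sketch below is illustrative, assuming each binding call returns a
# tuple whose first element is the status code (with .value == 0 on success):
def checkCudaErrors(result):
    if result[0].value:
        raise RuntimeError(
            "CUDA error code={}({})".format(result[0].value, _cudaGetErrorEnum(result[0]))
        )
    if len(result) == 1:
        return None
    if len(result) == 2:
        return result[1]
    return result[1:]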
@@ -86,7 +86,7 @@
Python that requires some understanding of CUDA C++. For more information, see
[An Even Easier Introduction to
CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/).

-```{code-cell} python
+```python
saxpy = """\
extern "C" __global__
void saxpy(float a, float *x, float *y, float *out, size_t n)
@@ -108,7 +108,7 @@
In the following code example, the Driver API is initialized so that the NVIDIA driver
and GPU are accessible. Next, the GPU is queried for its compute capability. Finally,
the program is compiled to target the local compute capability architecture with FMAD enabled.

-```{code-cell} python
+```python
# Initialize CUDA Driver API
checkCudaErrors(driver.cuInit(0))
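
# NVRTC's target architecture is derived from the device's compute
# capability. A hypothetical helper, for illustration only (the major/minor
# values would come from cuDeviceGetAttribute on the chosen device):
def arch_flag(major, minor):
    # e.g. (8, 6) -> b"--gpu-architecture=compute_86"
    return bytes(f"--gpu-architecture=compute_{major}{minor}", "ascii")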

@@ -138,7 +138,7 @@
context. CUDA contexts are analogous to host processes for the device. In the
following code example, a handle for compute device 0 is passed to
`cuCtxCreate` to designate that GPU for context creation.

-```{code-cell} python
+```python
# Create context
context = checkCudaErrors(driver.cuCtxCreate(0, cuDevice))
```
@@ -148,7 +148,7 @@
module. A module is analogous to dynamically loaded libraries for the device.
After loading into the module, extract a specific kernel with
`cuModuleGetFunction`. It is not uncommon for multiple kernels to reside in PTX.

-```{code-cell} python
+```python
# Load PTX as module data and retrieve function
ptx = np.char.array(ptx)
# Note: Incompatible --gpu-architecture would be detected here
@@ -161,7 +161,7 @@
application performance, you can input data on the device to eliminate data
transfers. For completeness, this example shows how you would transfer data to
and from the device.

-```{code-cell} python
+```python
NUM_THREADS = 512 # Threads per block
NUM_BLOCKS = 32768 # Blocks per grid
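
# Host-side setup usually follows, in the spirit of the surrounding example
# (a sketch; 512 * 32768 is NUM_THREADS * NUM_BLOCKS):
a = np.array([2.0], dtype=np.float32)       # SAXPY coefficient
n = np.array(512 * 32768, dtype=np.uint32)  # total element count
bufferSize = n * a.itemsize                 # bytes per vector

hX = np.random.rand(n).astype(dtype=np.float32)
hY = np.random.rand(n).astype(dtype=np.float32)
hOut = np.zeros(n).astype(dtype=np.float32)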

@@ -184,7 +184,7 @@
Python doesn’t have a natural concept of pointers, yet `cuMemcpyHtoDAsync` expects
`void*`. Therefore, `XX.ctypes.data` retrieves the pointer value associated with
`XX`.
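
The `ctypes.data` attribute can be inspected with plain NumPy, independent of any GPU; a quick illustration:

```python
import ctypes
import numpy as np

x = np.zeros(4, dtype=np.float32)

# .ctypes.data is the integer address of the array's first element
ptr = x.ctypes.data
assert isinstance(ptr, int)

# Reading through the raw pointer returns the same bytes NumPy sees
view = (ctypes.c_float * 4).from_address(ptr)
assert list(view) == [0.0, 0.0, 0.0, 0.0]
```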

-```{code-cell} python
+```python
dXclass = checkCudaErrors(driver.cuMemAlloc(bufferSize))
dYclass = checkCudaErrors(driver.cuMemAlloc(bufferSize))
dOutclass = checkCudaErrors(driver.cuMemAlloc(bufferSize))
@@ -209,7 +209,7 @@
Like `cuMemcpyHtoDAsync`, `cuLaunchKernel` expects `void**` in the argument list. In
the earlier code example, it creates `void**` by collecting the `void*` value of each
individual argument into its own contiguous memory.

-```{code-cell} python
+```python
# The following code example is not intuitive
# Subject to change in a future release
dX = np.array([int(dXclass)], dtype=np.uint64)
@@ -222,7 +222,7 @@
args = np.array([arg.ctypes.data for arg in args], dtype=np.uint64)

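Because this packing is plain NumPy, it can be exercised without a GPU; a small illustration, using a made-up integer as a stand-in for a device pointer:

```python
import numpy as np

a = np.array([2.0], dtype=np.float32)
n = np.array(16, dtype=np.uint32)
dX = np.array([0xDEADBEEF], dtype=np.uint64)  # stand-in device pointer value

args = [a, dX, n]
# Each entry of `packed` is the host address of one argument; the packed
# array itself is what the launch call dereferences as void**.
packed = np.array([arg.ctypes.data for arg in args], dtype=np.uint64)

assert packed[0] == a.ctypes.data
assert packed.dtype == np.uint64
```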
Now the kernel can be launched:

-```{code-cell} python
+```python
checkCudaErrors(driver.cuLaunchKernel(
kernel,
NUM_BLOCKS, # grid x dim
@@ -251,7 +251,7 @@
stream are serialized. After the call to transfer data back to the host is
executed, `cuStreamSynchronize` is used to halt CPU execution until all operations
in the designated stream are finished.

-```{code-cell} python
+```python
# Assert values are same after running kernel
hZ = a * hX + hY
if not np.allclose(hOut, hZ):
@@ -261,7 +261,7 @@
if not np.allclose(hOut, hZ):
Perform verification of the data to ensure correctness and finish the code with
memory cleanup.

-```{code-cell} python
+```python
checkCudaErrors(driver.cuStreamDestroy(stream))
checkCudaErrors(driver.cuMemFree(dXclass))
checkCudaErrors(driver.cuMemFree(dYclass))