Commit 7b45147

WIP

neon60 committed Sep 2, 2024
1 parent 7eca042 commit 7b45147
Showing 4 changed files with 229 additions and 124 deletions.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -30,7 +30,7 @@ On non-AMD platforms, like NVIDIA, HIP provides header files required to support
:::{grid-item-card} Conceptual

* {doc}`./understand/programming_model`
* {doc}`./understand/programming_interface`
* {doc}`./understand/hip_runtime_api`
* {doc}`./understand/hardware_implementation`
* {doc}`./understand/amd_clr`

2 changes: 1 addition & 1 deletion docs/sphinx/_toc.yml.in
@@ -16,7 +16,7 @@ subtrees:
- caption: Conceptual
entries:
- file: understand/programming_model
- file: understand/programming_interface
- file: understand/hip_runtime_api
- file: understand/hardware_implementation
- file: understand/amd_clr

227 changes: 227 additions & 0 deletions docs/understand/hip_runtime_api.rst
@@ -0,0 +1,227 @@
.. meta::
:description: This chapter describes the HIP runtime API and the compilation workflow of the HIP compilers.
:keywords: AMD, ROCm, HIP, CUDA, HIP runtime API

.. _hip_runtime_api_understand:

*******************************************************************************
HIP runtime API
*******************************************************************************

The HIP runtime API is the programming interface that provides C and C++
functions for event, stream, device, and memory management. On the AMD
platform, the HIP runtime uses the :doc:`Common Language Runtime (CLR) <hip:understand/amd_clr>`,
while on the NVIDIA platform the HIP runtime is only a thin layer over the
CUDA runtime or driver API.

- **CLR** contains the source code for AMD's compute language runtimes:
  ``HIP`` and ``OpenCL™``. CLR includes the implementation of the ``HIP``
  language on the AMD platform
  (`hipamd <https://github.com/ROCm/clr/tree/develop/hipamd>`_) and the
  Radeon Open Compute Common Language Runtime (rocclr). rocclr is a virtual
  device interface through which the HIP runtime interacts with different
  backends, such as ROCr on Linux or PAL on Windows. CLR also includes the
  implementation of `OpenCL <https://github.com/ROCm/clr/tree/develop/opencl>`_,
  which likewise interacts with ROCr and PAL.
- **CUDA runtime** is built on top of the CUDA driver API, a lower-level C
  API. For further information about the CUDA driver and runtime APIs, check
  :doc:`hip:how-to/hip_porting_driver_api`. On non-AMD platforms, the HIP
  runtime determines whether CUDA is available and can be used. If so,
  ``HIP_PLATFORM`` is set to ``nvidia`` and the underlying CUDA path is used.
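As a sketch of what the runtime API looks like in practice, the following host
code uses a few common calls for device query, memory management, and data
transfer. Error checking is omitted for brevity; this is an illustration, not
a complete program:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    // Query the available devices and select the first one.
    int deviceCount = 0;
    hipGetDeviceCount(&deviceCount);
    if (deviceCount == 0) {
        std::printf("no HIP device found\n");
        return 1;
    }
    hipSetDevice(0);

    // Allocate device memory and copy host data to it.
    std::vector<float> host(256, 1.0f);
    float* device_ptr = nullptr;
    hipMalloc(&device_ptr, host.size() * sizeof(float));
    hipMemcpy(device_ptr, host.data(), host.size() * sizeof(float),
              hipMemcpyHostToDevice);

    // ... launch kernels operating on device_ptr ...

    hipFree(device_ptr);
    return 0;
}
```

On the NVIDIA platform the same source compiles against the CUDA runtime
underneath, without source changes.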

The interaction of the different runtimes is shown in the following figure.

.. figure:: ../data/understand/programming_interface/runtimes.svg

.. note::

The CUDA specific headers can be found in the `hipother repository <https://github.com/ROCm/hipother>`_.

HIP compilers
=============

The HIP runtime API and the HIP C++ extensions are available through the HIP
compilers. On the AMD platform, ROCm currently provides two compiler
interfaces: ``hipcc`` and ``amdclang++``. The ``hipcc`` command-line interface
aims to provide a more familiar user interface to users who are experienced in
CUDA but relatively new to the ROCm/HIP development environment.
``amdclang++``, on the other hand, provides a user interface identical to the
clang++ compiler (for further details, check :doc:`llvm <llvm-project:index>`).
On the NVIDIA platform, ``hipcc`` invokes the locally installed ``NVCC``
compiler, while on the AMD platform it invokes ``amdclang++``.

.. Need to update the link later.
For AMD compiler options, check the :doc:`GPU compiler option page <llvm-project:reference/rocmcc>`.

HIP compilation workflow
------------------------

The source code compiled with the HIP compilers is separated into device code
and host code. The HIP compilers:

* Compile the device code into an assembly.
* Modify the host code by replacing the ``<<<...>>>`` syntax introduced in
  kernel launches with the necessary runtime function calls to load and launch
  each compiled kernel.

``NVCC`` and ``amdclang++`` target different architectures and use different
code object formats: ``NVCC`` produces ``cubin`` or ``PTX`` files, while
``amdclang++`` produces code objects in the ``hsaco`` format.

For an example of compiling from the command line, check the :ref:`SAXPY tutorial <compiling_on_the_command_line>`.
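As an illustration (the file name and GPU architecture below are hypothetical),
a single HIP source file can be compiled with either compiler interface:

```shell
# hipcc dispatches to amdclang++ on the AMD platform and to NVCC on the
# NVIDIA platform.
hipcc saxpy.cpp -o saxpy

# Equivalent direct invocation on the AMD platform; -x hip tells the compiler
# to treat the source as HIP, and --offload-arch selects the target GPU.
amdclang++ -x hip saxpy.cpp --offload-arch=gfx90a -o saxpy
```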

.. _driver_api_understand:

Driver API
===========

The driver API offers developers low-level control over GPU operations,
enabling them to manage GPU resources, load and launch kernels, and handle
memory explicitly. In HIP, the driver API is part of the runtime API, while
CUDA's driver API is separate from the CUDA runtime API.

One significant advantage of the driver API is its ability to dynamically load
and manage code objects, which is particularly useful for applications that need
to generate or modify kernels at runtime. This flexibility allows for more
sophisticated and adaptable GPU programming.

Unlike the runtime API, the driver API does not automatically handle tasks such
as context creation and kernel loading. While the runtime API is more convenient
and easier to use for most applications, the driver API provides greater control
and can be more efficient for complex or performance-critical applications.

Using the driver API can result in longer development times due to the need for
more detailed code and explicit management. However, the actual runtime
performance can be similar to or even better than the runtime API, depending on
how well the application is optimized.

For further details, check :ref:`porting_driver_api`, and :ref:`driver_api_reference`.
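In HIP, this driver-level control is exposed through the module API. The
following sketch loads a code object at runtime and launches a kernel from it;
the file name and kernel name are hypothetical, and error checking is omitted:

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

// Launch a kernel from a code object loaded at runtime.
void launch_from_module(float* device_data, size_t n) {
    hipModule_t module;
    hipFunction_t kernel;

    // Load a compiled code object (e.g. an hsaco file) and look up a kernel.
    hipModuleLoad(&module, "my_kernels.hsaco");
    hipModuleGetFunction(&kernel, module, "scale_kernel");

    // Pack the kernel arguments and launch explicitly.
    void* args[] = {&device_data, &n};
    hipModuleLaunchKernel(kernel,
                          /*gridDim*/ 256, 1, 1,
                          /*blockDim*/ 64, 1, 1,
                          /*sharedMemBytes*/ 0, /*stream*/ nullptr,
                          args, /*extra*/ nullptr);

    hipModuleUnload(module);
}
```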

Execution Control
=================

Device memory
=============

Device memory exists on the device (for example, a GPU) in video random access
memory (VRAM). Recent architectures use graphics double data rate (GDDR)
synchronous dynamic random-access memory (SDRAM), such as GDDR6, or
high-bandwidth memory (HBM), such as HBM2e.

Global Memory
--------------

Read-write storage visible to all threads in a given grid. There are
specialized versions of global memory with different usage semantics, which
are typically backed by the same hardware that stores global memory.

Shared Memory
-------------

Read-write storage visible to all the threads in a given block.

Local or per-thread memory
--------------------------

Read-write storage visible only to the threads defining the given variables,
also called per-thread memory. The size of a block for a given kernel, and
thereby the number of concurrent warps, is limited by local memory usage.
This relates to an important aspect: occupancy. This is the default memory
namespace.

Constant Memory
---------------

Read-only storage visible to all threads in a given grid. It is a limited
segment of global memory with queryable size.
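The address spaces described above can be sketched in a single kernel. The
kernel and variable names here are hypothetical:

```cpp
#include <hip/hip_runtime.h>

// Constant memory: read-only, visible to all threads in the grid.
__constant__ float scale_factor;

__global__ void memory_spaces_demo(const float* in, float* out) {
    // Shared memory: read-write, visible to all threads of this block.
    __shared__ float tile[64];

    // Local (per-thread) memory: the default namespace for variables.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Global memory: read-write, visible to all threads in the grid.
    tile[threadIdx.x] = in[idx];
    __syncthreads();

    out[idx] = tile[threadIdx.x] * scale_factor;
}
```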

Texture Memory
--------------------------

Read-only storage visible to all threads in a given grid and accessible
through additional APIs.

Surface Memory
--------------------------

A read-write version of texture memory.

Memory Management
=================

Managed Memory
--------------

In conventional architectures, CPUs and GPUs have dedicated memory like Random
Access Memory (RAM) and Video Random Access Memory (VRAM). This architectural
design, while effective, can be limiting in terms of memory capacity and
bandwidth, as continuous memory copying is required to allow the processors to
access the appropriate data. New architectural features like Heterogeneous
System Architectures (HSA) and Unified Memory (UM) help avoid these limitations
and promise increased efficiency and innovation.
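A minimal sketch of unified memory usage with ``hipMallocManaged``, where a
single allocation is accessible from both host and device (error checking
omitted):

```cpp
#include <hip/hip_runtime.h>

int main() {
    // A single managed allocation, accessible from host and device alike.
    float* data = nullptr;
    hipMallocManaged(&data, 256 * sizeof(float));

    // Initialize on the host without an explicit hipMemcpy.
    for (int i = 0; i < 256; ++i) data[i] = 1.0f;

    // ... launch kernels that read and write data directly ...

    hipDeviceSynchronize();
    hipFree(data);
    return 0;
}
```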

Stream Ordered Memory Allocator
--------------------------------

Stream Ordered Memory Allocator (SOMA) provides an asynchronous memory
allocation mechanism with stream-ordering semantics. You can use SOMA to
allocate and free memory in stream order, which ensures that all asynchronous
accesses occur between the stream executions of allocation and deallocation.
Compliance with stream order prevents use-before-allocation and use-after-free
errors, which helps avoid undefined behavior.
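A minimal sketch of stream-ordered allocation using the
``hipMallocAsync``/``hipFreeAsync`` pair (error checking omitted):

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

void stream_ordered_work(hipStream_t stream, size_t n) {
    float* buffer = nullptr;

    // The allocation is ordered in the stream: work enqueued on the same
    // stream after this point may safely use the buffer.
    hipMallocAsync(&buffer, n * sizeof(float), stream);

    // ... enqueue kernels and async copies using buffer on this stream ...

    // Deallocation is also stream ordered, so operations enqueued before
    // this point cannot observe a freed buffer.
    hipFreeAsync(buffer, stream);
}
```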


Virtual Memory Management
--------------------------------

Memory management is important when creating high-performance applications in
the HIP ecosystem. Both allocating and copying memory can result in bottlenecks,
which can significantly impact performance.

Global memory allocation in HIP uses the C language style allocation function.
This works fine for simple cases but can cause problems if your memory needs
change. If you need to increase the size of your memory, you must allocate a
second larger buffer and copy the data to it before you can free the original
buffer. This increases overall memory usage and causes unnecessary ``memcpy``
calls. Another solution is to allocate a larger buffer than you initially need.
However, this isn't an efficient way to handle resources and doesn't solve the
issue of reallocation when the extra buffer runs out.

Virtual memory management solves these memory management problems. It helps to
reduce memory usage and unnecessary ``memcpy`` calls.
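The virtual memory workflow can be sketched as follows: reserve a virtual
address range, create a physical allocation, map it, and set access rights.
This is an outline assuming the HIP virtual memory management API, with error
checking omitted:

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

void vmm_sketch(size_t size) {
    // Describe physical memory on device 0.
    hipMemAllocationProp prop = {};
    prop.type = hipMemAllocationTypePinned;
    prop.location.type = hipMemLocationTypeDevice;
    prop.location.id = 0;

    // Reserve a virtual address range, then back it with physical memory.
    void* va = nullptr;
    hipMemGenericAllocationHandle_t handle;
    hipMemAddressReserve(&va, size, 0, nullptr, 0);
    hipMemCreate(&handle, size, &prop, 0);
    hipMemMap(va, size, 0, handle, 0);

    // Grant the device read/write access to the mapped range.
    hipMemAccessDesc access = {};
    access.location = prop.location;
    access.flags = hipMemAccessFlagsProtReadWrite;
    hipMemSetAccess(va, size, &access, 1);

    // ... use va like ordinary device memory; growing the buffer later
    //     only requires mapping new physical backing, not a memcpy ...

    hipMemUnmap(va, size);
    hipMemRelease(handle);
    hipMemAddressFree(va, size);
}
```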

Texture Management
----------------------

* Global enum and defines
* Initialization and Version
* Error Handling

Stream Management
=================

Stream management refers to the mechanisms that allow developers to control
the order and concurrency of kernel executions and memory transfers on the
GPU. A stream is a linear sequence of execution belonging to a specific GPU.
Different streams can execute operations concurrently on the same GPU, which
can lead to better utilization of the device.

Stream management allows developers to optimize GPU workloads by enabling
concurrent execution of tasks, overlapping computation with memory transfers,
and controlling the order of operations. Stream priorities can also be set,
which gives developers extra flexibility.
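A short sketch of issuing work on two streams so the copies may overlap,
assuming the buffers have already been allocated (error checking omitted):

```cpp
#include <hip/hip_runtime.h>
#include <cstddef>

void overlap_copies(float* dst0, const float* src0,
                    float* dst1, const float* src1, size_t bytes) {
    hipStream_t s0, s1;
    hipStreamCreate(&s0);
    // Streams can also be created with an explicit priority.
    hipStreamCreateWithPriority(&s1, hipStreamDefault, /*priority*/ 0);

    // The two copies are enqueued on different streams and may overlap.
    hipMemcpyAsync(dst0, src0, bytes, hipMemcpyHostToDevice, s0);
    hipMemcpyAsync(dst1, src1, bytes, hipMemcpyHostToDevice, s1);

    // Wait for both streams to finish before returning.
    hipStreamSynchronize(s0);
    hipStreamSynchronize(s1);
    hipStreamDestroy(s0);
    hipStreamDestroy(s1);
}
```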


Callback Activity APIs
======================

Graph Management
=================

OpenGL Interop
================

Surface Object
=================

For further details, check `HIP Runtime API Reference <doxygen/html/index.html>`_.
122 changes: 0 additions & 122 deletions docs/understand/programming_interface.rst

This file was deleted.
