From 7b45147cd38033d7e6039fa76d7b23ddd839e25c Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Thu, 22 Aug 2024 22:30:06 +0200 Subject: [PATCH] WIP --- docs/index.md | 2 +- docs/sphinx/_toc.yml.in | 2 +- docs/understand/hip_runtime_api.rst | 227 ++++++++++++++++++++++ docs/understand/programming_interface.rst | 122 ------------ 4 files changed, 229 insertions(+), 124 deletions(-) create mode 100644 docs/understand/hip_runtime_api.rst delete mode 100644 docs/understand/programming_interface.rst diff --git a/docs/index.md b/docs/index.md index b79d342b86..6f8cb91997 100644 --- a/docs/index.md +++ b/docs/index.md @@ -30,7 +30,7 @@ On non-AMD platforms, like NVIDIA, HIP provides header files required to support :::{grid-item-card} Conceptual * {doc}`./understand/programming_model` -* {doc}`./understand/programming_interface` +* {doc}`./understand/hip_runtime_api` * {doc}`./understand/hardware_implementation` * {doc}`./understand/amd_clr` diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index d762dd909c..cf5a4f5a60 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -16,7 +16,7 @@ subtrees: - caption: Conceptual entries: - file: understand/programming_model - - file: understand/programming_interface + - file: understand/hip_runtime_api - file: understand/hardware_implementation - file: understand/amd_clr diff --git a/docs/understand/hip_runtime_api.rst b/docs/understand/hip_runtime_api.rst new file mode 100644 index 0000000000..87cf2d03be --- /dev/null +++ b/docs/understand/hip_runtime_api.rst @@ -0,0 +1,227 @@ +.. meta:: + :description: This chapter describes the HIP runtime API and the compilation workflow of the HIP compilers. + :keywords: AMD, ROCm, HIP, CUDA, HIP runtime API + +.. 
+_hip_runtime_api_understand:
+
+*******************************************************************************
+HIP runtime API
+*******************************************************************************
+
+The HIP runtime API is the programming interface that provides C and C++
+functions for event, stream, device, and memory management, among others. On
+the AMD platform, the HIP runtime uses the :doc:`Common Language Runtime (CLR) `,
+while on the NVIDIA platform the HIP runtime is only a thin layer over the
+CUDA runtime or driver API.
+
+- **CLR** contains the source code for AMD's compute language runtimes:
+  ``HIP`` and ``OpenCL™``. CLR includes the implementation of the ``HIP``
+  language on the AMD platform (`hipamd `_) and the Radeon Open Compute
+  Common Language Runtime (rocclr). rocclr is a virtual device interface
+  through which the HIP runtime interacts with different backends, such as
+  ROCr on Linux or PAL on Windows. (CLR also includes the implementation of
+  `OpenCL `_, which likewise interacts with ROCr and PAL.)
+- **CUDA runtime** is built on top of the CUDA driver API, a lower-level C
+  API. For further information about the CUDA driver and runtime APIs, check
+  :doc:`hip:how-to/hip_porting_driver_api`. On non-AMD platforms, the HIP
+  runtime determines whether CUDA is available and can be used. If it is,
+  ``HIP_PLATFORM`` is set to ``nvidia`` and the CUDA path is used underneath.
+
+The interactions of the different runtimes are shown in the following figure.
+
+.. figure:: ../data/understand/programming_interface/runtimes.svg
+
+.. note::
+
+  The CUDA-specific headers can be found in the `hipother repository `_.
+
+HIP compilers
+=============
+
+The HIP runtime API and HIP C++ extensions are available with HIP compilers.
+On the AMD platform, ROCm currently provides two compiler interfaces:
+``hipcc`` and ``amdclang++``. 
The ``hipcc`` command-line interface aims to provide a more
+familiar user interface to users who are experienced in CUDA but relatively
+new to the ROCm/HIP development environment. ``amdclang++``, on the other
+hand, provides a user interface identical to the clang++ compiler. (For
+further details, check :doc:`llvm `.) On the NVIDIA platform, ``hipcc``
+invokes the locally installed ``NVCC`` compiler, while on the AMD platform it
+invokes ``amdclang++``.
+
+.. Need to update the link later.
+
+For AMD compiler options, check the :doc:`GPU compiler option page `.
+
+HIP compilation workflow
+------------------------
+
+The source code compiled with HIP compilers is separated into device code and
+host code. The HIP compilers:
+
+* Compile the device code into assembly.
+* Modify the host code by replacing the ``<<<...>>>`` kernel launch syntax
+  introduced in kernels with the necessary runtime function calls to load and
+  launch each compiled kernel from the ``PTX`` code and/or ``cubin`` object.
+
+``NVCC`` and ``amdclang++`` target different architectures and use different
+code object formats: ``NVCC`` produces ``cubin`` or ``PTX`` files, while
+``amdclang++`` produces code objects in the ``hsaco`` format.
+
+For an example of compiling from the command line, check the
+:ref:`SAXPY tutorial compiling `.
+
+.. _driver_api_understand:
+
+Driver API
+==========
+
+The driver API offers developers low-level control over GPU operations,
+enabling them to manage GPU resources, load and launch kernels, and handle
+memory explicitly. In HIP, the driver API is part of the runtime API, while
+CUDA's driver API is separate from the CUDA runtime API.
+
+One significant advantage of the driver API is its ability to dynamically
+load and manage code objects, which is particularly useful for applications
+that need to generate or modify kernels at runtime. This flexibility allows
+for more sophisticated and adaptable GPU programming. 
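+
+The dynamic loading described above can be sketched with the HIP module API.
+This is a minimal sketch, not the library's reference usage: the code object
+file ``vector_add.co`` and the kernel name ``vector_add`` are hypothetical
+placeholders, and error handling is reduced to a single check.
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+   #include <cstdio>
+
+   int main() {
+       // Load a precompiled code object from disk at runtime.
+       hipModule_t module;
+       if (hipModuleLoad(&module, "vector_add.co") != hipSuccess) {
+           std::printf("failed to load code object\n");
+           return 1;
+       }
+
+       // Look up a kernel inside the loaded module by name.
+       hipFunction_t kernel;
+       hipModuleGetFunction(&kernel, module, "vector_add");
+
+       // Pack kernel arguments; the layout must match the kernel signature.
+       float* d_out = nullptr;
+       size_t n = 1024;
+       hipMalloc(&d_out, n * sizeof(float));
+       void* args[] = {&d_out, &n};
+
+       // Launch 4 blocks of 256 threads on the default stream.
+       hipModuleLaunchKernel(kernel, 4, 1, 1, 256, 1, 1,
+                             0 /*sharedMemBytes*/, nullptr /*stream*/,
+                             args, nullptr);
+       hipDeviceSynchronize();
+
+       hipFree(d_out);
+       hipModuleUnload(module);
+       return 0;
+   }
+
+On the AMD platform, such standalone code objects can be produced with the
+compiler's code object generation mode (for example, ``hipcc --genco``).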
+
+Unlike the runtime API, the driver API does not automatically handle tasks
+such as context creation and kernel loading. While the runtime API is more
+convenient and easier to use for most applications, the driver API provides
+greater control and can be more efficient for complex or
+performance-critical applications.
+
+Using the driver API can result in longer development times due to the need
+for more detailed code and explicit management. However, the actual runtime
+performance can be similar to or even better than that of the runtime API,
+depending on how well the application is optimized.
+
+For further details, check :ref:`porting_driver_api` and
+:ref:`driver_api_reference`.
+
+Execution Control
+=================
+
+Device memory
+=============
+
+Device memory exists on the device (for example, the GPU) in video random
+access memory (VRAM). Recent architectures use graphics double data rate
+(GDDR) synchronous dynamic random-access memory (SDRAM) such as GDDR6, or
+high-bandwidth memory (HBM) such as HBM2e.
+
+Global Memory
+-------------
+
+Read-write storage visible to all threads in a given grid. There are
+specialized versions of global memory with different usage semantics, which
+are typically backed by the same hardware that stores global memory.
+
+Shared Memory
+-------------
+
+Read-write storage visible to all the threads in a given block.
+
+Local or per-thread memory
+--------------------------
+
+Read-write storage visible only to the threads defining the given variables,
+also called per-thread memory. The size of a block for a given kernel, and
+thereby the number of concurrent warps, are limited by local memory usage.
+This relates to an important aspect: occupancy. This is the default memory
+namespace.
+
+Constant Memory
+---------------
+
+Read-only storage visible to all threads in a given grid. It is a limited
+segment of global memory with a queryable size. 
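+
+As an illustration, the constant memory segment can be filled from the host
+and read by every thread in the grid. This is a minimal sketch; the kernel
+and variable names (``affine``, ``coeffs``) are made up for the example.
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+   #include <cstdio>
+
+   __constant__ float coeffs[2];  // read-only for the whole grid
+
+   __global__ void affine(const float* in, float* out, size_t n) {
+       size_t i = blockIdx.x * blockDim.x + threadIdx.x;
+       if (i < n) out[i] = coeffs[0] * in[i] + coeffs[1];
+   }
+
+   int main() {
+       const size_t n = 256;
+       float h_in[n], h_out[n];
+       for (size_t i = 0; i < n; ++i) h_in[i] = float(i);
+
+       // The host fills the constant segment before the launch.
+       float h_coeffs[2] = {2.0f, 1.0f};
+       hipMemcpyToSymbol(HIP_SYMBOL(coeffs), h_coeffs, sizeof(h_coeffs));
+
+       float *d_in, *d_out;
+       hipMalloc(&d_in, n * sizeof(float));
+       hipMalloc(&d_out, n * sizeof(float));
+       hipMemcpy(d_in, h_in, n * sizeof(float), hipMemcpyHostToDevice);
+
+       affine<<<1, 256>>>(d_in, d_out, n);
+       hipMemcpy(h_out, d_out, n * sizeof(float), hipMemcpyDeviceToHost);
+
+       std::printf("%f\n", h_out[10]);  // 2 * 10 + 1 = 21
+       hipFree(d_in);
+       hipFree(d_out);
+       return 0;
+   }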
+
+Texture Memory
+--------------
+
+Read-only storage visible to all threads in a given grid and accessible
+through additional APIs.
+
+Surface Memory
+--------------
+
+A read-write version of texture memory.
+
+Memory Management
+=================
+
+Managed Memory
+--------------
+
+In conventional architectures, CPUs and GPUs have dedicated memory like
+Random Access Memory (RAM) and Video Random Access Memory (VRAM). This
+architectural design, while effective, can be limiting in terms of memory
+capacity and bandwidth, as continuous memory copying is required to allow the
+processors to access the appropriate data. New architectural features like
+Heterogeneous System Architecture (HSA) and Unified Memory (UM) help avoid
+these limitations and promise increased efficiency and innovation.
+
+Stream Ordered Memory Allocator
+-------------------------------
+
+The Stream Ordered Memory Allocator (SOMA) provides an asynchronous memory
+allocation mechanism with stream-ordering semantics. You can use SOMA to
+allocate and free memory in stream order, which ensures that all asynchronous
+accesses occur between the stream executions of allocation and deallocation.
+Compliance with stream order prevents use-before-allocation and
+use-after-free errors, which helps avoid undefined behavior.
+
+Virtual Memory Management
+-------------------------
+
+Memory management is important when creating high-performance applications
+in the HIP ecosystem. Both allocating and copying memory can result in
+bottlenecks, which can significantly impact performance.
+
+Global memory allocation in HIP uses C-style allocation functions. This
+works fine for simple cases but can cause problems if your memory needs
+change. If you need to increase the size of your memory, you must allocate a
+second, larger buffer and copy the data to it before you can free the
+original buffer. 
This increases overall memory usage and causes unnecessary ``memcpy``
+calls. Another solution is to allocate a larger buffer than you initially
+need. However, this isn't an efficient way to handle resources and doesn't
+solve the issue of reallocation when the extra buffer runs out.
+
+Virtual memory management solves these memory management problems. It helps
+to reduce memory usage and unnecessary ``memcpy`` calls.
+
+Texture Management
+------------------
+
+* Global enum and defines
+* Initialization and Version
+* Error Handling
+
+Stream Management
+=================
+
+Stream management refers to the mechanisms that allow developers to control
+the order and concurrency of kernel executions and memory transfers on the
+GPU. A stream is a linear sequence of execution belonging to a specific GPU.
+Different streams can execute operations concurrently on the same GPU, which
+can lead to better utilization of the device.
+
+Stream management allows developers to optimize GPU workloads by enabling
+concurrent execution of tasks, overlapping computation with memory
+transfers, and controlling the order of operations. Stream priorities can
+also be set, which gives developers extra flexibility.
+
+Callback Activity APIs
+======================
+
+Graph Management
+================
+
+OpenGL Interop
+==============
+
+Surface Object
+==============
+
+For further details, check `HIP Runtime API Reference `_.
\ No newline at end of file
diff --git a/docs/understand/programming_interface.rst b/docs/understand/programming_interface.rst
deleted file mode 100644
index d488351328..0000000000
--- a/docs/understand/programming_interface.rst
+++ /dev/null
@@ -1,122 +0,0 @@
-.. meta::
-  :description: This chapter describes the HIP runtime API and the compilation workflow of the HIP compilers. 
- :keywords: AMD, ROCm, HIP, CUDA, HIP runtime API - -******************************************************************************* -Programming interface -******************************************************************************* - -The programming interface document will focus on the HIP runtime API. The -runtime API provides C and C++ functions for event, stream, device and memory -managements, etc. The HIP runtime on AMD platform uses the Common Language -Runtimes (CLR), while on NVIDIA platform HIP runtime is only a thin layer over -the CUDA runtime. - -- **CLR** contains source codes for AMD's compute languages runtimes: ``HIP`` - and ``OpenCLâ„¢``. CLR includes the implementation of ``HIP`` language on the AMD - platform `hipamd `_ and the - Radeon Open Compute Common Language Runtime (rocclr). rocclr is a virtual device - interface, that HIP runtime interact with different backends such as ROCr on - Linux or PAL on Windows. (CLR also include the implementation of `OpenCL `_, - while it's interact with ROCr and PAL) -- **CUDA runtime** is built over the CUDA driver API (lower-level C API). For - further information about the CUDA driver and runtime API, check the :doc:`hip:how-to/hip_porting_driver_api`. - On non-AMD platform, HIP runtime determines, if CUDA is available and can be - used. If available, HIP_PLATFORM is set to ``nvidia`` and underneath CUDA path - is used. - -.. I am not sure we should share this. -The different runtimes interactions are represented on the following figure. - -.. figure:: ../data/understand/programming_interface/runtimes.svg - -HIP compilers -============= - -The HIP runtime API and HIP C++ extensions are available with HIP compilers. On -AMD platform ROCm currently provides two compiler interfaces: ``hipcc`` and -``amdclang++``. The ``hipcc`` command-line interface aims to provide a more -familiar user interface to users who are experienced in CUDA but relatively new -to the ROCm/HIP development environment. 
On the other hand, ``amdclang++`` -provides a user interface identical to the clang++ compiler. (For further -details, check `llvm `_). On NVIDIA platform ``hipcc`` -invoke the locally installed ``NVCC`` compiler, while on AMD platform it's -invoke ``amdclang++``. - -.. Need to update the link later. -For AMD compiler options, check the `GPU compiler option page `_. - -HIP compilation workflow ------------------------- - -The source code compiled with HIP compilers are separated to device code and -host. The HIP compilers: - -* Compiling the device code into an assembly. -* Modify the host code by replacing the ``<<<...>>>`` syntax introduced in - kernels by the necessary CUDA runtime function calls to load and launch each - compiled kernel from the ``PTX`` code and/or ``cubin`` object. - -``NVCC`` and ``amdclang++`` target different architectures and use different -code object formats: ``NVCC`` is ``cubin`` or ``ptx`` files, while the -``amdclang++`` path is the ``hsaco`` format. - -For example of compiling from command line, check the :ref:` SAXPY tutorial compiling `. - -.. _hip_runtime_api_understand: - -HIP runtime API -=============== - -For the AMD ROCm platform, HIP provides headers and a runtime library built on -top of HIP-Clang compiler in the repository -:doc:`Common Language Runtime (CLR) `. The HIP runtime -implements HIP streams, events, and memory APIs, and is an object library that -is linked with the application. The source code for all headers and the library -implementation is available on GitHub. - -For the NVIDIA CUDA platform, HIP provides headers that translate from the -HIP runtime API to the CUDA runtime API. The host-side contains mostly inlined -wrappers or even just preprocessor defines, with no additional overhead. -The device-side code is compiled with ``nvcc``, just like normal CUDA kernels, -and therefore one can expect the same performance as if directly coding in CUDA. 
-The CUDA specific headers can be found in the `hipother repository `_. - -For further details, check `HIP Runtime API Reference `_. - -.. _driver_api_understand: - -Driver API -=========== - -The driver API offers developers low-level control over GPU operations, enabling -them to manage GPU resources, load and launch kernels, and handle memory -explicitly. This API is more flexible and powerful compared to the runtime API, -but it requires a deeper understanding of the GPU architecture and more detailed -management. - -One significant advantage of the driver API is its ability to dynamically load -and manage code objects, which is particularly useful for applications that need -to generate or modify kernels at runtime. This flexibility allows for more -sophisticated and adaptable GPU programming. - -Memory management with the driver API involves explicit allocation, -de-allocation, and data transfer operations. This level of control can lead to -optimized performance for specific applications, as developers can fine-tune -memory usage. However, it also demands careful handling to avoid memory leaks -and ensure efficient memory utilization. - -Unlike the runtime API, the driver API does not automatically handle tasks such -as context creation and kernel loading. While the runtime API is more convenient -and easier to use for most applications, the driver API provides greater control -and can be more efficient for complex or performance-critical applications. - -Using the driver API can result in longer development times due to the need for -more detailed code and explicit management. However, the actual runtime -performance can be similar to or even better than the runtime API, depending on -how well the application is optimized. - -While AMD HIP does not have a direct equivalent to CUDA's Driver API, it -supports driver API functionalities, such as managing contexts, modules, memory, -and driver entry point access. 
These features are detailed in -:ref:`porting_driver_api`, and described in :ref:`driver_api_reference`.