Update memory management page

neon60 committed Sep 30, 2024
1 parent 79fd95d commit 2e4c7a0

Showing 2 changed files with 150 additions and 123 deletions.
271 changes: 149 additions & 122 deletions docs/how-to/hip_runtime_api/memory_management.rst
@@ -16,6 +16,51 @@

its own distinct memory. Kernels execute mainly on device memory; the runtime
offers functions for allocating, deallocating, and copying device memory, along
with transferring data between host and device memory.
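The following is a minimal sketch of this lifecycle (the kernel and buffer
names are illustrative, and error checking is omitted for brevity):

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <vector>

   __global__ void increment(float* data, size_t n) {
       size_t i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) data[i] += 1.0f;
   }

   int main() {
       constexpr size_t n = 1024;
       std::vector<float> host(n, 0.0f);

       float* device = nullptr;
       hipMalloc(&device, n * sizeof(float)); // allocate device (global) memory
       // transfer input from host to device
       hipMemcpy(device, host.data(), n * sizeof(float), hipMemcpyHostToDevice);

       increment<<<(n + 255) / 256, 256>>>(device, n); // kernel works on device memory

       // transfer results back and deallocate
       hipMemcpy(host.data(), device, n * sizeof(float), hipMemcpyDeviceToHost);
       hipFree(device);
   }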

Device memory
================================================================================

Device memory exists on the device, e.g. on GPUs in the video random access
memory (VRAM), and is accessible by the kernels operating on the device. Recent
architectures use graphics double data rate (GDDR) synchronous dynamic
random-access memory (SDRAM) such as GDDR6, or high-bandwidth memory (HBM) such
as HBM2e. Device memory can be allocated as global, constant, texture, or
surface memory.

Global memory
--------------------------------------------------------------------------------

Read-write storage visible to all threads on a given device. There are
specialized versions of global memory with different usage semantics which are
typically backed by the same hardware, but can use different caching paths.

Constant memory
--------------------------------------------------------------------------------

Read-only storage visible to all threads on a given device. It is a limited
segment backed by device memory with a queryable size, and it must be set by
the host before kernel execution. Constant memory provides the best
performance when all threads within a warp access the same address.
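A minimal sketch of setting and using constant memory (illustrative names,
error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __constant__ float coeffs[4]; // constant memory, visible to all threads

   __global__ void transform(float* data, size_t n) {
       size_t i = blockIdx.x * blockDim.x + threadIdx.x;
       // all threads in a warp read the same address: the fast path
       if (i < n) data[i] = coeffs[0] * data[i] + coeffs[1];
   }

   int main() {
       float hostCoeffs[4] = {2.0f, 0.5f, 0.0f, 0.0f};
       // the host must set constant memory before launching the kernel
       hipMemcpyToSymbol(HIP_SYMBOL(coeffs), hostCoeffs, sizeof(hostCoeffs));
       // ... allocate `data` with hipMalloc and launch transform as usual
   }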

Texture memory
--------------------------------------------------------------------------------

Read-only storage visible to all threads on a given device and accessible
through additional APIs. It originates from graphics APIs and provides
performance benefits when memory is accessed in patterns where the addresses
are close to each other in a 2D representation of the memory.

The texture management module of the HIP runtime API contains the functions
for managing texture memory.
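A minimal sketch of creating and sampling a texture object (illustrative names
and sizes, error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void sample(hipTextureObject_t tex, float* out, int w, int h) {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x < w && y < h) out[y * w + x] = tex2D<float>(tex, x, y);
   }

   int main() {
       int w = 256, h = 256;

       // 2D array in device memory backing the texture
       hipArray_t array;
       hipChannelFormatDesc channels = hipCreateChannelDesc<float>();
       hipMallocArray(&array, &channels, w, h);

       hipResourceDesc resDesc{};
       resDesc.resType = hipResourceTypeArray;
       resDesc.res.array.array = array;

       hipTextureDesc texDesc{};
       texDesc.addressMode[0] = hipAddressModeClamp;
       texDesc.addressMode[1] = hipAddressModeClamp;
       texDesc.filterMode = hipFilterModePoint;
       texDesc.readMode = hipReadModeElementType;

       hipTextureObject_t tex;
       hipCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

       // ... fill the array with hipMemcpy2DToArray and launch sample, then:
       hipDestroyTextureObject(tex);
       hipFreeArray(array);
   }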

Surface memory
--------------------------------------------------------------------------------

A read-write version of texture memory, which can be useful for applications
that require direct manipulation of 1D, 2D, or 3D ``hipArray_t`` data.

The surface objects module of the HIP runtime API contains the functions for
creating, destroying, reading, and writing surface memory.
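A minimal sketch of creating a surface object and reading and writing through
it (illustrative names and sizes, error checking omitted; the x coordinate of
``surf2Dread``/``surf2Dwrite`` is assumed to be a byte offset, following the
CUDA-compatible convention):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void invert(hipSurfaceObject_t surf, int w, int h) {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x < w && y < h) {
           float v;
           surf2Dread(&v, surf, x * sizeof(float), y);       // read-write access
           surf2Dwrite(1.0f - v, surf, x * sizeof(float), y);
       }
   }

   int main() {
       hipArray_t array;
       hipChannelFormatDesc channels = hipCreateChannelDesc<float>();
       // the backing array must be created with surface load/store enabled
       hipMallocArray(&array, &channels, 256, 256, hipArraySurfaceLoadStore);

       hipResourceDesc resDesc{};
       resDesc.resType = hipResourceTypeArray;
       resDesc.res.array.array = array;

       hipSurfaceObject_t surf;
       hipCreateSurfaceObject(&surf, &resDesc);

       // ... launch invert, then clean up:
       hipDestroySurfaceObject(surf);
       hipFreeArray(array);
   }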

Host memory
================================================================================

@@ -168,19 +213,6 @@

The following example shows how to use pinned memory in HIP. The pinned memory
allocation is controlled by different flags, which are described in detail at
:ref:`memory_allocation_flags`.
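Below is a minimal sketch: pinned memory is allocated with
:cpp:func:`hipHostMalloc` and released with :cpp:func:`hipHostFree`
(illustrative sizes, error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   int main() {
       constexpr size_t n = 1 << 20;

       float* pinned = nullptr;
       // pinned (page-locked) host memory
       hipHostMalloc(&pinned, n * sizeof(float), hipHostMallocDefault);

       float* device = nullptr;
       hipMalloc(&device, n * sizeof(float));

       // copies from pinned memory can be performed by DMA and are typically
       // faster than copies from pageable memory allocated with plain malloc
       hipMemcpy(device, pinned, n * sizeof(float), hipMemcpyHostToDevice);

       hipFree(device);
       hipHostFree(pinned); // released with hipHostFree, not free()
   }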


.. _memory_allocation_flags:

Memory allocation flags of pinned memory
@@ -193,88 +225,128 @@

host memory:
not just the one on which the allocation is made.
* ``hipHostMallocMapped``: Map the allocation into the address space for
the current device, and the device pointer can be obtained with
:cpp:func:`hipHostGetDevicePointer`.
* ``hipHostMallocNumaUser``: The flag to allow host memory allocation to
  follow the NUMA policy set by the user. The target of the NUMA policy is to
  select a CPU that is closest to each GPU; the NUMA distance is a measure of
  how far apart a GPU and a CPU are.
* ``hipHostMallocWriteCombined``: Allocates the memory as write-combined. On
  some system configurations, write-combined allocation may be transferred
  faster across the PCI Express bus; however, it can have low read efficiency
  on most CPUs. It's a good option for data transfer from host to device via
  mapped pinned memory.
* ``hipHostMallocCoherent``: Allocate coherent memory. Overrides
  ``HIP_COHERENT_HOST_ALLOC`` for the specific allocation. For further
  details, check :ref:`coherency_controls`.
* ``hipHostMallocNonCoherent``: Allocate non-coherent memory. Overrides
  ``HIP_COHERENT_HOST_ALLOC`` for the specific allocation. For further
  details, check :ref:`coherency_controls`.

All allocation flags are independent and can be used in any combination
without restriction. For instance, :cpp:func:`hipHostMalloc` can be called
with both ``hipHostMallocPortable`` and ``hipHostMallocMapped`` flags set, as
shown in the sketch below. Both usage models described above use the same
allocation flags; the difference is in how the surrounding code uses the host
memory.
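A minimal sketch of combining the flags (illustrative sizes, error checking
omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   int main() {
       float* host = nullptr;
       // one allocation with two flags combined: visible to all devices
       // (portable) and mapped into the device address space (mapped)
       hipHostMalloc(&host, 4096, hipHostMallocPortable | hipHostMallocMapped);

       float* device = nullptr;
       // obtain the device pointer corresponding to the mapped allocation
       hipHostGetDevicePointer(reinterpret_cast<void**>(&device), host, 0);

       // `device` can now be passed to kernels for zero-copy access

       hipHostFree(host);
   }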

NUMA-aware host memory allocation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The NUMA policy determines how memory is allocated.

.. note::

   By default, each GPU selects the NUMA CPU node with the least NUMA distance
   between them; that is, host memory is automatically allocated on the memory
   pool of the NUMA node closest to the current GPU device. Using the
   :cpp:func:`hipSetDevice` API to switch to a different GPU still allows
   access to the host allocation, but it can have a longer NUMA distance.

NUMA policy is implemented on Linux and is under development on Microsoft
Windows.
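A minimal sketch of a NUMA-aware allocation pattern on Linux (illustrative
sizes, error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   int main() {
       int deviceCount = 0;
       hipGetDeviceCount(&deviceCount);

       for (int d = 0; d < deviceCount; ++d) {
           hipSetDevice(d); // the current device determines the closest NUMA node

           float* host = nullptr;
           // with hipHostMallocNumaUser, the allocation follows the user's
           // NUMA policy; by default it lands on the node closest to device d
           hipHostMalloc(&host, 1 << 20, hipHostMallocNumaUser);

           // ... use the allocation together with device d, then release it
           hipHostFree(host);
       }
   }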

.. _coherency_controls:

Coherency controls
--------------------------------------------------------------------------------

AMD GPUs can have two different types of memory coherence:

* **Coarse-grained coherence** means that memory is only considered up to date
  at kernel boundaries, which can be enforced through
  :cpp:func:`hipDeviceSynchronize`, :cpp:func:`hipStreamSynchronize`, or any
  blocking operation that acts on the null stream (e.g.
  :cpp:func:`hipMemcpy`). For example, cacheable memory is a type of
  coarse-grained memory where an up-to-date copy of the data can be stored
  elsewhere (e.g. in an L2 cache).
* **Fine-grained coherence** means the coherence is supported while a CPU/GPU
  kernel is running. This can be useful if both host and device operate on the
  same dataspace using system-scope atomic operations (e.g. updating an error
  code or a flag in a buffer), as illustrated in the sketch after this list.
  Fine-grained memory implies that up-to-date data may be made visible to
  others regardless of kernel boundaries.
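A minimal sketch of the fine-grained pattern (this assumes the default,
fine-grained :cpp:func:`hipHostMalloc` allocation is directly accessible from
the device, which holds on typical ROCm systems; error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void produce(int* flag) {
       // ... produce data ...
       __threadfence_system(); // make prior writes visible system-wide
       atomicExch(flag, 1);    // system-visible store to fine-grained memory
   }

   int main() {
       int* flag = nullptr;
       hipHostMalloc(&flag, sizeof(int), hipHostMallocDefault); // fine-grained
       *flag = 0;

       produce<<<1, 1>>>(flag);

       // the host can observe the update while the kernel is still running,
       // because the allocation is fine-grained (coherent)
       while (__atomic_load_n(flag, __ATOMIC_ACQUIRE) == 0) { /* spin */ }

       hipDeviceSynchronize();
       hipHostFree(flag);
   }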


HIP provides the developer with controls to select which type of memory is
used via allocation flags passed to :cpp:func:`hipHostMalloc` and the
``HIP_HOST_COHERENT`` environment variable. By default, ``HIP_HOST_COHERENT``
is set to 0 in HIP.

.. note::

In order to achieve this fine-grained coherence, many AMD GPUs use a limited
cache policy, such as leaving these allocations uncached by the GPU, or making
them read-only.

.. list-table:: Memory coherence control
:widths: 25, 35, 20, 20
:header-rows: 1
:align: center

* - API
- Flag
- :cpp:func:`hipMemAdvise` call with argument
- Coherence
* - ``hipHostMalloc``
- ``hipHostMallocDefault``
-
- Fine-grained
* - ``hipHostMalloc``
- ``hipHostMallocNonCoherent`` :sup:`1`
-
- Coarse-grained
* - ``hipExtMallocWithFlags``
  - ``hipDeviceMallocDefault``
  -
  - Coarse-grained
* - ``hipExtMallocWithFlags``
  - ``hipDeviceMallocFinegrained``
  -
  - Fine-grained
* - ``hipMallocManaged``
-
-
- Fine-grained
* - ``hipMallocManaged``
-
- ``hipMemAdviseSetCoarseGrain``
- Coarse-grained
* - ``malloc``
-
-
- Fine-grained
* - ``malloc``
-
- ``hipMemAdviseSetCoarseGrain``
- Coarse-grained

:sup:`1` At :cpp:func:`hipHostMalloc`, the coherence mode can be affected by
the ``HIP_HOST_COHERENT`` environment variable when ``hipHostMallocCoherent``,
``hipHostMallocNonCoherent`` and ``hipHostMallocMapped`` are all 0 and one of
the other flags is set to 1. In that case, if ``HIP_HOST_COHERENT`` is not
defined, or is defined as 0, the host memory allocation is coarse-grained.

.. note::

   * When ``hipHostMallocMapped=1`` is set, the allocated host memory is
     fine-grained and the ``hipHostMallocNonCoherent`` flag is ignored.
   * Setting both ``hipHostMallocCoherent=1`` and ``hipHostMallocNonCoherent=1``
     is illegal.
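A minimal sketch of switching a managed allocation to coarse-grained
coherence, as in the ``hipMallocManaged`` rows of the table above
(illustrative sizes, error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   int main() {
       constexpr size_t size = 1 << 20;

       float* managed = nullptr;
       hipMallocManaged(reinterpret_cast<void**>(&managed), size); // fine-grained by default

       int device = 0;
       hipGetDevice(&device);
       // opt in to coarse-grained coherence for this range: visibility is then
       // only guaranteed at synchronization points, in exchange for performance
       hipMemAdvise(managed, size, hipMemAdviseSetCoarseGrain, device);

       // ... launch kernels on `managed` ...
       hipDeviceSynchronize(); // coarse-grained data is up to date here

       hipFree(managed);
   }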

Visibility of synchronization functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fine-grained memory is visible at synchronization points automatically; for
coarse-grained memory, visibility depends on the synchronization function
used. The effects and visibility of the synchronization functions on the
different memory coherence types are collected in the following table.

.. list-table:: HIP API

@@ -305,7 +377,7 @@

Non-coherent
- no

``hipEventSynchronize``
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Developers can control the release scope for HIP events:

@@ -342,48 +414,3 @@

CPU compiler's optimizations on memory access.
Please note that HIP streams do not guarantee concurrency on AMD hardware when
multiple (at least six) long-running streams execute concurrently and use
``hipStreamSynchronize(nullptr)`` for synchronization.

2 changes: 1 addition & 1 deletion include/hip/hip_runtime_api.h
@@ -717,7 +717,7 @@ enum hipLimit_t {

/** Allocates the memory as write-combined. On some system configurations, write-combined allocation
* may be transferred faster across the PCI Express bus, however, could have low read efficiency by
* most CPUs. It's a good option for data transfer from host to device via mapped pinned memory.*/
#define hipHostMallocWriteCombined 0x4
#define hipHostAllocWriteCombined 0x4
