Update memory management page

neon60 committed Sep 30, 2024
1 parent 79fd95d commit 2e4c7a0

Showing 2 changed files with 150 additions and 123 deletions.
271 changes: 149 additions & 122 deletions docs/how-to/hip_runtime_api/memory_management.rst
@@ -16,6 +16,51 @@

its own distinct memory. Kernels execute mainly on device memory; the runtime
offers functions for allocating, deallocating, and copying device memory, along
with transferring data between host and device memory.
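The following is a minimal sketch of this lifecycle (the kernel and buffer
names are illustrative, and error checking is omitted for brevity):

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <vector>

   __global__ void increment(float* data, size_t n) {
       size_t i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) data[i] += 1.0f;
   }

   int main() {
       constexpr size_t n = 1024;
       std::vector<float> host(n, 0.0f);

       float* device = nullptr;
       hipMalloc(&device, n * sizeof(float)); // allocate device (global) memory
       // transfer input from host to device
       hipMemcpy(device, host.data(), n * sizeof(float), hipMemcpyHostToDevice);

       increment<<<(n + 255) / 256, 256>>>(device, n); // kernel works on device memory

       // transfer results back and deallocate
       hipMemcpy(host.data(), device, n * sizeof(float), hipMemcpyDeviceToHost);
       hipFree(device);
   }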

Device memory
================================================================================

Device memory exists on the device, e.g. on GPUs in the video random access
memory (VRAM), and is accessible by the kernels operating on the device. Recent
architectures use graphics double data rate (GDDR) synchronous dynamic
random-access memory (SDRAM) such as GDDR6, or high-bandwidth memory (HBM) such
as HBM2e. Device memory can be allocated as global, constant, texture, or
surface memory.

Global memory
--------------------------------------------------------------------------------

Read-write storage visible to all threads on a given device. There are
specialized versions of global memory with different usage semantics which are
typically backed by the same hardware, but can use different caching paths.

Constant memory
--------------------------------------------------------------------------------

Read-only storage visible to all threads on a given device. It is a limited
segment backed by device memory with a queryable size, and it must be set by
the host before kernel execution. Constant memory provides the best
performance when all threads within a warp access the same address.
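A minimal sketch of setting and using constant memory (illustrative names,
error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __constant__ float coeffs[4]; // constant memory, visible to all threads

   __global__ void transform(float* data, size_t n) {
       size_t i = blockIdx.x * blockDim.x + threadIdx.x;
       // all threads in a warp read the same address: the fast path
       if (i < n) data[i] = coeffs[0] * data[i] + coeffs[1];
   }

   int main() {
       float hostCoeffs[4] = {2.0f, 0.5f, 0.0f, 0.0f};
       // the host must set constant memory before launching the kernel
       hipMemcpyToSymbol(HIP_SYMBOL(coeffs), hostCoeffs, sizeof(hostCoeffs));
       // ... allocate `data` with hipMalloc and launch transform as usual
   }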

Texture memory
--------------------------------------------------------------------------------

Read-only storage visible to all threads on a given device and accessible
through additional APIs. It originates from graphics APIs and provides
performance benefits when memory is accessed in patterns where the addresses
are close to each other in a 2D representation of the memory.

The texture management module of the HIP runtime API contains the functions
for managing texture memory.
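A minimal sketch of creating and sampling a texture object (illustrative names
and sizes, error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void sample(hipTextureObject_t tex, float* out, int w, int h) {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x < w && y < h) out[y * w + x] = tex2D<float>(tex, x, y);
   }

   int main() {
       int w = 256, h = 256;

       // 2D array in device memory backing the texture
       hipArray_t array;
       hipChannelFormatDesc channels = hipCreateChannelDesc<float>();
       hipMallocArray(&array, &channels, w, h);

       hipResourceDesc resDesc{};
       resDesc.resType = hipResourceTypeArray;
       resDesc.res.array.array = array;

       hipTextureDesc texDesc{};
       texDesc.addressMode[0] = hipAddressModeClamp;
       texDesc.addressMode[1] = hipAddressModeClamp;
       texDesc.filterMode = hipFilterModePoint;
       texDesc.readMode = hipReadModeElementType;

       hipTextureObject_t tex;
       hipCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

       // ... fill the array with hipMemcpy2DToArray and launch sample, then:
       hipDestroyTextureObject(tex);
       hipFreeArray(array);
   }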

Surface memory
--------------------------------------------------------------------------------

A read-write version of texture memory, which can be useful for applications
that require direct manipulation of 1D, 2D, or 3D ``hipArray_t`` data.

The surface objects module of the HIP runtime API contains the functions for
creating, destroying, reading, and writing surface memory.
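A minimal sketch of creating a surface object and reading and writing through
it (illustrative names and sizes, error checking omitted; the x coordinate of
``surf2Dread``/``surf2Dwrite`` is assumed to be a byte offset, following the
CUDA-compatible convention):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void invert(hipSurfaceObject_t surf, int w, int h) {
       int x = blockIdx.x * blockDim.x + threadIdx.x;
       int y = blockIdx.y * blockDim.y + threadIdx.y;
       if (x < w && y < h) {
           float v;
           surf2Dread(&v, surf, x * sizeof(float), y);       // read-write access
           surf2Dwrite(1.0f - v, surf, x * sizeof(float), y);
       }
   }

   int main() {
       hipArray_t array;
       hipChannelFormatDesc channels = hipCreateChannelDesc<float>();
       // the backing array must be created with surface load/store enabled
       hipMallocArray(&array, &channels, 256, 256, hipArraySurfaceLoadStore);

       hipResourceDesc resDesc{};
       resDesc.resType = hipResourceTypeArray;
       resDesc.res.array.array = array;

       hipSurfaceObject_t surf;
       hipCreateSurfaceObject(&surf, &resDesc);

       // ... launch invert, then clean up:
       hipDestroySurfaceObject(surf);
       hipFreeArray(array);
   }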

Host memory
================================================================================

@@ -168,19 +213,6 @@

The following example shows how to use pinned memory in HIP. The pinned memory
allocation is controlled by different flags, which are described in detail at
:ref:`memory_allocation_flags`.
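Below is a minimal sketch: pinned memory is allocated with
:cpp:func:`hipHostMalloc` and released with :cpp:func:`hipHostFree`
(illustrative sizes, error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   int main() {
       constexpr size_t n = 1 << 20;

       float* pinned = nullptr;
       // pinned (page-locked) host memory
       hipHostMalloc(&pinned, n * sizeof(float), hipHostMallocDefault);

       float* device = nullptr;
       hipMalloc(&device, n * sizeof(float));

       // copies from pinned memory can be performed by DMA and are typically
       // faster than copies from pageable memory allocated with plain malloc
       hipMemcpy(device, pinned, n * sizeof(float), hipMemcpyHostToDevice);

       hipFree(device);
       hipHostFree(pinned); // released with hipHostFree, not free()
   }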


.. _memory_allocation_flags:

Memory allocation flags of pinned memory
@@ -193,88 +225,128 @@

host memory:
not just the one on which the allocation is made.
* ``hipHostMallocMapped``: Map the allocation into the address space for
the current device, and the device pointer can be obtained with
:cpp:func:`hipHostGetDevicePointer`.
* ``hipHostMallocNumaUser``: The flag to allow host memory allocation to
  follow the NUMA policy set by the user. The target of the NUMA policy is to
  select a CPU that is closest to each GPU; the NUMA distance is a measure of
  how far apart a GPU and a CPU are.
* ``hipHostMallocWriteCombined``: Allocates the memory as write-combined. On
  some system configurations, write-combined allocation may be transferred
  faster across the PCI Express bus; however, it can have low read efficiency
  on most CPUs. It's a good option for data transfer from host to device via
  mapped pinned memory.
* ``hipHostMallocCoherent``: Allocate coherent memory. Overrides
  ``HIP_COHERENT_HOST_ALLOC`` for the specific allocation. For further
  details, check :ref:`coherency_controls`.
* ``hipHostMallocNonCoherent``: Allocate non-coherent memory. Overrides
  ``HIP_COHERENT_HOST_ALLOC`` for the specific allocation. For further
  details, check :ref:`coherency_controls`.

All allocation flags are independent and can be used in any combination
without restriction. For instance, :cpp:func:`hipHostMalloc` can be called
with both ``hipHostMallocPortable`` and ``hipHostMallocMapped`` flags set, as
shown in the sketch below. Both usage models described above use the same
allocation flags; the difference is in how the surrounding code uses the host
memory.
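A minimal sketch of combining the flags (illustrative sizes, error checking
omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   int main() {
       float* host = nullptr;
       // one allocation with two flags combined: visible to all devices
       // (portable) and mapped into the device address space (mapped)
       hipHostMalloc(&host, 4096, hipHostMallocPortable | hipHostMallocMapped);

       float* device = nullptr;
       // obtain the device pointer corresponding to the mapped allocation
       hipHostGetDevicePointer(reinterpret_cast<void**>(&device), host, 0);

       // `device` can now be passed to kernels for zero-copy access

       hipHostFree(host);
   }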

NUMA-aware host memory allocation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The NUMA policy determines how memory is allocated.

.. note::

   By default, each GPU selects the NUMA CPU node with the least NUMA distance
   between them; that is, host memory is automatically allocated on the memory
   pool of the NUMA node closest to the current GPU device. Using the
   :cpp:func:`hipSetDevice` API to switch to a different GPU still allows
   access to the host allocation, but it can have a longer NUMA distance.

NUMA policy is implemented on Linux and is under development on Microsoft
Windows.
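A minimal sketch of a NUMA-aware allocation pattern on Linux (illustrative
sizes, error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   int main() {
       int deviceCount = 0;
       hipGetDeviceCount(&deviceCount);

       for (int d = 0; d < deviceCount; ++d) {
           hipSetDevice(d); // the current device determines the closest NUMA node

           float* host = nullptr;
           // with hipHostMallocNumaUser, the allocation follows the user's
           // NUMA policy; by default it lands on the node closest to device d
           hipHostMalloc(&host, 1 << 20, hipHostMallocNumaUser);

           // ... use the allocation together with device d, then release it
           hipHostFree(host);
       }
   }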

.. _coherency_controls:

Coherency controls
--------------------------------------------------------------------------------

AMD GPUs can have two different types of memory coherence:

* **Coarse-grained coherence** means that memory is only considered up to date
  at kernel boundaries, which can be enforced through
  :cpp:func:`hipDeviceSynchronize`, :cpp:func:`hipStreamSynchronize`, or any
  blocking operation that acts on the null stream (e.g.
  :cpp:func:`hipMemcpy`). For example, cacheable memory is a type of
  coarse-grained memory where an up-to-date copy of the data can be stored
  elsewhere (e.g. in an L2 cache).
* **Fine-grained coherence** means the coherence is supported while a CPU/GPU
  kernel is running. This can be useful if both host and device operate on the
  same dataspace using system-scope atomic operations (e.g. updating an error
  code or a flag in a buffer), as illustrated in the sketch after this list.
  Fine-grained memory implies that up-to-date data may be made visible to
  others regardless of kernel boundaries.
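A minimal sketch of the fine-grained pattern (this assumes the default,
fine-grained :cpp:func:`hipHostMalloc` allocation is directly accessible from
the device, which holds on typical ROCm systems; error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   __global__ void produce(int* flag) {
       // ... produce data ...
       __threadfence_system(); // make prior writes visible system-wide
       atomicExch(flag, 1);    // system-visible store to fine-grained memory
   }

   int main() {
       int* flag = nullptr;
       hipHostMalloc(&flag, sizeof(int), hipHostMallocDefault); // fine-grained
       *flag = 0;

       produce<<<1, 1>>>(flag);

       // the host can observe the update while the kernel is still running,
       // because the allocation is fine-grained (coherent)
       while (__atomic_load_n(flag, __ATOMIC_ACQUIRE) == 0) { /* spin */ }

       hipDeviceSynchronize();
       hipHostFree(flag);
   }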


HIP provides the developer with controls to select which type of memory is
used via allocation flags passed to :cpp:func:`hipHostMalloc` and the
``HIP_HOST_COHERENT`` environment variable. By default, ``HIP_HOST_COHERENT``
is set to 0 in HIP.

.. note::

In order to achieve this fine-grained coherence, many AMD GPUs use a limited
cache policy, such as leaving these allocations uncached by the GPU, or making
them read-only.

.. list-table:: Memory coherence control
:widths: 25, 35, 20, 20
:header-rows: 1
:align: center

* - API
- Flag
- :cpp:func:`hipMemAdvise` call with argument
- Coherence
* - ``hipHostMalloc``
- ``hipHostMallocDefault``
-
- Fine-grained
* - ``hipHostMalloc``
- ``hipHostMallocNonCoherent`` :sup:`1`
-
- Coarse-grained
* - ``hipExtMallocWithFlags``
  - ``hipDeviceMallocDefault``
  -
  - Coarse-grained
* - ``hipExtMallocWithFlags``
  - ``hipDeviceMallocFinegrained``
  -
  - Fine-grained
* - ``hipMallocManaged``
-
-
- Fine-grained
* - ``hipMallocManaged``
-
- ``hipMemAdviseSetCoarseGrain``
- Coarse-grained
* - ``malloc``
-
-
- Fine-grained
* - ``malloc``
-
- ``hipMemAdviseSetCoarseGrain``
- Coarse-grained

:sup:`1` At :cpp:func:`hipHostMalloc`, the coherence mode can be affected by
the ``HIP_HOST_COHERENT`` environment variable when ``hipHostMallocCoherent``,
``hipHostMallocNonCoherent`` and ``hipHostMallocMapped`` are all 0 and one of
the other flags is set to 1. In that case, if ``HIP_HOST_COHERENT`` is not
defined, or is defined as 0, the host memory allocation is coarse-grained.

.. note::

   * When ``hipHostMallocMapped=1`` is set, the allocated host memory is
     fine-grained and the ``hipHostMallocNonCoherent`` flag is ignored.
   * Setting both ``hipHostMallocCoherent=1`` and ``hipHostMallocNonCoherent=1``
     is illegal.
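A minimal sketch of switching a managed allocation to coarse-grained
coherence, as in the ``hipMallocManaged`` rows of the table above
(illustrative sizes, error checking omitted):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   int main() {
       constexpr size_t size = 1 << 20;

       float* managed = nullptr;
       hipMallocManaged(reinterpret_cast<void**>(&managed), size); // fine-grained by default

       int device = 0;
       hipGetDevice(&device);
       // opt in to coarse-grained coherence for this range: visibility is then
       // only guaranteed at synchronization points, in exchange for performance
       hipMemAdvise(managed, size, hipMemAdviseSetCoarseGrain, device);

       // ... launch kernels on `managed` ...
       hipDeviceSynchronize(); // coarse-grained data is up to date here

       hipFree(managed);
   }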

Visibility of synchronization functions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Fine-grained memory is visible at synchronization points automatically; for
coarse-grained memory, visibility depends on the synchronization function
used. The effects and visibility of the synchronization functions on the
different memory coherence types are collected in the following table.

.. list-table:: HIP API

@@ -305,7 +377,7 @@

Non-coherent
- no

``hipEventSynchronize``
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Developers can control the release scope for HIP events:

@@ -342,48 +414,3 @@

CPU compiler's optimizations on memory access.
Please note that HIP streams do not guarantee concurrency on AMD hardware when
multiple (at least six) long-running streams execute concurrently and use
``hipStreamSynchronize(nullptr)`` for synchronization.

2 changes: 1 addition & 1 deletion include/hip/hip_runtime_api.h
@@ -717,7 +717,7 @@ enum hipLimit_t {

/** Allocates the memory as write-combined. On some system configurations, write-combined allocation
* may be transferred faster across the PCI Express bus, however, could have low read efficiency by
* most CPUs. It's a good option for data transfer from host to device via mapped pinned memory.*/
#define hipHostMallocWriteCombined 0x4
#define hipHostAllocWriteCombined 0x4
