Best Practices Guide

Dmytro Vyazelenko edited this page Jan 19, 2024 · 43 revisions

Like any non-trivial system, Aeron has a set of current best practices associated with its use. This guide aims to describe the best practices for using Aeron for your message-based communications. It is intended to be a living document.

System Design

Systems utilising Aeron as a transport should consider the number of Channels and Streams as well as the number of Publishers and Subscribers on each channel and stream. A stream, while identified only by a number, requires a fixed amount of resources, including buffering between the Media Driver and the client applications. Aeron has been designed on the expectation that the number of streams in use will normally be in the 10s or 100s. 1000s are possible, but very unlikely.

Because streams live within a channel, the number of channels is assumed to be small as well.

Systems that need large numbers of streams for separation should consider a framing (and muxing) protocol on top of Aeron.

Settings

Aeron has many settings that can be used to tweak various aspects of operation. However, it should only be necessary to adjust these settings to get the most out of the system. In most cases, the defaults should operate just fine and provide, if not optimal, at least a decent starting point for further optimisations via resource trade-offs.

Scattered throughout the topics below, you will see some settings mentioned. This is not an exhaustive list of settings. For that, please see the source or feel free to ask questions.

Threading

Applications as well as the Media Driver may have a number of threads concerned with various aspects of Aeron operation.

Media Driver

A Media Driver, whether being run embedded or not, needs 1-3 threads to perform its operation. The system property aeron.threading.mode controls how many threads a Media Driver instance needs to use for operation.

There are three main Agents in the driver:

  1. Conductor: Responsible for reacting to client requests and housekeeping duties as well as detecting loss, sending NAKs, rotating buffers, etc.
  2. Sender: Responsible for transferring messages from publishers to the network media.
  3. Receiver: Responsible for transferring messages from the network media to subscribers.

The value of aeron.threading.mode can be one of:

  • INVOKER: No threads. The client is responsible for using the MediaDriver.Context.driverAgentInvoker() to invoke the duty cycle directly.
  • SHARED: All Agents share a single thread. 1 thread in total.
  • SHARED_NETWORK: Sender and Receiver share a thread; the Conductor has its own thread. 2 threads in total.
  • DEDICATED: The default. Dedicates one thread per Agent. 3 threads in total.

For performance, it is recommended to use DEDICATED as long as the number of busy threads is less than or equal to the number of spare cores on the machine. If there are not enough cores to dedicate, then consider sharing some with SHARED_NETWORK or SHARED. INVOKER can be used in low resource environments where the application using Aeron can invoke the media driver to carry out its duty cycle at a regular interval.
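The selection rule above can be sketched in plain Java. This is only an illustration: the class ThreadingModeChooser, its Mode enum and the "available processors minus one" estimate of spare cores are assumptions made for the example, not part of the Aeron API.

```java
final class ThreadingModeChooser {
    /** Local stand-in for the driver threading modes (hypothetical, not the Aeron enum). */
    enum Mode {
        INVOKER(0), SHARED(1), SHARED_NETWORK(2), DEDICATED(3);

        final int threads; // driver threads the mode needs

        Mode(int threads) {
            this.threads = threads;
        }
    }

    /** Picks the mode using the most driver threads that still fits in the spare cores. */
    static Mode choose(int spareCores) {
        if (spareCores >= Mode.DEDICATED.threads) return Mode.DEDICATED;
        if (spareCores >= Mode.SHARED_NETWORK.threads) return Mode.SHARED_NETWORK;
        if (spareCores >= Mode.SHARED.threads) return Mode.SHARED;
        return Mode.INVOKER;
    }

    public static void main(String[] args) {
        // Assumption: one core is kept back for the application itself.
        int spare = Math.max(0, Runtime.getRuntime().availableProcessors() - 1);
        Mode mode = choose(spare);
        System.out.println("-Daeron.threading.mode=" + mode + " (" + mode.threads + " driver threads)");
    }
}
```

In practice the count of spare cores should come from knowledge of what else runs on the machine rather than from the processor count alone.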

Idle Strategies

Within the Media Driver and possibly within some applications, Idle Strategies might be used to specify what Agent duty cycles should do if/when no work is done. An Idle Strategy takes a parameter indicating how much work was done in the last duty cycle and handles idling in various ways. You can also specify your own idle strategies.

There are a couple of strategies that are important to understand.

  1. BusySpinIdleStrategy busy-spins while idle and will therefore consume CPU continuously.
  2. BackOffIdleStrategy uses a backoff strategy of spinning, yielding, and parking to be kinder to the CPU, at the cost of being less responsive to activity after it has been idle for a little while.

The main difference between strategies is how responsive the idler should be to changes after being idle for a little while, and how much CPU should be consumed when no work is being done. There is an inherent trade-off to consider.
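To make the trade-off concrete, here is a minimal sketch of a spin, yield, park backoff written against the JDK only. The class BackoffIdlerSketch and its thresholds are hypothetical and only mimic the shape of an idle strategy. It is not the Agrona implementation.

```java
import java.util.concurrent.locks.LockSupport;

final class BackoffIdlerSketch {
    enum Phase { SPIN, YIELD, PARK }

    private final int maxSpins;
    private final int maxYields;
    private final long parkNanos;
    private int spins;
    private int yields;

    BackoffIdlerSketch(int maxSpins, int maxYields, long parkNanos) {
        this.maxSpins = maxSpins;
        this.maxYields = maxYields;
        this.parkNanos = parkNanos;
    }

    /** The phase the next idle call with no work would use. */
    Phase phase() {
        if (spins < maxSpins) return Phase.SPIN;
        if (yields < maxYields) return Phase.YIELD;
        return Phase.PARK;
    }

    /** Call once per duty cycle with the amount of work that cycle did. */
    void idle(int workCount) {
        if (workCount > 0) {
            // Work was done: return to the most responsive phase.
            spins = 0;
            yields = 0;
            return;
        }
        switch (phase()) {
            case SPIN:  spins++;  Thread.onSpinWait(); break; // cheapest wake-up, burns CPU
            case YIELD: yields++; Thread.yield();      break; // lets other threads run
            default:    LockSupport.parkNanos(parkNanos);     // kindest to CPU, slowest to react
        }
    }

    public static void main(String[] args) {
        BackoffIdlerSketch idler = new BackoffIdlerSketch(2, 1, 1_000L);
        int[] workPerCycle = {3, 0, 0, 0, 0, 1};
        for (int work : workPerCycle) {
            idler.idle(work);
            System.out.println("work=" + work + " next idle phase=" + idler.phase());
        }
    }
}
```

A busy-spin strategy corresponds to never leaving the SPIN phase, which is why it stays responsive but keeps a core fully occupied.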

Media Driver Mains

There are a couple of default Media Driver main functions provided. A Media Driver may use one of these when run as a stand-alone process.

  1. MediaDriver is the default main and, by default, uses the BackOffIdleStrategy for idling. The aeron.threading.mode can be used to further refine the threading model.
  2. LowLatencyMediaDriver is the primary main for performance and uses the BusySpinIdleStrategy for Conductor and NoOpIdleStrategy for Sender and Receiver Agents. This main function automatically uses DEDICATED threading mode.

Application Threads

Aeron applications have most of the threading requirements controlled by the application. However, there is a per Aeron instance background thread, called the ClientConductor, that handles housekeeping and command interaction with the Media Driver. This thread may be controlled by the application by setting an Aeron.Context.threadFactory(), or Aeron can be left to spin up its own Thread.

In many cases, this thread has very simple requirements and can be run on a dirty CPU, i.e. it does not need a dedicated CPU to function well.

Subscriber applications have more requirements, however.

Subscribers must routinely call Subscription.poll to check for and deliver messages to the application. For the lowest latency and highest throughput, it is recommended to use a high frame limit for this call as well as BusySpinIdleStrategy or equivalent application control and dedicate a core to reception. The Agent class could be used to encapsulate this behaviour easily.

MTU Considerations

The Aeron MTU value impacts a lot of things. The default MTU is set to a value that is a good trade-off. However, it is suboptimal for some use cases involving very large (> 4KB) messages and for maximizing throughput above everything else. Various checks are done during publication and subscription/connection setup to verify that related settings are sensibly sized relative to the MTU. However, it is good to understand these relationships.

aeron.mtu.length on the Media Driver controls the length of the MTU of data frames. This value is communicated to the Aeron clients during registration, so applications do not have to configure the MTU themselves; they use the same value as the Media Driver.

An MTU value over the interface MTU will cause IP to fragment the datagram. This may increase the likelihood of loss under several circumstances. Before increasing the MTU over the interface MTU, first consider ways to increase the interface MTU.

The MTU value indicates the largest message that Aeron will send as a single data frame.

MTU length also has implications for socket buffer sizing. Please see below.
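The relationships above can be explored with a small calculation. The sketch below is built on assumptions, not driver code: it assumes a 32 byte Aeron data frame header, 28 bytes of IPv4 plus UDP headers, and uses 1408 as an example aeron.mtu.length. The class and method names are hypothetical.

```java
final class MtuFragmentSketch {
    static final int DATA_HEADER_LENGTH = 32; // assumption: Aeron data frame header size
    static final int IPV4_UDP_OVERHEAD = 28;  // 20 byte IPv4 header (no options) + 8 byte UDP header

    /** Largest message payload that fits in one data frame. */
    static int maxPayload(int mtu) {
        return mtu - DATA_HEADER_LENGTH;
    }

    /** Number of data frames needed to carry a message of the given length. */
    static int fragmentCount(int messageLength, int mtu) {
        int payload = maxPayload(mtu);
        return Math.max(1, (messageLength + payload - 1) / payload);
    }

    /** True if a full frame would force IP fragmentation on this interface. */
    static boolean exceedsInterfaceMtu(int aeronMtu, int interfaceMtu) {
        return aeronMtu + IPV4_UDP_OVERHEAD > interfaceMtu;
    }

    public static void main(String[] args) {
        int mtu = 1408; // example value, treated as an assumption
        System.out.println("max payload per frame: " + maxPayload(mtu));
        System.out.println("frames for a 4096 byte message: " + fragmentCount(4096, mtu));
        System.out.println("IP fragmentation on a 1500 byte interface: " + exceedsInterfaceMtu(mtu, 1500));
        System.out.println("IP fragmentation with an 8192 byte MTU: " + exceedsInterfaceMtu(8192, 1500));
    }
}
```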

Buffering Considerations

Aeron instances in applications, commonly referred to as "clients", communicate with Media Drivers via a set of buffers. The location of these buffers is normally in the OS file system. By default, the java.io.tmpdir or /dev/shm/ is used to hold these files. However, on systems without /dev/shm support it can be advantageous to move them to other places (see OS Related Considerations). The following property controls the directory that Media Drivers and Aeron instances use:

  • aeron.dir is the location directory containing the Aeron files.

Bounds checks are done by the buffer primitives by default in Aeron. These do take up some CPU cycles, but normally are predicted out. However, they can be disabled by setting agrona.disable.bounds.checks to true.

The length of term buffers is controlled by the aeron.term.buffer.length and aeron.ipc.term.buffer.length properties. The max length of a term buffer is 1GB. If larger than this, an exception will be generated and shown on the Media Driver console. Setting the term buffer length is mostly a concern for how far ahead a Publisher might be from Subscribers. As a quick and dirty measure, assume a Publisher can be up to a single term buffer ahead. For more details see Flow Control.
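A candidate term buffer length can be sanity-checked before it is handed to the Media Driver. The sketch below takes the 1GB maximum from the text above. The 64KB minimum and the power-of-two requirement are assumptions about the driver's validation, and the class name is hypothetical.

```java
final class TermLengthSketch {
    static final long TERM_MIN_LENGTH = 64L * 1024;           // assumption: minimum accepted length
    static final long TERM_MAX_LENGTH = 1024L * 1024 * 1024;  // 1GB maximum, as stated in the guide

    /** True if the length is within bounds and a power of two (assumed requirement). */
    static boolean isValid(long length) {
        return length >= TERM_MIN_LENGTH
            && length <= TERM_MAX_LENGTH
            && Long.bitCount(length) == 1;
    }

    public static void main(String[] args) {
        long[] candidates = {32L * 1024, 16L * 1024 * 1024, 24L * 1024 * 1024, 2L * 1024 * 1024 * 1024};
        for (long length : candidates) {
            System.out.println("-Daeron.term.buffer.length=" + length + " valid=" + isValid(length));
        }
    }
}
```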

Congested Networks and Loss

When running Aeron over a network that may be congested, and thus could experience significant loss, consider running with Congestion Control enabled. Loss can be detected by increasing NAK counters, which can be observed with AeronStat and investigated in detail with the LossReport tool.

Monitoring Considerations

Monitoring of various aspects of operation can be done by using the AeronStat utility to display the value of the various counters of the Media Driver and clients. In addition, reading these counters programmatically is relatively simple.

Flow Control Considerations

Flow control is discussed in terms of how it functions. However, the implications for usage may not be obvious.

The Receiver Window is how much data a Sender can send immediately to a Receiver. This window length has a lot to do with the maximum throughput of a stream. The larger the window, the more throughput. The default window length allows for decent rates while limiting the amount of outstanding data before a publisher is flow controlled. Increasing the length of the window to 2MB or more should be plenty in most situations to allow high throughput rates.
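A rough way to reason about window length is the usual window-over-round-trip bound: a sender that may have at most one window of data outstanding cannot exceed roughly window / RTT. The sketch below applies that general rule. It is not taken from the Aeron flow control code, and the class name, window lengths and round trip time are example assumptions.

```java
final class WindowThroughputSketch {
    /** Upper bound on throughput when at most one window may be in flight per round trip. */
    static long maxBytesPerSecond(long windowBytes, long roundTripMicros) {
        return windowBytes * 1_000_000L / roundTripMicros;
    }

    public static void main(String[] args) {
        long rttMicros = 1_000; // example: 1ms round trip
        long[] windows = {128L * 1024, 2L * 1024 * 1024};
        for (long window : windows) {
            long bytesPerSecond = maxBytesPerSecond(window, rttMicros);
            System.out.println("window=" + window + " bytes, rtt=" + rttMicros
                + "us -> at most " + (bytesPerSecond * 8 / 1_000_000) + " Mbit/s");
        }
    }
}
```

The bound only shows why a larger window permits higher rates. The achievable rate also depends on how promptly the receiver reports consumption.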

OS Related Considerations

Operating system socket buffers have an impact on some of the settings within Aeron.

  • SO_RCVBUF can impact loss rates when too small for the given processing. If too large, this buffer can increase latency. Values that tend to work well with Aeron are 2MB to 4MB. This setting must be large enough for the MTU of the sender. If not, persistent loss can result. In addition, the receiver window length should be less than or equal to this value to allow plenty of space for burst traffic from a sender.

  • SO_SNDBUF can impact loss rate. Loss can occur on the sender side due to this buffer being too small. This buffer must be large enough to accommodate the MTU as a minimum. In addition, some systems, most notably Windows, need plenty of buffering on the send side to reach adequate throughput rates. If too large, this buffer can increase latency or cause loss. This usually should be less than 2MB.
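One way to see what the operating system actually grants is to request a buffer size on a UDP socket and read the option back. The sketch below uses only the JDK's DatagramChannel. The class SocketBufferProbe and the isConsistent check are hypothetical helpers that restate the SO_RCVBUF rules above; the driver may validate differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.StandardSocketOptions;
import java.nio.channels.DatagramChannel;

final class SocketBufferProbe {
    /** Requests a receive buffer size and returns the size the OS reports back. */
    static int grantedRcvBuf(int requestedBytes) {
        try (DatagramChannel channel = DatagramChannel.open()) {
            channel.setOption(StandardSocketOptions.SO_RCVBUF, requestedBytes);
            return channel.getOption(StandardSocketOptions.SO_RCVBUF);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** SO_RCVBUF must hold at least one MTU, and the receiver window should fit inside it. */
    static boolean isConsistent(int soRcvBuf, int mtu, int receiverWindow) {
        return soRcvBuf >= mtu && receiverWindow <= soRcvBuf;
    }

    public static void main(String[] args) {
        int requested = 2 * 1024 * 1024;
        int granted = grantedRcvBuf(requested);
        System.out.println("SO_RCVBUF requested=" + requested + " granted=" + granted);
        if (granted < requested) {
            System.out.println("The OS capped the buffer; check the OS limits for socket buffers.");
        }
    }
}
```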

Linux

As was mentioned above, changing the location of the buffers for Aeron can be a good thing. For Linux, this means that /dev/shm will be the location of the buffers if present.

Linux normally requires some sysctl values to be set. Two of them are net.core.rmem_max, which allows larger SO_RCVBUF values to be set, and net.core.wmem_max, which allows larger SO_SNDBUF values to be set.
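The current limits can be read from /proc/sys/net/core, which mirrors these sysctl names. The sketch below compares them against a desired 2MB buffer. The class SocketLimitCheck and the 2MB target are assumptions for illustration, and on systems without these files the sketch simply reports the limit as unavailable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class SocketLimitCheck {
    /** Parses the single number a /proc/sys limit file contains. */
    static long parseLimit(String fileContents) {
        return Long.parseLong(fileContents.trim());
    }

    /** Returns the limit in bytes, or -1 if the file is missing or unreadable. */
    static long readLimit(String procPath) {
        Path path = Paths.get(procPath);
        if (!Files.isReadable(path)) {
            return -1;
        }
        try {
            return parseLimit(new String(Files.readAllBytes(path)));
        } catch (IOException | NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        long desired = 2L * 1024 * 1024; // example target buffer size
        String[] paths = {"/proc/sys/net/core/rmem_max", "/proc/sys/net/core/wmem_max"};
        for (String procPath : paths) {
            long limit = readLimit(procPath);
            String verdict = limit < 0 ? "unavailable"
                : (limit >= desired ? "ok" : "too small, raise it with sysctl");
            System.out.println(procPath + " = " + limit + " -> " + verdict);
        }
    }
}
```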

Windows

Windows tends to use SO_SNDBUF values that are too small. It is recommended to use values of 1MB or greater.

Note: Since Windows does not have built-in support for /dev/shm, it is advised to create a RAM disk for the Aeron directory (aeron.dir). This can be done with a tool like http://www.radeonramdisk.com/.

Mac/Darwin

Mac tends to use SO_SNDBUF values that are too small. It is recommended to use larger values, like 16KB.

Note: Since Mac OS does not have built-in support for /dev/shm, it is advised to create a RAM disk for the Aeron directory (aeron.dir).

You can create a RAM disk with the following command:

$ diskutil erasevolume HFS+ "DISK_NAME" `hdiutil attach -nomount ram://$((2048 * SIZE_IN_MB))`

where:

  • DISK_NAME should be replaced with a name of your choice.
  • SIZE_IN_MB is the size in megabytes for the disk (e.g. 4096 for a 4GB disk).

For example, the following command creates a RAM disk named DevShm which is 2GB in size:

$ diskutil erasevolume HFS+ "DevShm" `hdiutil attach -nomount ram://$((2048 * 2048))`

After this command is executed the new disk will be mounted under /Volumes/DevShm.

C Error handling

The C code used throughout Aeron has a specific mechanism built to manage the propagation of errors in a useful and readable way. There are specific patterns for its use, which are outlined here. There are two main macros that should be used when handling errors (three if you are using Windows). They are AERON_SET_ERR, AERON_APPEND_ERR and AERON_SET_ERR_WIN. These macros allow the code to build up an error stack that traces an error through the call stack, and allow functions to add context information to the error message to aid diagnosis without having to push data down or up the call stack to where the original error is reported.

Using AERON_SET_ERR

The AERON_SET_ERR macro should be used at the first point where an error is encountered. There are two common cases for its use. Firstly, when an error is returned from any function that does NOT start with aeron_. Most commonly this will be system calls or calls to third party libraries (e.g. setsockopt, sendmsg). Secondly, if some input validation is required within a function and the input is incorrect, the function should use AERON_SET_ERR. The first parameter to AERON_SET_ERR is an error code. This can be a system error code, e.g. one that is set in errno by a libc call, or it can be a negated aeron error code. The remaining parameters are a printf format string and a matching variable arguments list. If an Aeron function has called AERON_SET_ERR then it will generally

Using AERON_APPEND_ERR

The AERON_APPEND_ERR call is used when an error has been detected by an Aeron function, i.e. any function prefixed with aeron_ that returns -1 to indicate an error. In this case the assumption should be that AERON_SET_ERR has already been called. AERON_APPEND_ERR adds context information to the error stack.