[RFC]Observability Correlation Zones #123

YANG-DB · 2024-02-11T06:07:17Z

Correlation Zones RFC

This RFC presents an observability-driven framework that aims to automate and simplify the daily tasks of a Site Reliability Engineer (SRE).

This work will help an SRE identify critical issues within the system and attend to them in the fastest and simplest manner.

The Problem Domain

In the observability domain, the amount of data collected by different agents is overwhelming. Various types of observation data are collected into the system and need to be observed in a single projection.

The simplistic approach of directly monitoring different fields and metrics is not efficient, can't scale, and requires an ongoing monitoring effort that costs time and resources.

We are well aware that researching an anomaly or an incident in a system is a difficult and time-consuming daily task.

That is why we embarked on the mission of pre-collecting and prioritizing these anomalies and incidents into a special research area we call correlation zones.

Our goal in this RFC is very specific:

“How do we simplify the SRE’s daily system monitoring workflow in an effective and proactive way?”

As an example, imagine a cloud-based e-commerce system composed of a few hundred services, deployed in multiple regions and covering multiple geographical domains and time zones.

The incoming observability data collected from the system amounts to a few dozen terabytes a day and includes multiple types of datapoints for different features of the system.

Solution Objectives

Our goal in this initiative is to drastically simplify and automate the SRE’s daily workflow and introduce a preemptive approach for monitoring and problem-detection use cases.

Leveraging the standard Observability Protocol and data collection pipeline, together with deep insight we collected from interactions with large customers, we are introducing the correlation zones concept, framework and dashboards to help the Observability community and industry.

The correlation zones concept and framework is intended to change the way observability SREs work and to make room for efficient usage patterns that scale.

One key goal of this (correlation zones) framework is to reduce the SRE’s interaction with the (observability-monitoring) system to the minimal extent, so that the engineer focuses only on the significant segments that may help solve a potential problem.

Another main requirement for this framework is to be independent of any specific OpenSearch version.
It should be deployable without the need to upgrade OpenSearch or the dashboard, and be available as content for download directly from our website.

Decoupling content from code provides the following advantages:

Customer configuration and dynamic usage

Simple distribution and maintenance

Using versioning to differentiate content evolution (currently without auto-upgrade)

Collaborating with the community and encouraging it to contribute and lead the adoption of content-based solutions

Observability Structured Schema

As part of the wide community and industry effort to handle the large variety of collected observability data, we are using the OpenTelemetry protocol, which consolidates the different signals arriving from various system collection agents.

By utilizing the Observability Simple Schema, the collected data is structured in advance, which helps to build dedicated structured dashboards and workflows that accommodate this information.

Integration infrastructure

During the past few months OpenSearch has introduced the concept of pre-defined, opinionated visualizations that are built with the vision of creating ready-for-action dashboards and applications.

Integrations are a set of service-oriented assets that are bundled together to represent a specific resource whose output is later ingested and analyzed inside OpenSearch.
An integration relies on the well-structured notion of the Observability domain and protocol to assist in monitoring and visualizing the system's different elements.

The correlation zones infrastructure leverages integrations in order to assemble the various parts of the workflows. These workflows consist of both background tasks (preparing the data for visualization) and visual assets that assist the engineer in the daily tasks.

Data Preparation And Transformations

Using the knowledge we obtained about the typical daily workflows of an SRE, we are defining automation patterns and transformations that are the fundamental steps in the correlation zones.

The next paragraphs present these steps and detail how we are approaching the goal of simplifying the workflow:

1) Partitioning Traces By Features

There are cases in which dozens of different trace data producers are unrelated to one another, and it makes no sense to view them in a single pane.

Using the Simple Schema naming convention, we are able to separately ingest and visualize traces that belong to different data perspectives, whether they differ by application ID, by domain tenant, or by any other feature by which the customer would like to partition the trace data.

The correlation zones solution addresses this partitioning using the built-in Simple Schema naming convention patterns, which create a mechanism for partitioning data according to custom user-defined rules.

Here is some guidance on how ingestion routing can provide this pipeline-based partitioning:

Data Prepper’s ingestion routing allows the customer to partition data in the actual pipeline itself.
The OTEL collector’s transform processor allows configuring multiple context statements for traces, metrics, and logs. The value of context specifies which OTTL Context to use when interpreting the associated statements.
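
To make the partitioning idea concrete, here is a minimal sketch of how a routing rule could map a span to a target index. It assumes the Simple Schema index pattern `ss4o_{type}-{dataset}-{namespace}`; the `PartitionRule` class, the `tenant` attribute and the `target_index` helper are hypothetical names used only for illustration and are not part of Data Prepper or the OTEL collector.

```python
from dataclasses import dataclass

# Assumed Simple Schema naming pattern: ss4o_{type}-{dataset}-{namespace}
INDEX_PATTERN = "ss4o_{signal}-{dataset}-{namespace}"


@dataclass(frozen=True)
class PartitionRule:
    """Hypothetical user rule: which span attribute selects the partition."""
    attribute: str            # e.g. "application.id" or "tenant"
    default: str = "default"  # partition used when the attribute is missing


def target_index(span_attributes: dict, rule: PartitionRule,
                 signal: str = "traces", namespace: str = "default") -> str:
    """Return the index a span should be routed to under the given rule."""
    raw = str(span_attributes.get(rule.attribute, rule.default))
    # Index names are lowercase; "-" separates the pattern parts, so replace it.
    dataset = raw.lower().replace("-", "_").replace(" ", "_")
    return INDEX_PATTERN.format(signal=signal, dataset=dataset, namespace=namespace)


rule = PartitionRule(attribute="tenant")
print(target_index({"tenant": "EU-Shop", "service.name": "checkout"}, rule))
# ss4o_traces-eu_shop-default
print(target_index({"service.name": "checkout"}, rule))
# ss4o_traces-default-default
```

In a real deployment the same decision would be expressed in the pipeline configuration rather than in application code; the sketch only shows the mapping from a user rule to an index name.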

2) Service Based Pivoting

The main entry point for the Observability engineer will be the Services projection. This concept creates the abstract notion of a high-level (Service) entity that can be monitored and alerted on using different metrics and measurements.

A service has the following attributes and dimensions:

It operates in the time domain, meaning all collected datapoints are time based
It has a list of dimensions associated with the service's features (also known as semantic conventions)
It has a set of aggregations associated with the service’s behavior
It has a list of threshold boundaries that advise on how to measure anomalies (also known as SLO / SLI)
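
The four attributes above could be modelled roughly as follows. This is only a sketch: the `Service` and `Threshold` classes, their field names and the example values are assumptions made for illustration, not a schema defined by this RFC.

```python
from dataclasses import dataclass, field


@dataclass
class Threshold:
    """An SLO-style boundary on one aggregated measurement (hypothetical shape)."""
    metric: str       # name of the aggregation this boundary applies to
    max_value: float  # values above this are treated as anomalous


@dataclass
class Service:
    name: str
    time_field: str = "startTime"                     # all datapoints are time based
    dimensions: dict = field(default_factory=dict)    # semantic-convention attributes
    aggregations: list = field(default_factory=list)  # names of aggregated measurements
    thresholds: list = field(default_factory=list)    # SLO / SLI boundaries

    def violations(self, measurements: dict) -> list:
        """Return the metrics whose measured value exceeds its threshold."""
        return [t.metric for t in self.thresholds
                if measurements.get(t.metric, 0.0) > t.max_value]


checkout = Service(
    name="checkout",
    dimensions={"deployment.environment": "prod", "cloud.region": "eu-west-1"},
    aggregations=["request_count", "error_rate", "p95_duration_ms"],
    thresholds=[Threshold("error_rate", 0.01), Threshold("p95_duration_ms", 800.0)],
)
print(checkout.violations({"error_rate": 0.03, "p95_duration_ms": 420.0}))
# ['error_rate']
```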

3) Correlation Zones indices

We acknowledge that researching an anomaly or an incident in a system is a difficult and time-consuming task.

That is why we embarked on the mission of pre-collecting and prioritizing these anomalies and incidents into a special research area we call correlation zones.

These correlation zones are deliberately constructed with the following agenda:

They are service-based zones
They consist of very fine-grained time frames
They collect and correlate all the necessary dimensional data in one place
They are pre-aggregated with the exact indication that led to the identified anomaly or incident
They are prioritized according to custom-defined rules
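
As a rough illustration of these properties, a correlation zone could be represented as a small record built from pre-aggregated time buckets. The `CorrelationZone` class, the `build_zones` helper and the one-minute bucket width are hypothetical; the RFC does not prescribe a document layout.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class CorrelationZone:
    service: str                 # zones are service based
    start: datetime              # fine-grained time frame
    end: datetime
    indication: str              # the measurement that led to the zone
    observed: float
    dimensions: dict = field(default_factory=dict)  # correlated dimensional data
    priority: int = 0            # filled in later by a custom rule


def build_zones(service, buckets, metric, limit, width=timedelta(minutes=1)):
    """Create one zone per time bucket whose value for `metric` exceeds `limit`.

    `buckets` is a list of (bucket_start, measurements, dimensions) tuples,
    for example the output of a pre-aggregation job.
    """
    zones = []
    for start, measurements, dimensions in buckets:
        value = measurements.get(metric)
        if value is not None and value > limit:
            zones.append(CorrelationZone(service, start, start + width,
                                         metric, value, dict(dimensions)))
    return zones


t0 = datetime(2024, 2, 11, 6, 0)
buckets = [
    (t0, {"error_rate": 0.002}, {"cloud.region": "eu-west-1"}),
    (t0 + timedelta(minutes=1), {"error_rate": 0.08}, {"cloud.region": "eu-west-1"}),
]
zones = build_zones("checkout", buckets, "error_rate", 0.01)
print(len(zones), zones[0].start.isoformat(), zones[0].indication)
```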

4) Priorities Based Rules

One key goal of this correlation zones framework is to reduce the SRE’s interaction with the Observability monitoring system to the minimal extent, limiting it to the cases where the engineer actually needs to intervene to solve a potential problem.
We are providing a new mechanism for the system to differentiate between the indications it finds and to score each one according to a rule.

These rules will determine the size of the working queue and the importance of each item, in accordance with the different services' SLOs and monitoring objectives.
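
One possible shape for such a rule is sketched below: each indication is scored by how far it overshoots its SLO boundary, weighted by the importance of the service, and only the top entries enter the working queue. The weights, the `Indication` class and the scoring formula are assumptions chosen for illustration, not the rule set proposed by this RFC.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Indication:
    service: str
    metric: str
    observed: float
    limit: float  # SLO boundary for this metric


# Hypothetical per-service weights expressing how important each service is.
SERVICE_WEIGHT = {"checkout": 3.0, "recommendations": 1.0}


def score(ind: Indication) -> float:
    """Service weight multiplied by the relative overshoot of the boundary."""
    overshoot = max(ind.observed / ind.limit - 1.0, 0.0)
    return SERVICE_WEIGHT.get(ind.service, 1.0) * overshoot


def working_queue(indications, max_size):
    """Return at most `max_size` indications, highest score first, dropping zero scores."""
    relevant = [(score(i), i) for i in indications if score(i) > 0]
    top = heapq.nlargest(max_size, relevant, key=lambda pair: pair[0])
    return [ind for _, ind in top]


found = [
    Indication("recommendations", "error_rate_pct", 5.0, 1.0),  # score 4.0
    Indication("checkout", "error_rate_pct", 3.0, 1.0),         # score 6.0
    Indication("checkout", "p95_duration_ms", 400.0, 800.0),    # within SLO, score 0
]
print([i.service for i in working_queue(found, max_size=2)])
```

Capping the queue with `max_size` is what keeps the daily list short; a different rule would only need to replace `score`.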

Correlation Zone Dashboard

The SRE will interact with these zones using a special dashboard that reflects the distinct correlation zone attributes and capabilities.

The Priority List - this is the list the engineer will engage with and it will reflect the daily working queue of pending investigations.
The specific Correlation investigation details dialog:
- Service information
- Time information
- Aggregation summary (Requests/Errors/Durations) charts
- Traces Summary and highlight (associated with this service) network graph
- Service Dimensions related metrics highlight (associated with this service) graph
  - Application Performance related
  - Logs attributes collected related

The Prioritized list of the daily investigation alerts / incidents

This is a Vega-based dashboard composed of multiple Vega visualization widgets that are correlated using the distinct correlation zone.

Once the SRE has chosen to start the investigation for a specific row (correlation zone), the Correlation investigation details view opens in a new dialog and the following dashboard is displayed:

Service information - including the network graph related to this service
Time information - including a zoom-out panel allowing to quickly zoom out
Aggregation summary (Requests/Errors/Durations) graphs
- including layover selection button to compare to another time frame
Trace gantt-chart showing the service related spans/traces
Dimensions related metrics highlight (associated with this service) graph
- Application Performance related - with layover selection button
- Logs attributes collected related - with layover selection button

In addition to the pre-built visualizations, the correlation dialog will include a query bar that allows an advanced user to query or join different indices and data sources for further investigation, or even to transform data, using the PPL query language.
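
As an illustration of the kind of query an advanced user might type into that bar, the sketch below assembles a PPL statement that counts requests per time bucket for one service and wraps it in a request for the `_plugins/_ppl` endpoint. The index name and the field names `serviceName` and `startTime` are assumptions about the trace mapping, and the helper functions are hypothetical.

```python
import json


def red_summary_query(index: str, service: str, span: str = "1h") -> str:
    """Build a PPL query counting requests per time bucket for one service."""
    safe_service = service.replace("'", "")  # keep the string literal well formed
    return (
        f"source = `{index}` "                       # backticks: name contains '-'
        f"| where serviceName = '{safe_service}' "
        f"| stats count() as requests by span(startTime, {span})"
    )


def ppl_request(query: str) -> dict:
    """Describe the HTTP call for the PPL endpoint; sending it is left to the caller."""
    return {"method": "POST", "path": "/_plugins/_ppl",
            "body": json.dumps({"query": query})}


query = red_summary_query("ss4o_traces-default-default", "checkout")
print(query)
print(ppl_request(query)["path"])
```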

##TODO - add image of the visualization

Building the Correlation Zones

Generating the correlation zone requires a set of steps that include different operations. These are described in detail in the next paragraphs.

Routing Observability Signals

The routing technique will use the OTEL collector ingestion pipeline as the mechanism and infrastructure for determining the target indices that will hold the different information.

*** TODO - add examples ***

Preprocessing Services Aggregated Data

The preprocessing step will take advantage of the Transformation API to prepare and pre-aggregate the relevant data, in order to optimize query-time performance and reduce storage and compute costs.
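
A possible request body for such a job is sketched below as a Python dictionary. The layout follows the general shape of an OpenSearch index transform job (groups plus aggregations) and should be checked against the documentation of the deployed version; the field names `serviceName`, `startTime`, `spanId` and `durationInNanos` are assumptions about the trace index mapping, and `red_transform` is a hypothetical helper.

```python
import json


def red_transform(source_index: str, target_index: str,
                  interval: str = "1h", selection: dict = None) -> dict:
    """Build the body of a transform job that pre-aggregates spans per service and hour."""
    return {
        "transform": {
            "enabled": True,
            "description": "Pre-aggregated request and duration metrics per service",
            "schedule": {"interval": {"period": 1, "unit": "Hours"}},
            "source_index": source_index,
            "target_index": target_index,
            # Optional filter, e.g. only error spans for a separate error-count job.
            "data_selection_query": selection or {"match_all": {}},
            "page_size": 1000,
            "groups": [
                {"terms": {"source_field": "serviceName", "target_field": "service"}},
                {"date_histogram": {"source_field": "startTime",
                                    "fixed_interval": interval,
                                    "target_field": "bucket_start"}},
            ],
            "aggregations": {
                "request_count": {"value_count": {"field": "spanId"}},
                "avg_duration_nanos": {"avg": {"field": "durationInNanos"}},
                "max_duration_nanos": {"max": {"field": "durationInNanos"}},
            },
        }
    }


body = red_transform("ss4o_traces-default-default", "services_red_hourly")
print(json.dumps(body, indent=2))
```

The body would be sent with a `PUT` to the transform endpoint under a job id chosen by the workflow; the HTTP call itself is omitted here.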

*** TODO - add examples ***

Trigger based data processing

The existing Alert triggering mechanism will provide the tool for defining the rule that triggers and executes the workflow of collecting data for the correlation zone.
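
The sketch below shows what such a rule could look like as a query-level monitor body. It follows the general shape accepted by the OpenSearch Alerting plugin, but the exact fields should be verified against the installed version; the index name, the `error_rate` field and the `slo_monitor` helper are assumptions made for illustration.

```python
def slo_monitor(name: str, index: str, metric_field: str,
                limit: float, every_minutes: int = 5) -> dict:
    """Build a monitor that fires when any document exceeds the SLO boundary."""
    return {
        "type": "monitor",
        "name": name,
        "monitor_type": "query_level_monitor",
        "enabled": True,
        "schedule": {"period": {"interval": every_minutes, "unit": "MINUTES"}},
        "inputs": [{
            "search": {
                "indices": [index],
                # Count the pre-aggregated buckets whose metric is above the limit.
                "query": {"size": 0,
                          "query": {"range": {metric_field: {"gt": limit}}}},
            }
        }],
        "triggers": [{
            "name": f"{name}-breach",
            "severity": "1",
            "condition": {"script": {
                "source": "ctx.results[0].hits.total.value > 0",
                "lang": "painless",
            }},
            # The action that starts the correlation zone collection would go here.
            "actions": [],
        }],
    }


monitor = slo_monitor("checkout-error-rate", "services_red_hourly", "error_rate", 0.01)
print(monitor["triggers"][0]["name"])
```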

*** TODO - add examples ***

Using Integrations / Visualizations as building blocks

The existing capability of dynamically loading different integrations will provide a mechanism that lets customers configure and optimize a workflow, customizing the construction and behaviour of the correlation zone.

*** TODO - add examples ***

We are currently building the fundamental blocks that will make up the different steps in the workflow. Feel free to comment on these elements and to request new ones if needed using the "new-issue/Integration-suggestion".

Workflow based integrations

The workflow that transforms and prepares the data for the investigation process takes multiple steps.
Each step is described using a specific integration, which describes the step's actions, parameters and API template.

These steps can be used separately, but they bring the utmost value when used as an opinionated workflow that delivers the value of the correlation zone infrastructure directly and intuitively to the SRE.

Some of the available steps:

Partitions step, using the _reindex API call to prepare the different trace partitions, including additional filter-by parameters
RED Services Aggregation step, using the _transform API call to aggregate the services hourly into optimal time buckets for RED metrics analysis
Services SLO threshold monitor step, using the _alert API call to prepare a threshold-based query that filters the problems in the services' behaviour
Correlation Zone Building Action, using the _reindex API call to collect the correlation zone data and gather the suspected incidents into one index
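
To show how steps might be chained, here is a minimal sketch in which each step carries the API it calls and a filled-in request template, and a runner executes them in order. The `Step` class, the `run` function and the index and field names are hypothetical; only the `_reindex` body shape (`source` with a query, `dest` with an index) is the standard one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    api: str    # API named by the RFC for this step
    body: dict  # request template with parameters filled in


def partition_step(source: str, dest: str, field_name: str, value: str) -> Step:
    """Partition step: copy only the documents matching one filter-by parameter."""
    body = {
        "source": {"index": source, "query": {"term": {field_name: value}}},
        "dest": {"index": dest},
    }
    return Step("partition", "_reindex", body)


def run(steps, send):
    """Execute steps in order; `send` is a caller-supplied function doing the HTTP call."""
    return [send(step.api, step.body) for step in steps]


# A stand-in for the HTTP client that only records what would be sent.
calls = []


def fake_send(api, body):
    calls.append(api)
    return {"acknowledged": True}


steps = [partition_step("otel-v1-apm-span", "ss4o_traces-eu_shop-default",
                        "tenant", "eu_shop")]
print(run(steps, fake_send), calls)
```

Keeping the HTTP client outside the step definitions is a design choice of this sketch: it lets the same step templates be validated or dry-run before they touch a cluster.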

Additional Context

This RFC represents ongoing work to improve and refactor the Observability capabilities of the OpenSearch codebase.


YANG-DB · 2024-02-15T20:29:54Z

Additional new Traces gantt chart vega visualization:

customizable
version independent
time & dimensionality filter aware

Screen.Recording.2024-02-14.at.2.01.01.PM.mov

RyanL1997 · 2024-02-20T00:38:37Z

Hi @YANG-DB, thanks for putting this together. I left some high-level, conceptual questions on some of the sections:

1) Partitioning Traces By Features / 2) Service Based Pivoting / 3) Correlation Zones indices + 4) Priorities Based Rules

For the above sections in "Data Preparation And Transformations", should all of this information be in the correlation zone dashboards? I'm asking because I was trying to understand the user flow of the correlation zone. Today, as you also mentioned in the RFC, the current user flow makes users go through data, info and panels that are irrelevant to the trace functionality. So the Correlation Zone Dashboard seems to resolve this by gathering all the trace-related info into one place, is that correct?

*For the architecture of the correlation zone graph.*

I understand that we are trying to gather all this information (Traces, Service-Map, Logs, and Metrics) into the Correlation Zone. However, I saw that some of it comes from the Live Stream, so will it be a challenge for us to bring this into the Vega visualization? Or can all of it have an hourly aggregation, as Traces do?

YANG-DB · 2024-02-21T00:44:55Z

The new observability services dashboard prototype which is vega based:

Screen.Recording.2024-02-20.at.2.46.20.PM.mov

YANG-DB added enhancement New feature or request untriaged labels Feb 11, 2024

YANG-DB self-assigned this Feb 11, 2024

YANG-DB added documentation Improvements or additions to documentation integration integration related content visualization a visual widget for a specific purpose and removed untriaged labels Feb 11, 2024

This was referenced Feb 12, 2024

[RFC]Observability Workflow API opensearch-project/observability#1805

Open

[FEATURE]Integration to allow generic Core API call opensearch-project/dashboards-observability#1442

Open
