SLOs API - Phase 1 #137323

Closed
19 tasks done
emma-raffenne opened this issue Jul 27, 2022 · 7 comments
Assignees
Labels
epic Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" v8.5.0 v8.6.0

@emma-raffenne
Contributor

emma-raffenne commented Jul 27, 2022

Context

Objective

The objective of this first phase is to provide an API for CRUD operations on SLOs that would support Service and Transaction types of entity, and the following APM metrics:

  • Latency (i.e., performance)
  • Transaction Failure / Errors (i.e., availability)

That should allow for tracking:

  • SLO %
  • Error rate (i.e., rate of “bad” values)
  • Error budget, consumption
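
As a rough illustration of how the three tracked values relate, here is a minimal sketch. The function and field names (`summarizeBudget`, `SliCounts`, `BudgetSummary`) are hypothetical and do not come from this issue; the arithmetic assumes the objective is given as a fraction strictly between 0 and 1.

```typescript
// Hypothetical helper types; nothing here is part of the planned API.
interface SliCounts {
  good: number; // number of "good" events (or good time slices)
  total: number; // total number of events (or time slices)
}

interface BudgetSummary {
  sli: number; // observed SLO %, as a fraction in [0, 1]
  errorRate: number; // rate of "bad" values
  errorBudget: number; // allowed rate of bad values: 1 - objective
  budgetConsumed: number; // share of the error budget already used
}

function summarizeBudget(counts: SliCounts, objective: number): BudgetSummary {
  if (counts.total <= 0) {
    throw new Error("total must be positive");
  }
  if (objective <= 0 || objective >= 1) {
    // An objective of exactly 1 would leave no error budget to divide by.
    throw new Error("objective must be strictly between 0 and 1");
  }
  const sli = counts.good / counts.total;
  const errorRate = 1 - sli;
  const errorBudget = 1 - objective;
  const budgetConsumed = errorRate / errorBudget;
  return { sli, errorRate, errorBudget, budgetConsumed };
}

// 9,990 good requests out of 10,000 against a 99% objective:
// the error rate is about 0.1%, which is roughly 10% of the 1% budget.
const summary = summarizeBudget({ good: 9990, total: 10000 }, 0.99);
console.log(summary);
```

A `budgetConsumed` value above 1 would mean the budget for the period is exhausted.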

Requirements

  1. Create/Edit an SLO with the following fields
  • Entity
  • Metric
  • Name for the destination index. We should also add some text about the required permissions for users to view roles.
  • Whether record/document count based, or time based
    • Note: count based is the ratio of the number of “good samples” to the total number of samples. Time based is the ratio of the number of time units in which the ratio of good samples to total samples was “good” to the total number of time units. The unit of time can be a 5 minute period, a 1 hour period and so on. The OpenSLO project defines it as
      • budgetingMethod enum(Occurrences | Timeslices), required field
      • Occurrences method uses a ratio of counts of good events and total count of the event.
      • Timeslices method uses a ratio of good time slices vs. total time slices in a budgeting period.
  • Threshold value (for latency)
  • SLO period, and whether rolling window or calendar based. Period lengths are N days / N weeks / N months / arbitrary period
  • Service Level Objective (in percentage) (Note: this field is optional. If users want, they can simply set up and monitor SLIs on their metrics without a goal in mind, and hence without alerting either)
  2. Delete SLO

  3. API must be Terraform, Ansible friendly (“Friendly” = an SRE must be able to write a Terraform based script that can successfully utilize our APIs to perform CRUD on SLOs without requiring a PhD)
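
To make the field list and the two budgeting methods more concrete, here is a minimal sketch of what an SLO definition and the two ratio calculations could look like. Every type, field, and function name below (`SloDefinition`, `timesliceTarget`, `computeSli`, and so on) is an assumption made for illustration, loosely following the requirement list and OpenSLO's `budgetingMethod`; it is not the schema this epic will ship. Treating a time slice without samples as good is also an assumption.

```typescript
// Every name here is an illustrative assumption, not the shipped Kibana schema.
type BudgetingMethod = "occurrences" | "timeslices";

interface SloDefinition {
  name: string;
  entity: { type: "service" | "transaction"; id: string };
  metric: "latency" | "availability";
  budgetingMethod: BudgetingMethod;
  latencyThresholdMs?: number; // only meaningful for the latency metric
  timesliceTarget?: number; // per-slice "good" ratio, timeslices only
  period: { length: number; unit: "d" | "w" | "M"; kind: "rolling" | "calendar" };
  objective?: number; // optional, as noted in the requirements
}

// Good and total sample counts for one unit of time (e.g. 5 minutes).
interface Slice {
  good: number;
  total: number;
}

// Occurrences: ratio of good events to all events over the whole period.
function occurrencesSli(slices: Slice[]): number {
  const good = slices.reduce((acc, s) => acc + s.good, 0);
  const total = slices.reduce((acc, s) => acc + s.total, 0);
  if (total === 0) throw new Error("no samples in the period");
  return good / total;
}

// Timeslices: ratio of good time slices to all time slices.
function timeslicesSli(slices: Slice[], target: number): number {
  if (slices.length === 0) throw new Error("no time slices in the period");
  // Assumption: a slice without any samples counts as good.
  const goodSlices = slices.filter(
    (s) => s.total === 0 || s.good / s.total >= target
  );
  return goodSlices.length / slices.length;
}

function computeSli(slo: SloDefinition, slices: Slice[]): number {
  if (slo.budgetingMethod === "occurrences") return occurrencesSli(slices);
  if (slo.timesliceTarget === undefined) {
    throw new Error("timesliceTarget is required for the timeslices method");
  }
  return timeslicesSli(slices, slo.timesliceTarget);
}

const exampleSlices: Slice[] = [
  { good: 100, total: 100 },
  { good: 90, total: 100 },
  { good: 99, total: 100 },
  { good: 50, total: 100 },
];

const exampleSlo: SloDefinition = {
  name: "checkout-availability",
  entity: { type: "service", id: "checkout" },
  metric: "availability",
  budgetingMethod: "timeslices",
  timesliceTarget: 0.95,
  period: { length: 30, unit: "d", kind: "rolling" },
  objective: 0.99,
};

// Two of the four slices meet the 95% per-slice target.
console.log(computeSli(exampleSlo, exampleSlices));
// 339 good samples out of 400 in total.
console.log(computeSli({ ...exampleSlo, budgetingMethod: "occurrences" }, exampleSlices));
```

The same data gives quite different numbers under the two methods (0.5 versus roughly 0.85 here), which is why the budgeting method needs to be an explicit field of the definition.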

Implementation

For version 8.5.0:

For version 8.6.0:

@emma-raffenne emma-raffenne added Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" v8.5.0 labels Jul 27, 2022
@elasticmachine
Contributor

Pinging @elastic/actionable-observability (Team: Actionable Observability)

@fkanout
Contributor

fkanout commented Aug 18, 2022

I want to share some thoughts and questions that could help create the technical issue, and also share knowledge with the engineering team.

Important note: This epic (Phase 1) is ONLY about defining the SLO (no SLIs, no link between SLOs and SLIs, no alerts based on a specified threshold, etc.)

I'm highlighting this note to make it clear that in this phase we are ONLY doing CRUD on SLOs (docs in an index), not building the entire/final SLO solution (which is what I thought at the beginning, when I started thinking about how/who is going to call our API, etc.)

Thoughts

  1. Observability App/Plugin has a server/routes folder where the SLOs API will most likely live. Today we have only one endpoint
  2. The index mapping and settings process should be done before any CRUD operations. Look at @simianhacker's POC for more info.
  3. [IMPORTANT]: SLOs route access management.
  4. Need to acquire knowledge about the _transform API, which is essential to downsample the data. For more info, check this doc
  5. SLOs will be stored as Kibana saved objects, so it is worth checking the saved objects docs, as we are going to use the Saved Objects API

Questions

  1. Is the Observability plugin instantiation the right place to run the above 2nd point (index mapping)?
  2. Do we need to hide this feature behind a feature flag?
  3. Is the usage of the _transform API required for this phase, as we are only defining SLOs?
  4. What are the required role(s) for a given user to access the SLOs endpoints?
  5. What does “API must be Terraform, Ansible friendly” mean? Is the SLOs API not a REST or REST-like API?

@simianhacker
Member

  1. Is the Observability plugin instantiation the right place to run the above 2nd point (index mapping)?

Yes, the implementation for this project will happen under the observability plugin.

  2. Do we need to hide this feature behind a feature flag?

Yes, this will be a multiple release project so we will need to hide the feature under a flag.

  3. Is the usage of the _transform API required for this phase, as we are only defining SLOs?

In this phase I would like to achieve the following:

  1. Create SLO Definition
  2. Ensure index patterns are installed (should be triggered when the first SLO definition is created)
  3. Create the corresponding transform based on the definition (and start it)

  4. What are the required role(s) for a given user to access the SLOs endpoints?

At a minimum, for read access they will need "read" permissions on the SLO data indices (created by the transform). For the SLO detail page, to see the status of the transform, they would need the monitor_transform permission. For creating SLOs they would need manage_transform, view_index_metadata, create_index, index, manage, and read for the destination indices.
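
The permission lists above could be captured as data and checked before an operation runs. This is only a sketch: the operation names, `REQUIRED_PRIVILEGES`, and `missingPrivileges` are hypothetical, the privilege names are transcribed from the comment as-is, and the sketch does not distinguish cluster-level from index-level privileges or talk to Elasticsearch at all.

```typescript
// Hypothetical operation names; only the privilege strings come from the comment above.
type SloOperation = "read" | "viewTransformStatus" | "create";

const REQUIRED_PRIVILEGES: Record<SloOperation, readonly string[]> = {
  // Reading SLO data from the indices created by the transform.
  read: ["read"],
  // Seeing the status of the transform on the SLO detail page.
  viewTransformStatus: ["monitor_transform"],
  // Creating an SLO, including its transform and destination indices.
  create: [
    "manage_transform",
    "view_index_metadata",
    "create_index",
    "index",
    "manage",
    "read",
  ],
};

// Returns the privileges the user still lacks for the given operation.
function missingPrivileges(
  operation: SloOperation,
  granted: ReadonlySet<string>
): string[] {
  return REQUIRED_PRIVILEGES[operation].filter((p) => !granted.has(p));
}

const viewer = new Set(["read", "monitor_transform"]);
console.log(missingPrivileges("read", viewer)); // prints []
console.log(missingPrivileges("create", viewer));
```

An empty result means the operation can proceed; a non-empty one could be returned to the caller so the missing privileges are visible in the error.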

  5. What does “API must be Terraform, Ansible friendly” mean? Is the SLOs API not a REST or REST-like API?

There isn't a great "As Code" story for using Kibana APIs today. The best we can do is create a REST endpoint, and then work with the Kibana team to make using those endpoints easier.

@vinaychandrasekhar

On #5 above, one additional note is that not all REST APIs are automatically Ansible and Terraform friendly. Part of the effort should be to ensure something easily written in those products can interact with our SLO capability via those REST APIs.

@simianhacker
Member

@vinaychandrasekhar Our best approach is to use Ansible to test the API during development.

@fkanout
Contributor

fkanout commented Aug 24, 2022

Yes, the implementation for this project will happen under the observability plugin.

@simianhacker, I should have been more precise: I know it's under the observability plugin 😄. My question is where under the observability plugin?
e.g. in the plugin file? Once the plugin is started? Or somewhere else?

@simianhacker
Member

@fkanout I was talking to @kdelemme about this and I would prefer we had a function called setupSLOResources that we call at the beginning of the API request that creates the SLO. The first time it's called, it should check whether the resources exist and install them (ILM Policy, Index Mappings, Index Settings, and Index Template) only when missing; if they exist, it should just return without taking action.
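
A minimal sketch of the idempotent behaviour described here could look like the following. Only the name `setupSLOResources` comes from the comment; the `ResourceInstaller` interface, the in-memory store, and the resource names are assumptions made so the example runs on its own, and no real Elasticsearch client calls are shown. The sketch checks each resource separately, which is one possible reading of "install when missing".

```typescript
// Hypothetical abstraction over whatever client would install a resource.
interface ResourceInstaller {
  name: string;
  exists(): Promise<boolean>;
  install(): Promise<void>;
}

// Installs only the missing resources and returns the names it installed.
// Calling it again when everything exists takes no action.
async function setupSLOResources(
  installers: ResourceInstaller[]
): Promise<string[]> {
  const installed: string[] = [];
  // Sequential on purpose: later resources may depend on earlier ones.
  for (const resource of installers) {
    if (await resource.exists()) continue;
    await resource.install();
    installed.push(resource.name);
  }
  return installed;
}

// In-memory stand-in used only to demonstrate the behaviour.
function inMemoryInstaller(name: string, store: Set<string>): ResourceInstaller {
  return {
    name,
    exists: async () => store.has(name),
    install: async () => {
      store.add(name);
    },
  };
}

async function demo(): Promise<void> {
  const store = new Set<string>();
  const installers = [
    "ilm-policy",
    "index-mappings",
    "index-settings",
    "index-template",
  ].map((name) => inMemoryInstaller(name, store));

  const first = await setupSLOResources(installers);
  const second = await setupSLOResources(installers);
  console.log(first.length, second.length); // prints "4 0"
}

demo().catch(console.error);
```

Because the function is safe to call repeatedly, the create-SLO route can call it on every request without tracking whether setup has already happened.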

@kdelemme kdelemme self-assigned this Sep 13, 2022
6 participants