[Actionable Observability] [SPIKE] Investigate SLO definition #139213

Closed
Tracked by #137323
kdelemme opened this issue Aug 22, 2022 · 5 comments
Labels
Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" v8.5.0

Comments

@kdelemme
Contributor

kdelemme commented Aug 22, 2022

Epic: #137323
RFC: https://docs.google.com/document/d/1-9w1WW9HoOCG7I4WAtTFi1Hfnh7BT11dctLVOQs7iwc/edit?usp=sharing

📝 Summary

We want to define how the SLO definition will be stored as a Kibana Saved Object. This definition will later be used to generate a transform that aggregates the data.

As part of this epic, we want to focus on two types of SLOs:

  • Latency, e.g. 99.5% of requests served in under 300ms
  • Availability, e.g. 99% of requests successful
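
To make the two SLO types concrete, here is a minimal sketch of what a definition could look like in TypeScript. All type and field names (`SloDefinition`, `SloIndicator`, `thresholdMs`, `target`) are hypothetical and do not come from the RFC or from Kibana; the treatment of the no-traffic case is also an assumption.

```typescript
// Hypothetical shapes: none of these names come from the RFC or from Kibana.
type SloIndicator =
  | { type: 'latency'; thresholdMs: number } // good = request served under thresholdMs
  | { type: 'availability' }; // good = successful request

interface SloDefinition {
  id: string;
  indicator: SloIndicator;
  target: number; // objective as a fraction, e.g. 0.995 for 99.5%
}

const latencySlo: SloDefinition = {
  id: 'uuid-slo-latency',
  indicator: { type: 'latency', thresholdMs: 300 },
  target: 0.995,
};

const availabilitySlo: SloDefinition = {
  id: 'uuid-slo-availability',
  indicator: { type: 'availability' },
  target: 0.99,
};

// The objective is met when the observed good/total ratio reaches the target.
function isObjectiveMet(slo: SloDefinition, good: number, total: number): boolean {
  if (total === 0) return true; // assumption: no traffic means nothing was violated
  return good / total >= slo.target;
}

console.log(isObjectiveMet(latencySlo, 996, 1000)); // 99.6% against a 99.5% target
console.log(isObjectiveMet(availabilitySlo, 98, 100)); // 98% against a 99% target
```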

🧪 Experimentation

Run Kibana and ES locally, and then follow the instructions in this repository to start generating APM data: https://github.com/fkanout/elastic-apm-api-alerts-generator

After a while, you'll notice some data under the o11y-app:
image

Now you need to create the following index mappings and settings that the rollup index will use.

Index mappings & settings
PUT _ilm/policy/slo-data-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}

PUT _component_template/slo-data-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date",
          "format": "date_optional_time||epoch_millis"
        },
        "slo": {
          "properties": {
            "id": {
              "type": "keyword",
              "ignore_above": 256
            },
            "numerator": {
              "type": "long"
            },
            "denominator": {
              "type": "long"
            },
            "context": {
              "properties": {
                "labels": {
                  "properties": {
                    "groupId": {
                      "type": "keyword"
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "_meta": {
    "description": "Mappings for SLO data"
  }
}


PUT _component_template/slo-data-settings
{
  "template": {
    "settings": {
      "index.lifecycle.name": "slo-data-policy"
    }
  },
  "_meta": {
    "description": "Settings for ILM"
  }
}

PUT _index_template/slo-data-template
{
  "index_patterns": ["slo-data-*"],
  "composed_of": [ "slo-data-mappings", "slo-data-settings" ],
  "priority": 500,
  "_meta": {
    "description": "Template for SLO rollup data"
  }
}

We can now start experimenting with aggregations and creating transforms for the two SLOs:

Availability SLO

💡 This SLO uses APM metrics

This creates one bucket per transaction.name (request endpoint) and service.name, with good defined as the number of requests whose HTTP status code is 2xx, 3xx, or 4xx, and total defined as the total number of requests.
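
The good/total definition can be sketched in plain TypeScript as follows. The `TransactionMetricDoc` shape and its `requestCount` field are hypothetical simplifications of an APM metric document. The sketch counts requests on both sides of the ratio. As far as I understand, that is worth double-checking against the aggregation below, where `good>_count` is a count of matching documents while `value_count` on a histogram field sums the counts stored in the histograms.

```typescript
// Hypothetical, simplified view of one APM metric document.
interface TransactionMetricDoc {
  result: string; // transaction.result, e.g. "HTTP 2xx"
  requestCount: number; // number of requests summarized by this document
}

const GOOD_RESULTS = ['HTTP 2xx', 'HTTP 3xx', 'HTTP 4xx'];

function availability(docs: TransactionMetricDoc[]): {
  good: number;
  total: number;
  ratio: number | null;
} {
  let good = 0;
  let total = 0;
  for (const doc of docs) {
    total += doc.requestCount;
    if (GOOD_RESULTS.indexOf(doc.result) !== -1) {
      good += doc.requestCount;
    }
  }
  // With no requests the ratio is undefined, so return null instead of NaN.
  return { good, total, ratio: total === 0 ? null : good / total };
}

const sampleDocs: TransactionMetricDoc[] = [
  { result: 'HTTP 2xx', requestCount: 90 },
  { result: 'HTTP 4xx', requestCount: 5 },
  { result: 'HTTP 5xx', requestCount: 5 },
];
const availabilityResult = availability(sampleDocs);
console.log(availabilityResult.good, availabilityResult.total, availabilityResult.ratio);
```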

Search apm-metrics with aggregation
POST metrics-apm*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "transaction.root": true
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-1h",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "transactions": {
      "composite": {
        "sources": [
          {
            "transaction.name": {
              "terms": {
                "field": "transaction.name"
              }
            }
          },
          {
            "service.name": {
              "terms": {
                "field": "service.name"
              }
            }
          }
        ]
      },
      "aggs": {
        "good": {
          "filter": {
            "bool": {
              "should": [
                {
                  "match": {
                    "transaction.result": "HTTP 2xx"
                  }
                },
                {
                  "match": {
                    "transaction.result": "HTTP 3xx"
                  }
                },
                {
                  "match": {
                    "transaction.result": "HTTP 4xx"
                  }
                }
              ]
            }
          }
        },
        "total": {
          "value_count": {
            "field": "transaction.duration.histogram"
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "good": "good>_count",
              "total": "total"
            },
            "script": "params.good / params.total"
          }
        }
      }
    }
  }
}
Transform
PUT _transform/apm-transaction-availability-example
{
  "source": {
    "index": "metrics-apm*",
    "runtime_mappings": {
      "slo.id": {
        "type": "keyword",
        "script": {
          "source": "emit('uuid-slo-availability')"
        }
      }
    }
  },
  "frequency": "1m",
  "dest": {
    "index": "slo-data-default"
  },
  "settings": {
    "deduce_mappings": false
  },
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "slo.context.transaction.name": {
        "terms": {
          "field": "transaction.name"
        }
      },
      "slo.context.service.name": {
        "terms": {
          "field": "service.name"
        }
      },
      "slo.id": {
        "terms": {
          "field": "slo.id"
        }
      },
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1m"
        }
      }
    },
    "aggregations": {
      "slo.numerator": {
        "filter": {
          "bool": {
            "should": [
              {
                "match": {
                  "transaction.result": "HTTP 2xx"
                }
              },
              {
                "match": {
                  "transaction.result": "HTTP 3xx"
                }
              },
              {
                "match": {
                  "transaction.result": "HTTP 4xx"
                }
              }
            ]
          }
        }
      },
      "slo.denominator": {
        "value_count": {
          "field": "transaction.duration.histogram"
        }
      }
    }
  }
}

POST _transform/apm-transaction-availability-example/_start
POST _transform/apm-transaction-availability-example/_stop
DELETE _transform/apm-transaction-availability-example
DELETE slo-data-default


POST slo-data-default/_search
{
  "query": {
    "match": {
      "slo.id": "uuid-slo-availability"
    }
  }
}

Latency SLO

💡 This SLO uses APM metrics

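This creates one bucket per transaction.name (request endpoint), with good defined as the number of requests with a latency below 3000ms (written as 3000000 in the requests below, since the range bound is expressed in microseconds), and total defined as the total number of requests.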
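
The requests below reference the range bucket through its generated key, `*-3000000.0`, which depends on how the bound is formatted. A possible alternative, sketched here under the assumption that the duration histogram is in microseconds, is to give the range an explicit `key` so that `buckets_path` stays stable when the threshold changes. The helper names `msToMicros` and `latencyNumeratorAggs` and the key `good` are hypothetical.

```typescript
// Assumption: transaction.duration.histogram stores durations in microseconds.
function msToMicros(ms: number): number {
  return ms * 1000;
}

// Builds the two numerator aggregations for a latency threshold. The explicit
// "key" names the range bucket, so buckets_path does not depend on the
// generated key (such as "*-3000000.0").
function latencyNumeratorAggs(thresholdMs: number) {
  return {
    _numerator: {
      range: {
        field: 'transaction.duration.histogram',
        ranges: [{ key: 'good', to: msToMicros(thresholdMs) }],
      },
    },
    'slo.numerator': {
      bucket_script: {
        buckets_path: { numerator: "_numerator['good']>_count" },
        script: 'params.numerator',
      },
    },
  };
}

console.log(JSON.stringify(latencyNumeratorAggs(3000), null, 2));
```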

Search apm metrics with aggregation
POST metrics-apm*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "transaction.root": true
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-1h",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "transactions": {
      "composite": {
        "sources": [
          {
            "transaction.name": {
              "terms": {
                "field": "transaction.name"
              }
            }
          }
        ]
      },
      "aggs": {
        "good": {
          "range": {
            "field": "transaction.duration.histogram",
            "ranges": [
              {
                "to": 3000000
              }
            ]
          }
        },
        "total": {
          "value_count": {
            "field": "transaction.duration.histogram"
          }
        },
        "ratio": {
          "bucket_script": {
            "buckets_path": {
              "good": "good['*-3000000.0']>_count",
              "total": "total"
            },
            "script": "params.good / params.total"
          }
        }
      }
    }
  }
}
Transform

PUT _transform/apm-transaction-latency-example
{
  "source": {
    "index": "metrics-apm*",
    "runtime_mappings": {
      "slo.id": {
        "type": "keyword",
        "script": {
          "source": "emit('uuid-slo-latency')"
        }
      }
    }
  },
  "frequency": "1m",
  "dest": {
    "index": "slo-data-default"
  },
  "settings": {
    "deduce_mappings": false
  },
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "slo.context.transaction.name": {
        "terms": {
          "field": "transaction.name"
        }
      },
      "slo.context.service.name": {
        "terms": {
          "field": "service.name"
        }
      },
      "slo.id": {
        "terms": {
          "field": "slo.id"
        }
      },
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1m"
        }
      }
    },
    "aggregations": {
      "_numerator": {
        "range": {
          "field": "transaction.duration.histogram",
          "ranges": [
            {
              "to": 3000000
            }
          ]
        }
      },
      "slo.numerator": {
        "bucket_script": {
          "buckets_path": {
            "numerator": "_numerator['*-3000000.0']>_count"
          },
          "script": "params.numerator"
        }
      },
      "slo.denominator": {
        "value_count": {
          "field": "transaction.duration.histogram"
        }
      }
    }
  }
}

POST _transform/apm-transaction-latency-example/_start
POST _transform/apm-transaction-latency-example/_stop
DELETE _transform/apm-transaction-latency-example
DELETE slo-data-default


POST slo-data-default/_search
{
  "query": {
    "match": {
      "slo.id": "uuid-slo-latency"
    }
  }
}

Latency SLO for "o11y-app" service and "GET /slow" transaction

Transform

PUT _transform/apm-transaction-latency-get-slow-example
{
  "source": {
    "index": "metrics-apm*",
    "runtime_mappings": {
      "slo.id": {
        "type": "keyword",
        "script": {
          "source": "emit('uuid-slo-latency-get-slow')"
        }
      }
    },
    "query": {
      "bool": {
        "filter": [
          {
            "match": {
              "transaction.root": true
            }
          },
          {
            "match": {
              "service.name": "o11y-app"
            }
          },
          {
            "match": {
              "transaction.name": "GET /slow"
            }
          }
        ]
      }
    }
  },
  "frequency": "1m",
  "dest": {
    "index": "slo-data-default"
  },
  "settings": {
    "deduce_mappings": false
  },
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "slo.context.transaction.name": {
        "terms": {
          "field": "transaction.name"
        }
      },
      "slo.context.service.name": {
        "terms": {
          "field": "service.name"
        }
      },
      "slo.id": {
        "terms": {
          "field": "slo.id"
        }
      },
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1m"
        }
      }
    },
    "aggregations": {
      "_numerator": {
        "range": {
          "field": "transaction.duration.histogram",
          "ranges": [
            {
              "to": 3000000
            }
          ]
        }
      },
      "slo.numerator": {
        "bucket_script": {
          "buckets_path": {
            "numerator": "_numerator['*-3000000.0']>_count"
          },
          "script": "params.numerator"
        }
      },
      "slo.denominator": {
        "value_count": {
          "field": "transaction.duration.histogram"
        }
      }
    }
  }
}

POST _transform/apm-transaction-latency-get-slow-example/_start
POST _transform/apm-transaction-latency-get-slow-example/_stop
DELETE _transform/apm-transaction-latency-get-slow-example


POST slo-data-default/_search
{
  "query": {
    "match": {
      "slo.id": "uuid-slo-latency-get-slow"
    }
  }
}
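
The three transforms above differ only in the slo.id, the optional source query, and the aggregations. Since the goal is to generate the transform from the SLO definition, the shared parts could be produced by a small builder. This is only a sketch: `buildTransform` and `TransformParams` are hypothetical names, and the fixed values are copied from the examples above.

```typescript
// Hypothetical inputs; the fixed values mirror the three transforms above.
interface TransformParams {
  sloId: string; // assumed to contain no single quotes, since it is inlined in a script
  query?: Record<string, unknown>; // optional source filter
  aggregations: Record<string, unknown>; // slo.numerator and slo.denominator
}

function buildTransform(params: TransformParams) {
  const terms = (field: string) => ({ terms: { field } });
  return {
    source: {
      index: 'metrics-apm*',
      runtime_mappings: {
        'slo.id': { type: 'keyword', script: { source: `emit('${params.sloId}')` } },
      },
      // Only add "query" when a filter was provided.
      ...(params.query ? { query: params.query } : {}),
    },
    frequency: '1m',
    dest: { index: 'slo-data-default' },
    settings: { deduce_mappings: false },
    sync: { time: { field: '@timestamp', delay: '60s' } },
    pivot: {
      group_by: {
        'slo.context.transaction.name': terms('transaction.name'),
        'slo.context.service.name': terms('service.name'),
        'slo.id': terms('slo.id'),
        '@timestamp': { date_histogram: { field: '@timestamp', calendar_interval: '1m' } },
      },
      aggregations: params.aggregations,
    },
  };
}

const getSlowTransform = buildTransform({
  sloId: 'uuid-slo-latency-get-slow',
  query: { bool: { filter: [{ match: { 'transaction.name': 'GET /slow' } }] } },
  aggregations: {
    'slo.denominator': { value_count: { field: 'transaction.duration.histogram' } },
  },
});
console.log(JSON.stringify(getSlowTransform, null, 2));
```

The resulting object would be sent as the body of PUT _transform/<transform-id>.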

Visualization

We can then visualize the SLOs with Lens. This visualization aggregates the metrics per hour; in a real-life example we might use 1d, 7d, or 30d instead.
We could also visualize the SLO per transaction.name, e.g. latency SLO > GET /slow or availability SLO > GET /flaky.
image
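
For reference, the hourly aggregation can be sketched outside of Lens as follows. The `SloRollupDoc` shape is a hypothetical flattening of the rollup documents, keeping only the fields used here. The sketch sums numerators and denominators per hour before dividing, rather than averaging the per-minute ratios, so that busy minutes weigh more than quiet ones.

```typescript
// Hypothetical flattened view of one document written by the transforms.
interface SloRollupDoc {
  timestamp: string; // "@timestamp" as an ISO 8601 UTC string
  numerator: number; // slo.numerator
  denominator: number; // slo.denominator
}

// Sums numerator and denominator per UTC hour, then divides once per bucket.
function sliPerHour(docs: SloRollupDoc[]): Record<string, number> {
  const sums: Record<string, { numerator: number; denominator: number }> = {};
  for (const doc of docs) {
    const hour = doc.timestamp.slice(0, 13); // e.g. "2022-08-22T10"
    const bucket = sums[hour] || { numerator: 0, denominator: 0 };
    bucket.numerator += doc.numerator;
    bucket.denominator += doc.denominator;
    sums[hour] = bucket;
  }
  const sli: Record<string, number> = {};
  for (const hour of Object.keys(sums)) {
    // Skip hours without traffic, where the ratio is undefined.
    if (sums[hour].denominator > 0) {
      sli[hour] = sums[hour].numerator / sums[hour].denominator;
    }
  }
  return sli;
}

const rollups: SloRollupDoc[] = [
  { timestamp: '2022-08-22T10:00:00.000Z', numerator: 95, denominator: 100 },
  { timestamp: '2022-08-22T10:30:00.000Z', numerator: 99, denominator: 100 },
  { timestamp: '2022-08-22T11:05:00.000Z', numerator: 50, denominator: 50 },
];
const hourlySli = sliPerHour(rollups);
console.log(hourlySli);
```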

❓ Questions

  1. When an SLO is edited, should we remove the transform as well as the transformed data from the destination index? If we keep the previously rolled-up data, we won't be able to differentiate it from the newly added data.
@kdelemme kdelemme self-assigned this Aug 22, 2022
@elasticmachine
Contributor

Pinging @elastic/actionable-observability (Team: Actionable Observability)

@emma-raffenne emma-raffenne mentioned this issue Aug 22, 2022
@simianhacker
Member

Defining and Registering a Saved Object in Kibana

Here is the Kibana Developer Guide for Saved Objects: https://docs.elastic.dev/kibana-dev-docs/key-concepts/saved-objects-intro

Here is a complete tutorial on defining a Saved Object and registering it: https://docs.elastic.dev/kibana-dev-docs/tutorials/saved-objects

Here is an example of a Saved Object type from the Infrastructure Monitoring UI

import type { SavedObjectsType } from '@kbn/core/server';

export const metricsExplorerViewSavedObjectName = 'metrics-explorer-view';

export const metricsExplorerViewSavedObjectType: SavedObjectsType = {
  name: metricsExplorerViewSavedObjectName,
  hidden: false,
  namespaceType: 'single',
  management: {
    importableAndExportable: true,
  },
  mappings: {
    dynamic: false,
    properties: {},
  },
};

Here is an example of registering the type with the Saved Objects service

// register saved object types
core.savedObjects.registerType(infraSourceConfigurationSavedObjectType);
core.savedObjects.registerType(metricsExplorerViewSavedObjectType);
core.savedObjects.registerType(inventoryViewSavedObjectType);
core.savedObjects.registerType(logViewSavedObjectType);
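
Following the same pattern, an SLO saved object type might look roughly like the sketch below. The type name `slo` and the mapped fields are hypothetical. `SavedObjectsTypeSketch` is a local stand-in declared only so that the example compiles outside of Kibana; in a plugin, the real `SavedObjectsType` would be imported from '@kbn/core/server' as in the example above.

```typescript
// Local stand-in for the parts of SavedObjectsType used here (not the Kibana type).
interface SavedObjectsTypeSketch {
  name: string;
  hidden: boolean;
  namespaceType: 'single' | 'multiple' | 'agnostic';
  mappings: {
    dynamic: false;
    properties: Record<string, { type: string }>;
  };
}

// Hypothetical type name and fields.
const sloSavedObjectName = 'slo';

const sloSavedObjectType: SavedObjectsTypeSketch = {
  name: sloSavedObjectName,
  hidden: false,
  namespaceType: 'single',
  mappings: {
    dynamic: false,
    // Only fields that need to be searched or filtered on are mapped.
    properties: {
      sloName: { type: 'text' },
      indicatorType: { type: 'keyword' },
    },
  },
};

console.log(sloSavedObjectType.name);
```

Registration would then follow the `core.savedObjects.registerType(...)` calls shown above.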

@simianhacker
Member

Here is where the routes for Observability are defined: https://github.com/elastic/kibana/tree/main/x-pack/plugins/observability/server/routes

@simianhacker
Member

After our discussion with the transform team, I think we should also use this pipeline to create monthly indices. We will need to modify the index_name_prefix to match the current Kibana space (default in this example).

PUT _ingest/pipeline/slo-monthly-index-default
{
  "description": "Monthly date-time index naming for SLO data",
  "processors" : [
    {
      "date_index_name" : {
        "field" : "@timestamp",
        "index_name_prefix" : "slo-data-default-",
        "date_rounding" : "M"
      }
    }
  ]
}

We will also need to add the "pipeline": "slo-monthly-index-default" attribute to the transform's dest property.
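
As a rough illustration of the effect, the sketch below computes the index name a document would end up in, assuming the processor's default index_name_format (yyyy-MM-dd) and UTC timezone, and builds a dest block per Kibana space. The helper names `monthlyIndexName` and `destWithPipeline` are hypothetical, and deriving the pipeline name from the space is an assumption based on the example above.

```typescript
// Index name expected from month rounding ("M") with the default
// yyyy-MM-dd name format: the first day of the document's month, in UTC.
function monthlyIndexName(prefix: string, timestamp: string): string {
  const date = new Date(timestamp);
  const year = date.getUTCFullYear();
  const month = date.getUTCMonth() + 1;
  const mm = month < 10 ? `0${month}` : `${month}`;
  return `${prefix}${year}-${mm}-01`;
}

// dest block of a transform, with the pipeline attribute added.
function destWithPipeline(space: string) {
  return {
    index: `slo-data-${space}`,
    pipeline: `slo-monthly-index-${space}`,
  };
}

console.log(monthlyIndexName('slo-data-default-', '2022-08-22T10:15:00Z'));
// slo-data-default-2022-08-01
console.log(destWithPipeline('default'));
```

These names still match the slo-data-* pattern of the index template defined earlier.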

@kdelemme
Contributor Author

kdelemme commented Sep 6, 2022

This spike has been completed and implementation has started.

@kdelemme kdelemme closed this as completed Sep 6, 2022