Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AO][SERVERLESS] Fix Custom Threshold rule tests for Serverless #167644

Merged

Conversation

fkanout
Copy link
Contributor

@fkanout fkanout commented Sep 29, 2023

Summary

Revert the revert for #166942 💪🏻

Fixes #165569
Fixes #166617
Fixes #166618
Fixes #166619
Fixes #166620

@apmmachine
Copy link
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • /oblt-deploy-serverless : Deploy a serverless Kibana instance using the Observability test environments.
  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@fkanout fkanout self-assigned this Sep 29, 2023
@fkanout fkanout added backport:skip This commit does not require backporting Team: Actionable Observability - DEPRECATED For Observability Alerting and SLOs use "Team:obs-ux-management", for AIops "Team:obs-knowledge" v8.11.0 labels Sep 29, 2023
@fkanout fkanout added the release_note:skip Skip the PR/issue when compiling release notes label Sep 29, 2023
@fkanout fkanout marked this pull request as ready for review October 2, 2023 09:06
@fkanout fkanout requested a review from a team as a code owner October 2, 2023 09:06
@elasticmachine
Copy link
Contributor

Pinging @elastic/actionable-observability (Team: Actionable Observability)

@@ -51,10 +48,6 @@ export default function ({ getService }: FtrProviderContext) {
index: CUSTOM_THRESHOLD_RULE_ALERT_INDEX,
query: { term: { 'kibana.alert.rule.uuid': ruleId } },
});
await esClient.deleteByQuery({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason to remove cleaning up the event log index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maryam-saeidi, good question. Actually, in this use case, a.k.a "no_data," we don't ingest any data; hence, there is nothing to delete

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, the event log index is related to the result of executing a rule, not the ingested data.

You can check the test server with and without cleaning; you'll see without it, you still have some logs in this index.

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maryam-saeidi thank you for pointing that out. There is a race condition on this index as we use the wildcard option *, which creates inconsistent results and sometimes throws an error.

"cause":{"type":"version_conflict_engine_exception","reason":"[S_D574oB8qizAIQCLOib]: version conflict, required seqNo [1], primary term [1]. but no document was found","index_uuid":"cZ1A-w-KS1qY34KrbPGHPA","shard":"0","index":".ds-.kibana-event-log-ds-2023.10.02-000001"},"status":409},{"index":".ds-.kibana-event-log-ds-2023.10.02-000001","id":"TPD574oB8qizAIQCMOiB","cause":{"type":"version_conflict_engine_exception","reason":"[TPD574oB8qizAIQCMOiB]: version conflict, required seqNo [2], primary term [1]. but no document was found","index_uuid":"cZ1A-w-KS1qY34KrbPGHPA","shard":"0","index":".ds-.kibana-event-log-ds-2023.10.02-000001"},"status":409}

So, I changed the delete term to use the rule.id instead of the alert.consumer to be more accurate and avoid this issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, how does changing the field solve the issue?
The error says there is no data based on the query, so it should be the same case for any field (including rule.id), isn't it?

Copy link
Contributor Author

@fkanout fkanout Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maryam-saeidi No, the error message is before I change it to rule.id

The error says there is no data based on the query,

This is why I removed the deleteByQuery in the beginning, but then I realized there is a racing condition.
If you look at the error type version_conflict_engine_exception. It means there are two operations at the same.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why I removed the deleteByQuery in the beginning, but then I realized there is a racing condition.
If you look at the error type version_conflict_engine_exception. It means there are two operations at the same.

I still don't understand, either the document exists in the index, which should contain both fields (rule.id and kibana.alert.rule.consumer), or the document does not exist. How does checking one field instead of the other help?
In which case, do we have a document with rule.id that does not contain kibana.alert.rule.consumer?

Copy link
Contributor Author

@fkanout fkanout Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maryam-saeidi, the test files run in parallel. So we could have two test files trying to perform CRUD operations on the same index. As the error message I shared mentioned, version conflict, required seqNo [1], primary term [1].. e.g. A rule is running and still generating events while another after from another file tries to delete everything using the wildcard *. (race condition)

So when I scoped the deletion i.e. every rule/every test file deleted ONLY its generated event using rule.id we
no longer have any issue. We should have relied on the rule.id instead of consumer from the beginning.

Is that make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, super clear, thanks.

Then can we bring back cleaning the event log index with the rule.id condition? Otherwise, after this test, we still have data in the event log index.

await esClient.deleteByQuery({
  index: '.kibana-event-log-*',
  query: { term: { 'rule.id': ruleId } },
});

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I already got it back!

@fkanout
Copy link
Contributor Author

fkanout commented Oct 2, 2023

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @fkanout

@fkanout fkanout enabled auto-merge (squash) October 2, 2023 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment