
[Obs AI Assistant] knowledge base integration tests #189000

Merged
merged 27 commits into elastic:main on Aug 5, 2024

Conversation

@neptunian (Contributor) commented Jul 23, 2024

Closes #188999

  • integration tests for the knowledge base API
  • adds a new config field, modelId, for internal use, to override the ELSER model id (a config sketch follows this list)
  • refactors knowledgeBaseService.setup() to fix a bug where, if the model failed to install when calling ml.putTrainedModel, we would get stuck polling and retrying the install. We assumed the first error thrown, when the model does not yet exist, would only happen once, after which we would return true or false and poll for whether installation had finished. But the installation itself could fail, causing getTrainedModelsStats to keep throwing while we kept trying to install the model. Now the user immediately gets an error if the model fails to install, and no polling happens.
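
For illustration, the internal model-id override could be wired into the API integration test config roughly as sketched below. The setting name (xpack.observabilityAIAssistant.modelId) and the tiny test model id are assumptions for the sketch, not values taken from this PR.

```ts
// Hypothetical FTR config sketch. The setting name and the tiny model id are
// assumptions, not copied from this PR.
import { FtrConfigProviderContext } from '@kbn/test';

export default async function ({ readConfigFile }: FtrConfigProviderContext) {
  const baseConfig = await readConfigFile(require.resolve('../common/config'));

  return {
    ...baseConfig.getAll(),
    kbnTestServer: {
      ...baseConfig.get('kbnTestServer'),
      serverArgs: [
        ...baseConfig.get('kbnTestServer.serverArgs'),
        // Point the knowledge base at a small test model instead of the real ELSER model.
        '--xpack.observabilityAIAssistant.modelId=pt_tiny_elser',
      ],
    },
  };
}
```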

@obltmachine

🤖 GitHub comments


Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@neptunian (Contributor Author)

/ci

@neptunian (Contributor Author)

/ci

@neptunian neptunian marked this pull request as ready for review July 25, 2024 00:45
@neptunian neptunian requested review from a team as code owners July 25, 2024 00:45
@neptunian neptunian added the release_note:skip (Skip the PR/issue when compiling release notes) label Jul 25, 2024
@kibanamachine (Contributor)

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#6629

[✅] x-pack/test/observability_ai_assistant_api_integration/enterprise/config.ts: 25/25 tests passed.

see run history

@jgowdyelastic (Member) left a comment

Added a comment, but overall ML changes LGTM

if (!license.hasAtLeast('enterprise')) {
  return defaultModelId;
}
if (configModelId) {
Member

should we move this up before the license check?

Contributor Author

I put it below just to make sure someone without an enterprise license could not try to use it, but I'm not sure it matters. And because, at the moment, we are only testing the enterprise functionality, where using the assistant is allowed. Though I'm not sure why we need to call doInit and create the ES assets in the first place if the license is not enterprise.

Member

I think we await the license check here just to make sure it has completed. From the looks of it, we don't really care whether the license is valid - but then why even return a model id if it is not? So I think we should move the license check out of this code and throw an invalid license error further down. WDYT?

Contributor Author

It sounds like you and Søren had a similar conversation https://github.com/elastic/kibana/pull/181700/files#r1579341084

It seems like we were catching the invalid license error that ML was throwing, but we didn't want to log it. So we check whether the license is enterprise and just return the "default model id", even though it is of no use. I was also thinking we should not call doInit, or should stop getModelId from being called further up, if they don't have an enterprise license.

I moved the config check before the license check and added back a comment that had been removed, which I think was helpful.
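
For reference, the reordering described here might look roughly like the following; the function shape and helper names are assumptions for the sketch, not code copied from the PR.

```ts
// Sketch of the discussed ordering: the internal config override is checked
// before the license. Helper names and shapes are illustrative assumptions.
async function getModelId({
  configModelId,
  getLicense,
  resolveElserId,
  defaultModelId,
}: {
  configModelId?: string;
  getLicense: () => Promise<{ hasAtLeast(level: 'enterprise'): boolean }>;
  resolveElserId: () => Promise<string>; // e.g. asks ML for the recommended ELSER model id
  defaultModelId: string;
}): Promise<string> {
  // The internal override (the new modelId config field) wins outright,
  // so it is checked before the license.
  if (configModelId) {
    return configModelId;
  }

  // Await the license so we know the check has completed; without an
  // enterprise license, return the default id instead of logging an ML error.
  const license = await getLicense();
  if (!license.hasAtLeast('enterprise')) {
    return defaultModelId;
  }

  return resolveElserId();
}
```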

  await ml.testResources.cleanMLSavedObjects();
});

it('returns correct status after knowledge base is setup', async () => {
Member

should we test what happens before the model is imported?
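
For illustration, a pre-setup check in the FTR suite might look roughly like this; the route path and response shape are assumptions, not verified against this PR's tests.

```ts
// Hypothetical sketch: route path and response shape are assumptions.
// `supertest` and `expect` come from the surrounding FTR test context.
it('reports the knowledge base as not ready before the model is installed', async () => {
  const response = await supertest
    .get('/internal/observability_ai_assistant/kb/status')
    .set('kbn-xsrf', 'foo')
    .expect(200);

  expect(response.body.ready).to.be(false);
});
```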

@neptunian (Contributor Author) Jul 31, 2024

It seems, during setup, we assume that no installation error is going to happen. The call to create the model throws an error that isn't caught:

{
  name: 'ResponseError',
  message: 'action_request_validation_exception\n' +
    '\tRoot causes:\n' +
    '\t\taction_request_validation_exception: Validation Failed: 1: [model_type] must be set if [definition] is not defined.;2: [inference_config] must not be null.;'
}

This causes pRetry to call itself again and repeat the process until it finally fails after 12 attempts. We display a toast with a 500 internal server error:
(screenshot of the error toast omitted)

These validation errors should not happen, given that elser2 should already have these properties set somewhere, but I think we should check the call to create the model and stop if it fails for whatever reason. If I understand correctly, pRetry is mainly checking whether a model has finished installing, and we depend on errors from the getTrainedModels status call for that?

@dgieselaar (Member) Jul 31, 2024

What I think happens is (and maybe we can e.g. use named functions to clarify this):

  • (1) Check whether model is installed
    -- 1A. If this check throws an error, and it's a resource_not_found_exception or status_exception, install the model, and re-evaluate step (1). This is what happens on a clean slate.
    -- 1B. If this check does not throw, but returns false (the model is installing, but not fully defined), retry (1) in n seconds.
    -- 1C. if this check does not throw, and returns true, consider the model successfully installed and available for deployment, and continue to step (2)
  • (2) Deploy installed model.
    -- 2A. If this throws with a resource_not_found_exception or status_exception, catch the error and continue to (3).
    -- 2B. If any other error, fail and exit the process.
  • (3) Check if model has been successfully deployed
    -- 3A. If successfully deployed, exit process with a 200
    -- 3B. If not successfully deployed, re-evaluate (3) in n seconds

To answer your question, I think we don't really retry the installModel call - we retry the calls that determine whether the model is installed. This is a consequence of the fact that a model install can be in progress. One thing that might clear this up is to move the installModel call out of the pRetry. Perhaps it should be something like (a rough sketch follows the list):

  • installModelIfDoesNotExist()
  • pollForModelInstallCompleted()
  • deployModelIfNeeded()
  • pollForModelDeploymentCompleted()
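
A rough sketch of that four-phase shape, with the phase checks injected as callbacks so the example stays self-contained; only the step numbering comes from the comment above, while the bodies and retry counts are illustrative assumptions.

```ts
import pRetry from 'p-retry';

// Illustrative only: the phase checks are injected so the sketch is self-contained.
interface ModelSetupDeps {
  installModelIfDoesNotExist: () => Promise<void>;
  isModelInstalled: () => Promise<boolean>;
  deployModelIfNeeded: () => Promise<void>;
  isModelDeployed: () => Promise<boolean>;
}

async function setupModel(deps: ModelSetupDeps) {
  // (1)/1A: install once; any failure here propagates immediately instead of
  // being swallowed by the polling loop.
  await deps.installModelIfDoesNotExist();

  // 1B/1C: poll until the model is fully defined (pollForModelInstallCompleted).
  await pRetry(
    async () => {
      if (!(await deps.isModelInstalled())) {
        throw new Error('model is still installing');
      }
    },
    { retries: 12 }
  );

  // (2): start a deployment unless one is already running (deployModelIfNeeded).
  await deps.deployModelIfNeeded();

  // (3): poll until the deployment reports as started (pollForModelDeploymentCompleted).
  await pRetry(
    async () => {
      if (!(await deps.isModelDeployed())) {
        throw new Error('model deployment has not completed');
      }
    },
    { retries: 12 }
  );
}
```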

@neptunian (Contributor Author) Jul 31, 2024

For (1), that makes sense and aligns with the behaviour I'm seeing. The problem arises when the model cannot install: we get "stuck" in 1A and call installModel repeatedly. This happens when putTrainedModel() inside installModel throws, which means getTrainedModels() will continue to throw because the install never started.

Thanks, I'll see if I can clear it up and add error handling for when the model install can't start / throws.
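
One way that install-time error handling could be expressed is sketched below; the client shape and error inspection are illustrative assumptions, not the actual Kibana code.

```ts
// Illustrative: tolerate "already exists"-style errors so an in-progress or
// completed install is not treated as a failure, but surface anything else
// (e.g. a validation error from putTrainedModel) immediately.
async function installModelIfDoesNotExist({
  putTrainedModel,
  modelId,
}: {
  putTrainedModel: (modelId: string) => Promise<void>;
  modelId: string;
}) {
  try {
    await putTrainedModel(modelId);
  } catch (error) {
    const type = (error as { body?: { error?: { type?: string } } }).body?.error?.type;

    if (type !== 'resource_already_exists_exception') {
      // A genuine install failure: stop here instead of leaving the caller
      // polling for an install that never started.
      throw error;
    }
  }
}
```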

@neptunian (Contributor Author) Jul 31, 2024

@dgieselaar I updated (1), the model installation process. Let me know if that's easier to read.

@kibanamachine (Contributor)

Flaky Test Runner Stats

🎉 All tests passed! - kibana-flaky-test-suite-runner#6669

[✅] x-pack/test/observability_ai_assistant_api_integration/enterprise/config.ts: 25/25 tests passed.

see run history

@dgieselaar (Member) left a comment

Thanks a ton, this is great!

@neptunian (Contributor Author)

/ci

@kibana-ci (Collaborator) commented Aug 5, 2024

💚 Build Succeeded

  • Buildkite Build
  • Commit: 2471e7c
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-189000-2471e7c1f094

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@neptunian neptunian merged commit f18224c into elastic:main Aug 5, 2024
23 checks passed
@kibanamachine kibanamachine added the v8.16.0 and backport:skip (This commit does not require backporting) labels Aug 5, 2024
Labels
backport:skip (This commit does not require backporting), ci:project-deploy-observability (Create an Observability project), release_note:skip (Skip the PR/issue when compiling release notes), Team:Obs AI Assistant, v8.16.0
Development

Successfully merging this pull request may close these issues.

[Observability AI Assistant] integration tests for knowledge base api