Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving message for access to unknown objects #28209

Merged
merged 20 commits into from
Dec 18, 2022
Merged

Conversation

pabloem
Copy link
Contributor

@pabloem pabloem commented Aug 31, 2022

Signed-off-by: Pablo pabloem@users.noreply.github.com

Why are these changes needed?

Consider this scenario:


ray.init()
my_objref = ray.put('MY TEXT')
ray.shutdown()

ray.init()

@ray.remote
def my_fn(obj):
  print(obj)

my_fn.remote(my_objref)

In this scenario, Ray will throw this error and kill the Python interpreter (without erroring out!). I've tried to rephrase the error to be a little easier to figure out and fix, and changed the check so that an error will be thrown rather than ending the whole process.


Changes to Ray APIs

The best overview you can get is by checking the changes to test_ray_shutdown.py. The behavior changes in the following scenarios:

First consider the following set up:

ray.init()
my_ref = ray.put("anystring")

@ray.remote
def f(s):
    print(s)

ray.shutdown()

ray.init()

In this case, my_ref is an ObjRef that exists from a previous Ray session, and start a new ray session that does not know this object. Without this change, the behavior is:

Call Result
f.remote([my_ref]) This would cause the Python interpreter to die
f.remote(my_ref) This would cause the Python interpreter to die
ray.get(my_ref, timeout=30) This would cause Ray to throw a Timeout exception
ray.get(my_ref) This would cause Ray to hang indefinitely
ray.wait([my_ref]) This would cause Ray to hang indefinitely
ray.wait([my_ref], timeout=30) This would cause Ray to throw a timeout
ray.wait([ray.put("something"), my_ref]) This would cause Ray to always return ObjRef("something")
ray.wait([ray.put("something"), my_ref], num_returns=2) This would cause Ray hang indefinitely

With these changes, the behavior for each call is now:

Call Result
f.remote([my_ref]) Throws ValueError with the expected explanation
f.remote(my_ref) Throws ValueError with the expected explanation
ray.get(my_ref, timeout=30) Throws ValueError with the expected explanation
ray.get(my_ref) Throws ValueError with the expected explanation
ray.wait([my_ref]) Throws ValueError with the expected explanation
ray.wait([my_ref], timeout=30) Throws ValueError with the expected explanation
ray.wait([ray.put("something"), my_ref]) This would cause Ray to always return ObjRef("something")
ray.wait([ray.put("something"), my_ref], num_returns=2) Throws ValueError with the expected explanation

I would argue that this change is preferable from the previous behavior because:

  • Python interpreter dying is never an expected outcome for a Python application.
  • Rather than hanging indefinitely, users now can know that the object they're requesting will never be available

That's it, I think.

(FYI on above section @stephanie-wang @jjyao)

Related issue number

I have not created an issue to track this.

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@pabloem pabloem changed the title [wip] Improving message for access to unknown objects Improving message for access to unknown objects Sep 7, 2022
@pabloem
Copy link
Contributor Author

pabloem commented Sep 7, 2022

This is meant to fix #28211 - how can I run the test I added to verify that?

r: @jjyao

@pabloem pabloem marked this pull request as ready for review September 7, 2022 17:01
@pabloem
Copy link
Contributor Author

pabloem commented Sep 7, 2022

I also added a test for #28341, but I have not added a fix

@pabloem
Copy link
Contributor Author

pabloem commented Sep 7, 2022

hm I see this does not raise an error either. How can I raise a Python error from C? : )

@richardliaw
Copy link
Contributor

@jjyao could you shepherd this?

@jjyao
Copy link
Collaborator

jjyao commented Sep 8, 2022

Sorry for the late review. I'll take a look tomorrow!

@pabloem
Copy link
Contributor Author

pabloem commented Sep 8, 2022

question:
I'm running the following:

bazel build -- //:ray_pkg //cpp:ray_cpp_pkg
pip install -e python/.
python -m pytest -v -s python/ray/tests/test_ray_shutdown.py

This works for my purposes, but it won't build Ray with all the compilation checks (e.g. -Wunused-result) - do you know what I might need for that?

@pabloem
Copy link
Contributor Author

pabloem commented Sep 8, 2022

ok at this point the failing tests seem like unrelated flakes. LMK if I should look into it!

(also, I'll sign the commit on the next review cycle)

@jjyao
Copy link
Collaborator

jjyao commented Sep 9, 2022

I actually think the check failure is fine here: you should never share the object ref out of band across clusters. If you do that then it's a bug in the application code and the check failure basically tells you to fix the application code. IMO, it probably doesn't worth the effort to propagate up a catchable exception (it doesn't make much sense to catch and recover from it, we should just fix the application code).

Maybe we should just improve the check failure message? WDYT @stephanie-wang ?

@pcmoritz
Copy link
Contributor

pcmoritz commented Sep 9, 2022

Hmm, it seems to me that having a Python program just terminate the interpreter is always bad and raising and exception is much preferable. I haven't reviewed the PR in detail but it doesn't seem to introduce that much complexity. So we should probably merge it.

While it doesn't make sense to catch the exception and recover from it, raising an exception is always better than just terminating the interpreter. Python is a safe language, so it shouldn't crash.

@pabloem
Copy link
Contributor Author

pabloem commented Sep 9, 2022

Thanks! Right, I don't think the python interpreter should ever just terminate like that, but rather throw a Python exception that will tell the application what is the problem. I've added a couple small tests that you can try in Ray without this change and see how the behavior without the change is at least unintuitive

(p.s. pushed signed-off commit)

@jjyao
Copy link
Collaborator

jjyao commented Sep 9, 2022

Ok, then I think we should check all the public apis that accept obj refs or nested obj refs and make sure they all throw the exception instead of check failure.

@pabloem
Copy link
Contributor Author

pabloem commented Sep 9, 2022

I've created a GetOwnerAddressOrDie method which will preserve the existing behavior for untested paths, and used GetOwnerAddress for paths that we can test. Do you think we could review the current scope of changes and have a follow-up item to review other API paths that may cause CPython death afterwards? : )

c_arg, c_owner_address)
if not op_status.ok():
raise PlasmaObjectNotAvailable(
'Object %r could not be found in this Ray session.' % arg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add possible reasons why this can happen to the message just like the check failure message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

return CCoreWorkerProcess.GetCoreWorker().GetOwnerAddress(
c_object_id).SerializeAsString()
CAddress c_owner_address
CCoreWorkerProcess.GetCoreWorker().GetOwnerAddress(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to check status here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the check there. This should be more reasonable, huh?

op_status = CCoreWorkerProcess.GetCoreWorker().GetOwnerAddress(
c_arg, c_owner_address)
if not op_status.ok():
raise PlasmaObjectNotAvailable(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this exception is for a different purpose based on its comment. Should it just be ValueError given that we pass invalid objects to the functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense. done.

appropriate_exception_thrown
), "ray.get is hanging on unknown object retrieval"


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a similar test for ray.wait() as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added one. please make sure to check on the behavior and lmk if it makes sense to you.

python/ray/tests/test_ray_shutdown.py Show resolved Hide resolved
@@ -279,13 +279,30 @@ Status CoreWorkerMemoryStore::Get(const std::vector<ObjectID> &object_ids,
/*abort_if_any_object_is_exception=*/true);
}

Status CoreWorkerMemoryStore::ObjectHasOwner(const ObjectID &object_id,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel a bool is enough here based on the function name?

Also I feel we can expose an ObjectHasOwner method in core_worker and do the validation at the beginning of public apis like ray.get() (normally where we do function inputs validation) so the remaining code can just assume the objects are always valid and we have a single logic deciding whether an object is valid. WDYT?

exception_thrown = False
try:
f.remote(my_ref) # This would cause full CPython death.
except ray.exceptions.RayError:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we check more specifically to make sure the exact error is thrown, RayError is too broad. We can even check the exception message.

Btw, you can use with pytest.raises(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. thanks!

@@ -311,6 +328,10 @@ Status CoreWorkerMemoryStore::GetImpl(const std::vector<ObjectID> &object_ids,
count += 1;
} else {
remaining_ids.insert(object_id);
Status status = ObjectHasOwner(object_id, abort_if_any_object_is_exception);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think abort_if_any_object_is_exception and abort_if_missing are two different things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I've changed this

@jjyao
Copy link
Collaborator

jjyao commented Sep 13, 2022

Do you think we could review the current scope of changes and have a follow-up item to review other API paths that may cause CPython death afterwards? : )

sg :)

@pabloem
Copy link
Contributor Author

pabloem commented Oct 10, 2022

@jjyao any pointers for me to look at reproducing any possible failures locally? : )

@pabloem
Copy link
Contributor Author

pabloem commented Nov 1, 2022

@jjyao ptal? : )

@pabloem
Copy link
Contributor Author

pabloem commented Nov 1, 2022

i'll try to debug why the tests are not passing as expected...

@jjyao
Copy link
Collaborator

jjyao commented Nov 1, 2022

sorry for dropping the ball here, I'll take a look at the failures as well.

Signed-off-by: Pablo E <pabloem@apache.org>
Signed-off-by: Pablo E <pabloem@apache.org>
Signed-off-by: Pablo E <pabloem@apache.org>
Signed-off-by: Pablo E <pabloem@apache.org>
@pabloem
Copy link
Contributor Author

pabloem commented Dec 15, 2022

rebased!

I think this should be ready : ) @jjyao thanks

@pabloem pabloem closed this Dec 15, 2022
@pabloem pabloem reopened this Dec 15, 2022
@pabloem
Copy link
Contributor Author

pabloem commented Dec 15, 2022

reopening to rernu tests.......

Signed-off-by: Pablo E <pabloem@apache.org>
peytondmurray pushed a commit to peytondmurray/ray that referenced this pull request Dec 16, 2022
While working on ray-project#28209 I hit issue google/googletest#3514, which causes a couple gtest-dependent tests to be unable to build

Signed-off-by: Pablo E <pabloem@apache.org>
@pabloem pabloem requested a review from jjyao December 16, 2022 01:28
@pabloem
Copy link
Contributor Author

pabloem commented Dec 16, 2022

ok looks good finally : )

for (size_t i = 0; i < ids.size(); i++) {
rpc::Address unused_owner_address;
auto status = GetOwnerAddress(ids[i], &unused_owner_address);
if (!status.ok()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of not ok, can we check ObjectUnknownOwner explicitly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

for (size_t i = 0; i < ids.size(); i++) {
rpc::Address unused_owner_address;
auto status = GetOwnerAddress(ids[i], &unused_owner_address);
if (!status.ok()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jjyao
Copy link
Collaborator

jjyao commented Dec 16, 2022

Tests look good. I manually triggered mac tests, lets see if they are passing as well.

@pabloem
Copy link
Contributor Author

pabloem commented Dec 17, 2022

CI for the previous commit without the couple changes from comments: https://buildkite.com/ray-project/oss-ci-build-pr/builds/7636#0185179c-5905-4cf6-95bb-d6eea5422628

Signed-off-by: Pablo E <pabloem@apache.org>
@jjyao jjyao merged commit d57b582 into ray-project:master Dec 18, 2022
@jjyao
Copy link
Collaborator

jjyao commented Dec 18, 2022

Thanks!!

@pabloem pabloem deleted the patch-1 branch December 18, 2022 14:57
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
While working on ray-project#28209 I hit issue google/googletest#3514, which causes a couple gtest-dependent tests to be unable to build

Signed-off-by: Pablo E <pabloem@apache.org>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Capiru pushed a commit to Capiru/ray that referenced this pull request Dec 22, 2022
Signed-off-by: Pablo E <pabloem@apache.org>

When accessing unknown objects, Ray will kill the Python interpreter (without erroring out!). I've tried to rephrase the error to be a little easier to figure out and fix, and changed the check so that an error will be thrown rather than ending the whole process.

Signed-off-by: Capiru <gabriel_s_prado@hotmail.com>
rickyyx added a commit that referenced this pull request Jan 3, 2023
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
Signed-off-by: Pablo E <pabloem@apache.org>

When accessing unknown objects, Ray will kill the Python interpreter (without erroring out!). I've tried to rephrase the error to be a little easier to figure out and fix, and changed the check so that an error will be thrown rather than ending the whole process.
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 16, 2023
Signed-off-by: Pablo E <pabloem@apache.org>

When accessing unknown objects, Ray will kill the Python interpreter (without erroring out!). I've tried to rephrase the error to be a little easier to figure out and fix, and changed the check so that an error will be thrown rather than ending the whole process.

Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
Signed-off-by: Pablo E <pabloem@apache.org>

When accessing unknown objects, Ray will kill the Python interpreter (without erroring out!). I've tried to rephrase the error to be a little easier to figure out and fix, and changed the check so that an error will be thrown rather than ending the whole process.

Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
Signed-off-by: Pablo E <pabloem@apache.org>

When accessing unknown objects, Ray will kill the Python interpreter (without erroring out!). I've tried to rephrase the error to be a little easier to figure out and fix, and changed the check so that an error will be thrown rather than ending the whole process.

Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants