
Fix finding "prte" for singleton comm_spawn #12363

Closed
wants to merge 1 commit

Conversation

@rhc54 (Contributor) commented Feb 22, 2024

Copy the search code from mpirun to use when starting "prte" in support of comm_spawn. This respects things like OPAL_PREFIX.

Refs #12349

@rhc54 (Contributor, Author) commented Feb 22, 2024

Is there a way to restart the NVIDIA CI when it fails to deploy?

@rhc54 force-pushed the topic/dpm branch 2 times, most recently from cd57e0a to 397f9f1 on February 23, 2024
Two review threads on ompi/dpm/dpm.c (outdated, resolved)
@dalcinl (Contributor) commented Feb 23, 2024

After looking at PRRTE code, it looks like the prterun and prte are indeed two different things. Therefore, I'm having a hard time understanding how a single environment variable named OMPI_PRTERUN can be used to specify two different tools. IMHO, mpirun should look for OMPI_PRTERUN, and singleton-init mode should look for OMPI_PRTE, or at least something different than OMPI_PRTERUN.

@rhc54 (Contributor, Author) commented Feb 23, 2024

You are overthinking this - just leave it be, please. It is correct and will work correctly as written.

Review comment on ompi/dpm/dpm.c (outdated):
@@ -1974,6 +1975,46 @@ static void set_handler_default(int sig)
sigaction(sig, &act, (struct sigaction *)0);
}

static char *find_prte(void)
A Member commented:

We already have a function doing the exact same thing in ompi/tools/mpirun/main.c. Instead of duplicating code maybe we should unify the two and have a consistent approach.

The Contributor Author replied:

I will look forward to your PR. This seemed a rather trivial duplication for a very limited purpose.

@bosilca (Member) commented Feb 23, 2024

I have mixed feelings about this solution. OPAL_PREFIX (and friends) in the context of mpirun will impact the path for all the spawnees. It will give them access to the correct executables and the correct libraries, all from one single starting process, aka. mpirun. Here, we are trying to address the singleton issue and as such we are in a completely different setting because we are not in the mpirun process but directly in the user application. Thus, changes via OPAL_PREFIX are too late to impact this process but will impact all children leading to potentially incorrect matching between libmpi and loaded components.

Basically, I think that ignoring OPAL_PREFIX for singleton is the right approach.

@rhc54 (Contributor, Author) commented Feb 23, 2024

Thus, changes via OPAL_PREFIX are too late to impact this process but will impact all children leading to potentially incorrect matching between libmpi and loaded components.

Ummm...that isn't actually true. OPAL_PREFIX had to be set in the environment of the singleton or else we wouldn't be seeing it here. So it has already impacted the singleton itself. All we are doing is translating that value to use in (a) locating the corresponding prte executable, and then (b) ensuring that PRRTE and PMIx pick up their correct corresponding libraries.

Ignoring OPAL_PREFIX means that you suddenly are not guaranteed to be picking up the corresponding prte and its associated libraries - which puts you into unknown territory. Probably not what you really want.
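The translation step being described could be sketched as follows. The function name is hypothetical and the real OMPI code is more involved, but the essence is just steering PRRTE and PMIx to the same relocated installation via their own prefix variables:

```c
/* Hypothetical sketch of the OPAL_PREFIX translation described above;
 * names are illustrative, not the actual OMPI symbols. */
#include <stdlib.h>

static int translate_opal_prefix(void)
{
    const char *prefix = getenv("OPAL_PREFIX");
    if (NULL == prefix) {
        return 0;  /* nothing to translate; rely on default paths */
    }
    /* Point PRRTE and PMIx at the same relocated installation so the
     * spawned prte picks up matching libraries. */
    setenv("PRTE_PREFIX", prefix, 1);
    setenv("PMIX_PREFIX", prefix, 1);
    return 1;
}
```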

@bosilca (Member) commented Feb 23, 2024

Correct, OPAL_PREFIX is set, but it will be ignored for the MPI library itself. It will only affect component/path resolution after the MPI library is loaded, which is my exact concern.

@rhc54 (Contributor, Author) commented Feb 23, 2024

I'm happy to simply withdraw the PR - it will leave singleton comm-spawn broken when the installation is relocated, but I guess you folks can figure out how you want to resolve that problem (or not).

@bosilca (Member) commented Feb 23, 2024

We need to discuss this seriously and come up with a sane approach (maybe not allowing singletons).

@rhc54 (Contributor, Author) commented Feb 23, 2024

I'm happy to leave this here for your consideration. If it is of any help, I believe it was Brian who pointed out one time that the Standard doesn't say you have to support singleton comm_spawn in the absence of a supporting RTE. If so, it could be sufficient for comm_spawn to return an error message indicating that a supporting RTE wasn't found, perhaps suggesting they set up a PRRTE DVM and try again.

@dalcinl (Contributor) commented Feb 24, 2024

(maybe not allowing singletons).

And then all the Python folks using mpi4py.futures within an IPython notebook running in the browser will hit my inbox asking why the thing does not work.

@bosilca What if the OPAL_PREFIX to PMIX/PRRTE translation business is done at MPI_Init time for the singleton init case? At least it would happen consistently, irrespective of whether you use spawn or not. Of course modifying a process's environment mid-flight (i.e. at MPI_Init time) is maybe not great design, but IMHO that's better than not supporting a feature or having it broken as it is now.

@rhc54 (Contributor, Author) commented Feb 24, 2024

Hmmm...I don't think that will help, I'm afraid. However, I'm puzzling over the actual impact of setting OPAL_PREFIX in the environment. Looking at the RTE code, all we do is take the OPAL_PREFIX value and ensure that it is in the environment of the application process when we start it. We don't prefix it to LD_LIBRARY_PATH or do anything else with it - just set it in the environment.

Which means that the application process is getting the exact same behavior as the singleton. They both only see OPAL_PREFIX in their environment, and do whatever with it from that point (which appears to be simply influencing where to find the OMPI plugins). Most importantly, OPAL_PREFIX isn't altering the initial LD_LIBRARY_PATH used to find the initial libmpi. Perhaps that is a (very long standing) error?

Perhaps even more interesting is that PRRTE doesn't even look for OPAL_PREFIX, and therefore does not propagate it like ORTE used to do. Is that intentional? Or did someone forget to do that in the schizo/ompi plugin?

@rhc54 (Contributor, Author) commented Feb 24, 2024

We don't prefix it to LD_LIBRARY_PATH or do anything else with it - just set it in the environment.

Actually, I think I have to correct myself here - I had to look further back in the codepath. OPAL_PREFIX alters the opal_installdirs values, and we do prepend those onto LD_LIBRARY_PATH during launch.

For OMPI v5, the key is the code that translates OPAL_PREFIX to PRTE_PREFIX prior to invoking prterun when using the internal PRRTE installation. PRRTE will pick up that value and prepend it to LD_LIBRARY_PATH.

Sadly, if using an external PRRTE, OPAL_PREFIX will have zero impact. Likely an error of omission - someone should fix the schizo/ompi component so that doesn't happen.

@rhc54 (Contributor, Author) commented Feb 24, 2024

So I got curious and did some sleuthing, and here is what is going on. In order for a singleton to start, it has to find libmpi - which means the user must have set LD_LIBRARY_PATH to point to the OMPI installation, wherever it got moved to. Because the install was moved, you also must set OPAL_PREFIX so that OMPI's internals will adjust themselves to the move.

In that instance, then this code is doing all the right things. PRRTE and PMIx will likewise use LD_LIBRARY_PATH when prte starts up, and then adjust their internals based on the translated PRTE_PREFIX and PMIX_PREFIX values. All is good.

The only case where there is a potential problem is when (a) there are multiple OMPI installations on the system, and (b) LD_LIBRARY_PATH is pointing to an OMPI version different than the one pointed to by OPAL_PREFIX. In this case, even the singleton when it starts up is going to be in trouble as the internals of the libmpi it found are going to be adjusted to point to the incorrect place. The impact of this is subtle, but we do use opal_installdirs values throughout the codebase, and they will all be pointing to the wrong location.

So perhaps the correct solution to this problem is to have an internal check of OPAL_PREFIX vs where we found libmpi to ensure things are consistent? Otherwise, the problem isn't confined to the singleton comm-spawn issue - it impacts singletons in general.
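The consistency check proposed here could be sketched as below. This is a hypothetical helper, not existing OMPI code: the libmpi path would come from something like dladdr() on a libmpi symbol, but it is passed in as a plain parameter so the check itself stays simple:

```c
/* Hypothetical sketch of an OPAL_PREFIX-vs-libmpi consistency check.
 * libmpi_path would in practice be discovered at runtime (e.g. via
 * dladdr()); here it is a parameter for clarity. */
#include <string.h>

static int prefix_matches_libmpi(const char *opal_prefix,
                                 const char *libmpi_path)
{
    size_t len;
    if (NULL == opal_prefix) {
        return 1;  /* no relocation requested; nothing to check */
    }
    len = strlen(opal_prefix);
    /* libmpi must live somewhere under OPAL_PREFIX, e.g.
     * <prefix>/lib/libmpi.so... */
    return 0 == strncmp(opal_prefix, libmpi_path, len) &&
           '/' == libmpi_path[len];
}
```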

@dalcinl (Contributor) commented Feb 25, 2024

which means the user must have set LD_LIBRARY_PATH to point to the OMPI installation

In my particular use case, I'm not setting LD_LIBRARY_PATH at all. The MPI library is opened via dlopen from a known location (not the original ompi install prefix). Additionally, I'm editing executables with patchelf/install_name_tool to have runpath $ORIGIN/../lib (or @executable_path/../lib on macOS), and libraries to have runpath $ORIGIN (or @loader_path). Anyway, I believe all this is equivalent to setting LD_LIBRARY_PATH properly. With a bit more work on the Open MPI side, I could even manage to determine the proper OPAL_PREFIX value automatically on some platforms, say by using dladdr() to get the full path to the shared library, and from there determine OPAL_PREFIX.

So perhaps the correct solution to this problem is to have an internal check of OPAL_PREFIX vs where we found libmpi to ensure things are consistent?

Except that being able to override the internal PMIX/PRTE with an external install may be a useful feature, allowing full reuse of pre-built binaries. However, perhaps at that point it is better to just override the whole thing, including libmpi.so.40, via LD_LIBRARY_PATH or by dlopening from a different specific location.

@dalcinl (Contributor) commented Feb 25, 2024

In my last comment, I forgot to say that, with my limited knowledge of ompi internals, I agree with Ralph's observations about why the changes in this PR are good.

However, I'm still wondering whether forcibly overriding PRTE_PREFIX/PMIX_PREFIX with setenv(..., 1) is the proper thing to do. What about letting a libmpi (built with internal PMIX/PRRTE) installed at prefixA work with an external PRRTE/PMIX installation at prefixB?
I guess it all depends on the definition/intention of using "internal" PRRTE/PMIx. In my particular use case, I do it out of convenience to have a self-contained set of binaries that are easy to relocate to arbitrary prefixes. These binaries work out of the box if casually installed on a bare workstation or laptop. But if installed in a proper supercomputing facility, you may need to point to a system-installed PRTE/PMIx by setting PRTE_PREFIX/PMIX_PREFIX in advance, in which case libmpi should not override the setting?

@rhc54 (Contributor, Author) commented Feb 25, 2024

I think we are getting a tad confused here, so let's separate out the two concerns.

One: how to ensure that the MPI library is self-consistent. Whether you manually set LD_LIBRARY_PATH or not, there is always a path defined in your environment. If nothing is explicitly set, then it is the default path. Regardless, the problem remains that setting OPAL_PREFIX can cause library confusion, and that is something OMPI needs to address.

Two: what to do about singleton comm-spawn's need to find prte. Totally separate issue, though OPAL_PREFIX is still involved. In this case, we support a couple of options. The user can set OMPI_PRTERUN to explicitly define the path to the desired prte executable. Or we can infer the path using the other tools at our disposal. Regardless, it appears that the code provided here pretty much covers all the options for this case.

So what the OMPI folks need to do is figure out how they want to deal with the first concern.

@dalcinl (Contributor) commented Feb 25, 2024

The user can set OMPI_PRTERUN to explicitly define the path to the desired prte executable

Sorry, but I have to insist, although this time with a question.
What would happen if the user sets the variable the following way?

export OMPI_PRTERUN=/path/to/prterun

Would singleton init still work as expected, despite the fact that the tool's basename is not prte but prterun?

@rhc54 (Contributor, Author) commented Feb 25, 2024

Then they set it incorrectly, and it will fail. One does need to follow the documentation and do things correctly - it is impossible to deal with every user error. Setting it this way would cause mpirun to fail as well, so it is being consistent.

Nobody wants to expose the details of OMPI's internals to the user. We have a documented procedure that works - let's just stick to it, please? I'd rather not continue the debate.

@rhc54 (Contributor, Author) commented Feb 25, 2024

For you OMPI folks: please note that the direct launchers out there (e.g. srun) do not recognize OPAL_PREFIX and therefore do not use the value to modify the LD_LIBRARY_PATH provided to the applications they start. However, they do forward the environment, and thus the potential conflict between the locations pointed to by opal_installdirs and those in the path exists.

On the plus side, all the launched procs will see the same conflict. 🤷‍♂️

@rhc54 (Contributor, Author) commented Feb 26, 2024

After discussion at the OMPI RM meeting today, it was decided that this PR fixes the identified issue and really doesn't create any new ones. The expressed concerns over OPAL_PREFIX have always existed, and have previously been dealt with via documentation - please see https://www.open-mpi.org/faq/?category=building#installdirs for a detailed explanation.

Accordingly, this should be good to go.

@bosilca (Member) commented Feb 26, 2024

I disagree, this is absolutely not the same thing. The issue we have today is screwing ALL processes, while this patch will only screw up all newly spawned processes creating a distributed run with potentially multiple versions of OMPI.

@rhc54 (Contributor, Author) commented Feb 26, 2024

The issue we have today is screwing ALL processes, while this patch will only screw up all newly spawned processes creating a distributed run with potentially multiple versions of OMPI.

I invite you to walk through the logic of that statement and explain to the OMPI community how either situation results in a correctly operating application. Whether the error causes the entire job to collapse, or only causes the interaction between the singleton and its children to fail, it still fails.

You are free to argue this at tomorrow's devel meeting - I have no further opinion on the matter.

@bosilca (Member) commented Feb 26, 2024

When it affects the entire job, the entire user application will be using a different MPI library than expected, but it will not fail, or at least I do not see any obvious reason for it to fail, as all processes will use the same wire protocols. Yes, it is possible to load components from a different installation, but as long as they are binary compatible (which we ensure in the MCA loading scheme) all is good.

When it affects only the spawnees, it will create a heterogeneous setup where the singleton and the rest of the processes will use a different MPI library. Not components, but a different MPI library. This is more than unexpected and can lead to weird errors as the wire protocols might differ.

Now, if the user knows what she's doing and the LD_LIBRARY_PATH (or the rpath) and OPAL_PREFIX match, all will be good in all cases. These users don't need protections; the others do.

@rhc54 (Contributor, Author) commented Feb 27, 2024

The feeling was that the documentation explains this in detail, and has been sufficient for nearly 20 years. What you appear to be advocating for is a change in that position by adding internal detection that the prefix and library path are in conflict.

I don't have any opinion on that, but it is beyond the scope of this PR. I'm only trying to fix the filed issue where we don't find prte even though OPAL_PREFIX is correctly set.

Commit message:

The "old" method relies on PMIx publish/lookup followed by
a call to PMIx_Connect. It then does a "next cid" method
to get the next communicator ID, which has multiple algorithms
including one that calls PMIx_Group.

Simplify this by using PMIx_Group_construct in place of
PMIx_Connect, and have it return the next communicator ID.
This is more scalable and reliable than the prior method.

Retain the "old" method for now as this is new code. Create
a new MCA param "OMPI_MCA_dpm_enable_new_method" to switch
between the two approaches. Default it to "true" for now
for ease of debugging.

NOTE: this includes an update to the submodule pointers
for PMIx and PRRTE to obtain the required updates to
those code bases.

Signed-off-by: Ralph Castain <rhc@pmix.org>
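The commit message describes an MCA param gating the two code paths. As a rough illustration only: real MCA parameters are registered through OMPI's mca_base_var machinery rather than read via getenv(), but the switch's behavior (environment override, default true) could be sketched as:

```c
/* Hypothetical sketch of the dpm_enable_new_method switch described
 * in the commit message. Real OMPI code registers the parameter via
 * the MCA base, not getenv(); this only illustrates the semantics. */
#include <stdlib.h>
#include <string.h>

static int dpm_new_method_enabled(void)
{
    const char *val = getenv("OMPI_MCA_dpm_enable_new_method");
    if (NULL == val) {
        return 1;  /* default to the new PMIx_Group_construct path */
    }
    return 0 == strcmp(val, "1") || 0 == strcmp(val, "true");
}
```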
@rhc54 (Contributor, Author) commented Mar 4, 2024

Oops - accidentally overwrote this branch. Oh well - nobody seemed inclined to take it anyway.

@rhc54 rhc54 closed this Mar 4, 2024
@dalcinl (Contributor) commented Mar 4, 2024

@rhc54 Any chance you can recover it? IIUC, the objection was mostly about forcibly setting OPAL_PREFIX to something that may not match the singleton's MPI library. There are platform-dependent alternatives to compute OPAL_PREFIX automatically, but that should be a matter for another PR. Other bits of this PR were an actual improvement, like the find_prte() utility function you added. That's good stuff that should go in.

@rhc54 (Contributor, Author) commented Mar 4, 2024

Yeah, I had it lying around in another place, so it was easy to recover. Still, if it doesn't get committed in the next week, I'll probably kill it anyway. I really dislike letting PRs sit around for months - or in some cases, years.

@rhc54 (Contributor, Author) commented Mar 4, 2024

Reopened it for consideration - will remove it permanently on Mar 11 if not committed.

@rhc54 (Contributor, Author) commented Mar 4, 2024

Guess I can't reopen it because it was recreated - so I opened #12390 as a replacement.
