Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coll/acoll: Remove use of cid as array index #12783

Merged
merged 1 commit into from
Sep 17, 2024

Conversation

amd-nithyavs
Copy link
Contributor

This PR removes the use of context id as array index in mca_coll_acoll_module_t. Previously, coll_acoll_subcomms_t subc[] was indexed through context id of the communicator. Now, subc is dynamically allocated for each new communicator (for up to 10 communicators), thereby eliminating the need to use context id as array index.

There is no change in performance of the collective APIs with this change.

@rhc54
Copy link
Contributor

rhc54 commented Aug 29, 2024

So....what happens when the 11th communicator appears?

@amd-nithyavs
Copy link
Contributor Author

So....what happens when the 11th communicator appears?

When it exceeds 10 communicators, it falls back to base functions.

@rhc54
Copy link
Contributor

rhc54 commented Sep 4, 2024

When it exceeds 10 communicators, it falls back to base functions.

And the user will know this happened - how? How will it impact them?

What I'm getting at is that you are introducing a hard limit into the OMPI code base, and users are going to have a difficult time realizing what is happening and trying to figure out a way around it. Historically, we have found that such hard limits tend to generate user unhappiness - so perhaps some justification for why it is necessary here would help? You state that there is no performance benefit, so why do this?

Something like (a) what problem is this solving, (b) how the 10 communicators are impacted by the change, (c) how a communicator/performance is impacted when you fall back to the base, (d) how a user can expand that number if the limit is hurting them, etc.

@edgargabriel
Copy link
Member

@amd-nithyavs would it be possible to make the max. number of subcommunicators that you want to store an mca-parameter that the user could adjust, and allocate the array dynamically during acoll_module_enable() ?

@amd-nithyavs
Copy link
Contributor Author

@amd-nithyavs would it be possible to make the max. number of subcommunicators that you want to store an mca-parameter that the user could adjust, and allocate the array dynamically during acoll_module_enable() ?

Done.

Copy link
Member

@edgargabriel edgargabriel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good to me, just a few minor nits

err = check_and_create_subc(comm, acoll_module, &subc);

/* Fallback to linear if subcomms structure is not obtained */
if (subc == NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit: the Open MPI coding standard require to have the constant as the first argument of the comparison, so this should be

if (NULL == subc) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

/* Obtain the subcomms structure */
err = check_and_create_subc(comm, acoll_module, &subc);
/* Fallback to knomial if subcomms is not obtained */
if (subc == NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dito here (there a few more later in this pr, I assume you will check all comparisons in this PR to adjust this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


/* Subcomms structure is not present, create one if within limit*/
if (num_subc == acoll_module->max_comms) {
*subc_ptr = NULL;
Copy link
Member

@edgargabriel edgargabriel Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add here an opal_output() warning to the user that we are using the fallback path and suggest that he can increase the max no. of communicators handled by the component using the mca parameter? This warning could be displayed with verbosity level turned on or similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

The use of cid as array index is removed. Instead, now
coll_acoll_subcomms_t is dynamically allocated for each new
communicator.

Signed-off-by: Nithya V S <Nithya.VS@amd.com>
Copy link
Member

@edgargabriel edgargabriel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@edgargabriel edgargabriel merged commit 5f00259 into open-mpi:main Sep 17, 2024
14 checks passed
@amd-nithyavs amd-nithyavs deleted the 29Aug2024_nocid_array branch September 18, 2024 08:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants