
MoE #639

Open · wants to merge 141 commits into main

Conversation

@Muennighoff (Collaborator) commented Jun 30, 2024

Replaces #541

Notes:

  • I didn't find norm_after to work well; I added it to conform with other parts of the code, but I can also remove it.
  • I only left in the config file used for the final 5T run.
  • I didn't include all configurations that we ran for OLMoE (e.g. expert choice); I will probably put instructions for those in a separate olmoe repository for people who want to reproduce our runs exactly.

try:
    from megablocks.layers.moe import MoE
except ImportError:
    raise ImportError(
        "To train MoEs, run `pip install git+https://github.com/Muennighoff/megablocks.git@olmoe`"
    )
Member:

What's different about your branch from the original source?

@Muennighoff (Collaborator Author), Aug 1, 2024:

It includes zloss, which we use during training for better stability.

@Muennighoff (Collaborator Author):

You can view the exact difference here: databricks/megablocks@main...Muennighoff:megablocks:olmoe. Besides zloss, it also has expert choice, which is currently not used, but I think we may want to try it in the future when we go multimodal.

Member:

Can you upstream this, so we don't have to depend on a private fork?

@Muennighoff (Collaborator Author):

Sure, I opened a PR here: databricks/megablocks#133. If/when it gets merged, I will update the install instructions. If people don't want to use zloss, it also works with regular megablocks; it's not a big difference.

Member:

@Muennighoff, so they decided to merge their version instead. Is our version compatible? Will the model you trained work with their implementation of zloss?
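
For context, the router z-loss under discussion penalizes large router logits to stabilize training. Below is a minimal sketch of the standard formulation (as in the ST-MoE paper); the function name and default weight are illustrative, not the megablocks fork's actual API.

import torch

def router_z_loss(router_logits: torch.Tensor, weight: float = 1e-3) -> torch.Tensor:
    """Router z-loss: mean of the squared logsumexp over the expert dimension."""
    # router_logits: (num_tokens, num_experts) pre-softmax gating scores.
    z = torch.logsumexp(router_logits, dim=-1)  # (num_tokens,)
    return weight * (z ** 2).mean()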

    The number of experts to use in the MoE block.
    """

    moe_top_k: Optional[int] = 2
Member:

If these are Optional, what does it mean when it's None?

@Muennighoff (Collaborator Author):

They're optional when no MoE is used, otherwise required. Is this not an acceptable usage of Optional[int]? I can change it.

Collaborator:

In my opinion, when we have a config setting that is not always required, we should either (1) always make it an optional type, set it to None by default, and set it in every config where it is needed; or (2) not make it an optional type unless None is needed. I prefer (1) since it makes our configs more readable (fewer irrelevant settings) and slightly more backwards compatible.

@Muennighoff (Collaborator Author):

I can change it to option (1) if others agree? Note that there are other params not following this:

    embedding_size: Optional[int] = 50304
    gen1_gc_interval: Optional[int] = 1
    distributed_strategy: Optional[DistributedStrategy] = DistributedStrategy.fsdp
    fsdp: Optional[FSDPConfig] = field(default_factory=FSDPConfig)
    auxiliary_loss_multiplier: Optional[float] = 1e-4

Member:

Do you actually rely on the defaults you put in here anywhere? If not, let's go with Shane's version, and default these to None. I assume something somewhere will fail if they are not set and you need them.

@Muennighoff (Collaborator Author):

Do you actually rely on the defaults you put in here anywhere?

Yes, quite a lot, e.g. the loss weights; the use of dropless MoEs (moe_dropless); leaving moe_interleave, moe_lbl_in_fp32, moe_shared_expert as False.

Actually, I don't think setting them all to None is a good idea, as it means that every time we add a new MoE-specific configuration parameter, all existing MoE configs become outdated, since in that case every MoE-specific configuration parameter is Optional and has to be set explicitly.

I can also remove the Optional from them, since they have defaults anyway; but then, as seen in the examples I pasted above, we do have Optional config params with default values elsewhere in the codebase anyway.

Member:

If it doesn't break everything, I'd prefer to have a special config object for MoE, which is Optional, but none of the items inside of that object are Optional. This may break backwards compatibility with the model we already released though?

@Muennighoff (Collaborator Author):

Yes, it would break compat with the configs we released, but we can pin a commit in our released repo if people want to reuse our configs to reproduce things exactly.

Member:

Hm, that's unfortunate, but I think I prefer the MoEConfigObject. It reduces the impact on old-school dense model training.
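
For concreteness, a minimal sketch of what such a nested config object could look like; the field names and defaults below are illustrative, not the actual OLMo config:

from dataclasses import dataclass
from typing import Optional

@dataclass
class MoEConfig:
    # Nothing in here is Optional: every MoE setting has a concrete value.
    num_experts: int = 8
    top_k: int = 2
    dropless: bool = True
    zloss_weight: float = 1e-3
    loss_weight: float = 1e-2  # load-balancing loss weight
    log_expert_assignment: bool = True

@dataclass
class ModelConfig:
    # ... existing dense-model fields ...
    d_model: int = 2048
    n_layers: int = 16
    # The MoE block config as a whole is Optional; it stays None for dense models.
    moe: Optional[MoEConfig] = None

With this layout, dense configs never mention MoE settings, and adding a new MoE field with a default does not invalidate existing MoE configs.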

olmo/initialization.py (review thread resolved)

    x = self._activation_checkpoint_fn(self.ff_norm, x)  # type: ignore
else:
    x = self.ff_norm(x)
# Activation checkpointing for the MoE FFN is not supported
Member:

Why not? If there is a technical problem with it, will it affect whole_layer activation checkpointing as well?

@Muennighoff (Collaborator Author):

It fails with

torch.utils.checkpoint.CheckpointError: torch.utils.checkpoint: Unpack is being triggered for a tensor that was already unpacked once. If you are calling ctx.saved_tensors in backward, make sure to do so only once. Otherwise please open an issue with details on your use case.

This paper has some explanations of why it is difficult to do activation checkpointing for MoEs: https://dspace.mit.edu/bitstream/handle/1721.1/153897/wisdom-dwisdom-meng-eecs-2024-thesis.pdf

whole_layer is not supported with MoE, only fine_grained. I added code to raise an error if activation checkpointing is not fine_grained and MoE is configured.
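
Roughly, the added guard looks something like the sketch below; the import paths, class names (ActivationCheckpointingStrategy, OLMoConfigurationError), and where the check lives are assumptions based on this discussion rather than the exact code in the PR.

from olmo.config import ActivationCheckpointingStrategy, BlockType  # assumed import path
from olmo.exceptions import OLMoConfigurationError  # assumed import path

def check_moe_activation_checkpointing(config, strategy) -> None:
    """Reject activation-checkpointing strategies that MoE blocks cannot handle."""
    if config.block_type != BlockType.moe or strategy is None:
        return
    if strategy != ActivationCheckpointingStrategy.fine_grained:
        raise OLMoConfigurationError(
            "MoE blocks only support fine_grained activation checkpointing; "
            "whole_layer checkpointing triggers the megablocks double-unpack error."
        )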

Member:

Ok, I see. Interesting. It would be fixable I think (by saving the active experts per token in the forward pass), but out of scope for this PR.

Member:

This is probably a fairly big blocker to going bigger though. For dense models, our fastest settings still use a lot of checkpointing.

olmo/model.py (outdated, resolved)
olmo/train.py (outdated, resolved)
olmo/train.py (resolved)
olmo/train.py (outdated, resolved)
scripts/train.py (outdated, resolved)
@Muennighoff requested a review from dirkgr, August 3, 2024 01:33

@Muennighoff (Collaborator Author):

Linking this related PR that we should merge after: #707

If this PR here looks good to you, could you approve it @epwalsh / @dirkgr ? :)

@@ -1273,3 +1334,41 @@ def update_legacy_settings(cls, config: D) -> D:
        new_config.optimizer = OptimizerConfig.update_legacy_settings(new_config.optimizer)

        return new_config


def config_to_moe_args(config: ModelConfig) -> Dict[str, Any]:
Collaborator:

I think it would be better to have this as an instance method of ModelConfig that can be invoked with something like config.build_moe_args()

@Muennighoff (Collaborator Author):

I think the MoE args may include things outside of the ModelConfig in the future. Currently I put some things that could be considered TrainingConfig params, like moe_zloss_weight, in the ModelConfig, but if we move them to the TrainingConfig in the future, this function would no longer use only the ModelConfig.
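
For reference, a rough sketch of the kind of mapping this function performs; the megablocks keyword names and the subset of fields shown are illustrative and should be checked against the installed megablocks version.

from typing import Any, Dict

def config_to_moe_args(config) -> Dict[str, Any]:
    """Translate OLMo ModelConfig fields into kwargs for megablocks' Arguments."""
    return dict(
        hidden_size=config.d_model,
        ffn_hidden_size=config.mlp_hidden_size,    # assumed field name
        moe_num_experts=config.moe_num_experts,
        moe_top_k=config.moe_top_k,
        moe_loss_weight=config.moe_loss_weight,    # load-balancing loss weight (assumed name)
        moe_zloss_weight=config.moe_zloss_weight,  # router z-loss weight (fork-specific)
    )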

olmo/model.py (outdated, resolved)
olmo/model.py (outdated, resolved)
olmo/optim.py (outdated, resolved)
configs/official/OLMoE-7B-A1B.yaml (outdated, resolved)
configs/official/OLMoE-7B-A1B.yaml (outdated, resolved)
Muennighoff and others added 3 commits, August 20, 2024 20:27 (co-authored by Shane A <shanea@allenai.org>)
@Muennighoff (Collaborator Author):

All tests are passing except the GPU test, which I assume is expected to fail. Feel free to merge 😊

Comment on lines +496 to +499
    moe_interleave: Optional[bool] = False
    """
    Interleave sequential with MoE blocks starting with sequential.
    """
Member:

You tried this? Do we need this setting? I am interested in interleaving, especially with SSM layers, but I don't think we'd want to do it like this. If we don't need this for any config you have run or described in the paper, I'd rather take out this functionality.

    device=config.init_device,
)
self.ff_out._is_residual = True  # type: ignore
if self.config.block_type != BlockType.moe:
Member:

Can you make this dependent on whether the block has a ff_out, instead of the block type?

Member:

with hasattr(), I mean
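
That is, something along these lines (a sketch of the suggested pattern, not the final code; the guarded statements are elided):

# Instead of branching on the block type:
if self.config.block_type != BlockType.moe:
    ...  # dense-only handling of the ff_out projection

# ...branch on whether this block actually defines an ff_out projection:
if hasattr(self, "ff_out"):
    ...  # same handling, driven by the block's actual attributes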

Comment on lines +734 to +735
if hasattr(self.ffn.experts.mlp, "v1"):
    init_normal(self.ffn.experts.mlp.v1, std=in_std, init_cutoff_factor=cutoff_factor)
Member:

What is v1 and why is it optional? Maybe I'll keep reading and find out.

@Muennighoff (Collaborator Author):

It is there when SwiGLU is activated with MoEs (the extra gate projection weight in the megablocks MLP).
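
For reference, a generic SwiGLU sketch showing where such an extra projection fits; the weight names here are illustrative rather than megablocks' exact layout.

import torch
import torch.nn.functional as F

def swiglu_ffn(x: torch.Tensor, w1: torch.Tensor, v1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    """Gated feed-forward: silu(x @ w1) gated by (x @ v1), then projected back with w2."""
    # w1, v1: (d_model, d_ff); w2: (d_ff, d_model). The gate projection (v1 here)
    # only exists when a GLU-style activation such as SwiGLU is used, which is why
    # the init code above guards on hasattr(..., "v1").
    return (F.silu(x @ w1) * (x @ v1)) @ w2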

Comment on lines +1258 to +1264
if self.config.moe_interleave:
    blocks = []
    for i in range(config.n_layers):
        if i % 2 == 0:
            blocks.append(OLMoSequentialBlock(i, config, self.__cache))
        else:
            blocks.append(OLMoEBlock(i, config, self.__cache))
Member:

If we don't need this, I'd rather not have it.

        (self.model.config.block_type != BlockType.moe)
        or (self.model.config.moe_log_expert_assignment is False)
    )
    else torch.zeros((self.model.config.n_layers, self.model.config.moe_num_experts))
Member:

This does not put it on CPU. This puts it on the default device.
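
A minimal illustration of the point (the variable names and sizes are just examples):

import torch

n_layers, num_experts = 16, 64  # example sizes

# If a default device has been set (e.g. via torch.set_default_device("cuda")),
# this allocates the accumulator on that device, not on the CPU:
on_default_device = torch.zeros((n_layers, num_experts))

# Pinning it to the CPU requires saying so explicitly:
on_cpu = torch.zeros((n_layers, num_experts), device="cpu")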

        # Run backward pass.
        loss.backward()

    # Remove output hooks
    for hook in output_hooks:
        hook.remove()

-   return ce_batch_loss, z_batch_loss
+   return ce_batch_loss, z_batch_loss, lb_batch_loss, moe_z_batch_loss, expert_assignments
Member:

@epwalsh, does the new trainer support all of this stuff? This seems like a lot of extra things.

Member:

Not directly but I think it could be supported through the callback system.

lb_loss = batched_load_balancing_loss(self.moe_args) / len(micro_batches)
if self.model.config.moe_log_expert_assignment:
    if self.model.config.moe_zloss_weight:
        tokens_per_expert, _, _ = zip(*get_load_balancing_loss())
Member:

Does get_load_balancing_loss() take care of reducing the expert assignments across ranks?

        tokens_per_expert, _, _ = zip(*get_load_balancing_loss())
    else:
        tokens_per_expert, _ = zip(*get_load_balancing_loss())
    expert_assignments += torch.stack(tokens_per_expert, dim=0).cpu()
Member:

Not a big deal, but are you sure that this is faster on CPU? Back in the day I always thought this too, keep small stuff on the CPU, but in practice doing it all on GPU was always faster.

Member:

If tokens_per_expert were on GPU, this would trigger a host-device sync. In that case it's almost certainly better to keep it on GPU.

@Muennighoff (Collaborator Author):

were on GPU this will trigger a host-device sync

Don't we want to avoid these syncs?
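
One way to sidestep the per-step sync, sketched here with assumed names: accumulate the counts on the model's device and only reduce/copy them to CPU when metrics are actually logged.

import torch
import torch.distributed as dist

n_layers, num_experts = 16, 64  # example sizes
device = "cuda" if torch.cuda.is_available() else "cpu"

# Keep the accumulator on the same device as tokens_per_expert so that
# adding to it every step does not force a host-device sync.
expert_assignments = torch.zeros((n_layers, num_experts), device=device)

def accumulate(tokens_per_expert) -> None:
    expert_assignments.add_(torch.stack(tokens_per_expert, dim=0))

def logged_expert_assignments() -> torch.Tensor:
    # Reduce across ranks and copy to CPU only at logging time.
    counts = expert_assignments.clone()
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(counts)
    return counts.cpu()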
