
Integrating Riemannian Preconditioner #1807

Draft · wants to merge 9 commits into base: main
Conversation

@fangzhaozhang commented May 28, 2024

Paper link: https://arxiv.org/pdf/2402.02347
This is an attempt to integrate a special optimizer for LoRA training into the current Hugging Face PEFT codebase. We follow the structure of the PR that adds LoRA+ (#1509).
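
For orientation, here is a minimal usage sketch under the draft's current API. The helper name, its arguments, and the `is_lora`/`reg` conventions are taken from this PR; the import path, base model, LoRA targets, and hyperparameter values are placeholders, not confirmed choices:

    import torch
    from transformers import AutoModelForCausalLM

    from peft import LoraConfig, get_peft_model
    from peft.optimizers import create_riemannian_optimizer  # import path assumed

    base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
    model = get_peft_model(base, LoraConfig(r=8, target_modules=["q_proj", "v_proj"]))

    # Riemannian-preconditioned AdamW over the LoRA parameters; `reg` is the
    # regularizer added before inverting lora_B.T @ lora_B (value is a placeholder)
    optimizer = create_riemannian_optimizer(
        model=model,
        optimizer_cls=torch.optim.AdamW,
        optimizer_kwargs={"lr": 2e-4},
        reg=5.0,
    )

    # one manual optimization step as a smoke test
    inputs = torch.tensor([[1, 2, 3, 4]])
    loss = model(input_ids=inputs, labels=inputs).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()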

@fangzhaozhang fangzhaozhang marked this pull request as draft May 28, 2024 23:04
@fangzhaozhang (Author)

We have added a test file in peft/tests/riemannian_test.py which uses the new optimizer to train an LLM with the Trainer class.

@BenjaminBossan (Member) left a comment

Thanks a lot for creating this draft PR to add Riemannian AdamW. I did a first review but haven't yet looked at the exact implementation details or compared them to the paper. I added some comments which, if addressed, will help me better understand what's going on.

Apart from the code comments I added, I have some more general comments:

  1. This PR contains the code from the lora+ PR. Please remove it.
  2. Could you please run make style?
  3. If some of this code is copied over from https://github.com/pilancilab/Riemannian_Preconditioned_LoRA or elsewhere, please add a comment with a reference.
  4. You added a test but it does not have the form of a proper unit test. I think it would be better to rewrite this a bit and add it to the examples/ directory, as it's more akin to an example.
  5. Regarding proper unit tests, check out the tests from the lora+ PR; a rough sketch follows below. LMK if you need more guidance.

I know that overall, this seems to be a lot of work, but I'm sure we can get this into a good shape. If you have any questions, don't hesitate to ask.
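
For reference, a rough sketch of what such a unit test could look like, modeled on the LoRA+ helper tests: the import path, helper signature, and the `is_lora` group flag are taken from this draft, while the tiny test model is a placeholder.

    import torch
    from transformers import AutoModelForCausalLM

    from peft import LoraConfig, get_peft_model
    from peft.optimizers import create_riemannian_optimizer  # import path assumed


    def test_lora_params_get_a_dedicated_group():
        base = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gpt2")
        model = get_peft_model(base, LoraConfig(target_modules=["c_attn"]))
        optim = create_riemannian_optimizer(
            model=model, optimizer_cls=torch.optim.AdamW, optimizer_kwargs={"lr": 1e-4}, reg=5.0
        )
        # the draft marks the LoRA parameter group with an `is_lora` flag
        assert any(group.get("is_lora") for group in optim.param_groups)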

model (`torch.nn.Module`): The model to be optimized.
optimizer_cls (`torch.optim.Optimizer`): The optimizer class to be used.
optimizer_kwargs (`dict`): Additional keyword arguments to be passed to the optimizer.
- lr_embedding (`float`): The learning rate to be used for the embedding layer. Defaults to lr_embedding
@BenjaminBossan (Member)

Let's use the same indentation and syntax as the other parameters. Also, let's add docs for reg.

@fangzhaozhang (Author)

done

@BenjaminBossan (Member)

Hmm, indentation is still wrong. It should be:

        optimizer_kwargs (`dict`): Additional keyword arguments to be passed to the optimizer.
        lr_embedding (`float`): The learning rate to be used for the embedding layer. Defaults to lr_embedding
        reg (`float`): Regularization parameter for the Riemannian preconditioner. Included for LoRA parameters only

- lr_embedding (`float`): The learning rate to be used for the embedding layer. Defaults to lr_embedding
"""

"""TEST VERSION FOR ADAMW"""
@BenjaminBossan (Member)

For code comments, use # and not strings.

@fangzhaozhang (Author)

done

"""

"""TEST VERSION FOR ADAMW"""
assert optimizer_cls.__name__=='AdamW', 'TEST version only supports AdamW optimizer'
@BenjaminBossan (Member)

Let's not use assert in code (only tests). Here, it is better to raise a TypeError. Also, I wonder: does the class have to be AdamW or can it be a subclass? If the latter, you can change the check to: if not issubclass(optimizer_cls, torch.optim.AdamW).
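
For illustration, the check could look like this (a sketch; the error message is a placeholder):

    if not issubclass(optimizer_cls, torch.optim.AdamW):
        raise TypeError(
            f"create_riemannian_optimizer only supports AdamW, got {optimizer_cls.__name__}"
        )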

@fangzhaozhang (Author)

done

for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    # print(name, param.shape)
@BenjaminBossan (Member)

Please remove.

@fangzhaozhang (Author)

done

"""
Creates a Riemannian optimizer.
Implementation: https://github.com/pilancilab/Riemannian_Preconditioned_LoRA
Reference: https://arxiv.org/pdf/2402.02347
@BenjaminBossan (Member)

Let's mention that this only works for LoRA.

@fangzhaozhang (Author)

done


for group in self.param_groups:
    if group['is_lora']:
        for p1, p2 in list(zip(group["params"], group["params"][1:]))[::2]:
@BenjaminBossan (Member)

Let me try to understand this: I think we iterate over pairs of lora_A and lora_B, which is why we have the zip and the [::2]. Is that it?

I wonder if we can make the assumption that pairs of lora_A and lora_B are always following consecutively. E.g. what would happen if we have use_dora=True, could it happen that we now suddenly have triplets?

@fangzhaozhang (Author)

Your understanding is correct, and this is exactly what I'm worried about too. In our paper, for each LoRA pair (lora_A, lora_B), we use grad(lora_A) @ inverse(lora_B' lora_B) in place of the vanilla grad(lora_A). Empirically, this preconditioned gradient performs better than the vanilla gradient with respect to loss minimization. Moreover, since lora_B' lora_B has shape r x r, inverse(lora_B' lora_B) is expected to be cheap, especially for small r. Our original implementation is basic and simply iterates with [::2].

As you mentioned, I'm not sure how to pair up (lora_A, lora_B) in an error-free way. For DoRA, since we also have the magnitude term, I feel it's better to build the pairs by matching names, i.e. pairing "layer1_attentionq_lora_A" with "layer1_attentionq_lora_B". This also preserves the correct ordering, since I don't think we can assume each lora_A is immediately followed by its corresponding lora_B.

Moreover, the [::2] loop is slow compared to a simple AdamW loop, so on top of the inverse operation we also suffer loop runtime overhead. Shall we instead keep a dict each for the lora_A and lora_B parameters and look up the corresponding value by key when needed? A sketch of this idea follows below.
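
To make the name-matching idea concrete, here is a sketch of pairing by parameter name rather than by position. The lora_A/lora_B substrings follow PEFT's module naming; the surrounding code is hypothetical:

    # collect LoRA parameters keyed by their module prefix, then pair A with B by key
    lora_A, lora_B = {}, {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "lora_A" in name:
            lora_A[name.replace("lora_A", "")] = param
        elif "lora_B" in name:
            lora_B[name.replace("lora_B", "")] = param
        # DoRA magnitude vectors and other parameters simply fall through

    if set(lora_A) != set(lora_B):
        raise ValueError("every lora_A parameter must have a matching lora_B parameter")

    pairs = [(lora_A[key], lora_B[key]) for key in lora_A]  # order-independent pairing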

for p1, p2 in list(zip(group["params"], group["params"][1:]))[::2]:
    grad = p1.grad
    if grad.is_sparse:
        raise RuntimeError("Adam does not support sparse gradients, please consider SparseAdam instead")
@BenjaminBossan (Member)

Suggested change
-    raise RuntimeError("Adam does not support sparse gradients, please consider SparseAdam instead")
+    raise RuntimeError(f"{self.__class__.__name__} does not support sparse gradients")

Not sure if it makes sense to suggest SparseAdam here.

reg_I = self.defaults['reg'] * torch.eye(min(p2.shape)).to(p2.device)
scaler = torch.inverse(scaler @ scaler.T + reg_I) if p2.shape[0] < p2.shape[1] \
    else torch.inverse(scaler.T @ scaler + reg_I)
assert scaler.shape[0] == min(p2.data.shape), 'wrong dimension'
@BenjaminBossan (Member)

Again, let's not use assert but raise a proper error here (ValueError with a useful message).
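
For example, something along these lines (a sketch; the exact message is up to you):

    if scaler.shape[0] != min(p2.data.shape):
        raise ValueError(
            f"Expected a preconditioner of size {min(p2.data.shape)} but got {scaler.shape[0]}; "
            "the lora_A/lora_B pairing is likely broken."
        )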

    else torch.inverse(scaler.T @ scaler + reg_I)
assert scaler.shape[0] == min(p2.data.shape), 'wrong dimension'
except:
    print('invalid condition')
@BenjaminBossan (Member)

Remove

if group["weight_decay"] > 0.0:
p2.add_(p2, alpha=(-group["lr"] * group["weight_decay"]))

else:
@BenjaminBossan (Member)

Is this code path normal AdamW or are there changes in here too? Adding a comment would be helpful.

@BenjaminBossan (Member)

@fangzhaozhang do you still plan on working on this?

@fangzhaozhang (Author) commented Jun 28, 2024 via email

@fangzhaoz

I'm back on the implementation. Thanks so much for your detailed comments. With respect to the general points:

  1. I've removed the LoRA+ code.
  2. I've run make style.
  3. I've added a reference link to our original implementation.
  4. I've moved the prior test to examples/riemannian_lora and rewrote a test in tests/test_riemannian_lora.py following LoRA+'s tests/test_loraplus_helper.py. Let me know whether this is the desired unit test form.

I've also fixed small issues such as code comments, function names, etc. as suggested in the comments above. However, I'm not sure about the following points:

  1. Our current implementation is a rewrite of transformers' AdamW (https://github.com/huggingface/transformers/blob/v4.42.0/src/transformers/optimization.py#L558). Shall we instead follow the torch.optim.AdamW implementation, which is more complete though more complex?
  2. Our method has quite different logic from LoRA+. LoRA+ serves as an optimizer wrapper that only changes the learning rate settings, whereas we are closer to writing a new optimizer customized to LoRA, since we change the optimizer's inner workflow. LoRA+ is applicable to all optimizers such as Adam, AdamW, Adagrad, etc., while our paper only describes modifications to SGD and AdamW. Thus I'm not sure whether it's best to make our method appear in peft/optimizers in parallel with LoRA+; it feels more natural to put our optimizer in parallel with the AdamW implementation, or just pass in a parameter like lora=True to transformers' AdamW in order to switch to our method. Besides, our method is not directly applicable to bitsandbytes and other quantized forms, since torch.inverse() only supports certain dtypes. Shall we do a dtype conversion before and after we compute torch.inverse() to make it more general? (See the sketch after this list.)
  3. The iteration method also confuses me: shall we switch to dicts of lora_A/lora_B parameters and query them by key, compared to the current [::2] setting?
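
Regarding the dtype question in point 2, a minimal sketch of the round-trip, assuming scaler holds the lora_B weight and reg_I is the regularizer term from the draft:

    # upcast for torch.inverse, which only supports float/double (and their
    # complex counterparts), then cast the preconditioner back to the working dtype
    orig_dtype = scaler.dtype
    gram = (scaler.T @ scaler).float() + reg_I.float()
    precond = torch.inverse(gram).to(orig_dtype)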

I'd be glad to hear your feedback/suggestions on the above questions.

@BenjaminBossan (Member) left a comment

Thanks a lot for the updates. We're getting closer but there are still a few areas that need to be improved.

Also, note that the LoRA+ PR is now moved to #1915 with a few changes.

Thus I'm not sure whether it's best to make our method appear in peft/optimizers in parallel with LoRA+; it feels more natural to put our optimizer in parallel with the AdamW implementation, or just pass in a parameter like lora=True to transformers' AdamW in order to switch to our method

Since this is very PEFT specific, I think the best fit is indeed here. It would be quite hard to convince transformers to add this very specific change.

  2. Besides, our method is not directly applicable to bitsandbytes and other quantized forms, since torch.inverse() only supports certain dtypes. Shall we do a dtype conversion before and after we compute torch.inverse() to make it more general?

If you can implement a version that works with quantized weights, that would be great. If not, that's also okay, but then let's document this clearly.

Comment on lines +1 to +5
# flake8: noqa
# There's no way to ignore "F401 '...' imported but unused" warnings in this
# module, but to preserve other warnings. So, don't check this module at all

# coding=utf-8
@BenjaminBossan (Member)

These lines can be removed. At the bottom of the file, add __all__ = ["create_riemannian_optimizer"]

# module, but to preserve other warnings. So, don't check this module at all

# coding=utf-8
# Copyright 2023-present the HuggingFace Inc. team.
@BenjaminBossan (Member)

Suggested change
-# Copyright 2023-present the HuggingFace Inc. team.
+# Copyright 2024-present the HuggingFace Inc. team.


Comment on lines +34 to +35
if not issubclass(optimizer_cls, torch.optim.AdamW):
    raise TypeError("TEST version only supports AdamW optimizer")
@BenjaminBossan (Member)

Since the optimizer_cls argument is not actually used, except to raise an error, how about removing it completely?

def create_riemannian_optimizer(
    model: PeftModel,
    optimizer_cls: type[Optimizer],
    optimizer_kwargs: dict,
@BenjaminBossan (Member)

Since you probably took this from the LoRA+ PR, let me refer to the comment I put there:

A suggestion: Let's remove optimizer_kwargs and just add **kwargs. IMO, that makes calling this function easier, as we can use create_riemannian_optimizer(..., weight_decay=1e-3) instead of create_riemannian_optimizer(..., optimizer_kwargs={..., "weight_decay": 1e-3}). And since lr is not optional, let's make this a normal arg of create_riemannian_optimizer.
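
Applied to this function, the suggested signature might look like the following sketch (the reg and lr_embedding defaults are placeholders, not values from the draft):

    from torch.optim import Optimizer

    from peft import PeftModel


    def create_riemannian_optimizer(
        model: PeftModel,
        optimizer_cls: type[Optimizer],
        lr: float,
        reg: float = 5.0,  # placeholder default
        lr_embedding: float = 1e-6,  # placeholder default
        **kwargs,
    ) -> Optimizer:
        # remaining keyword arguments are forwarded straight to optimizer_cls
        ...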

Comment on lines +139 to +141
for group in self.param_groups:
    if group["is_lora"]:
        for p1, p2 in list(zip(group["params"], group["params"][1:]))[::2]:
@BenjaminBossan (Member)

As discussed in the other comment, this is indeed error prone. For this, the logic here:

https://github.com/huggingface/peft/pull/1807/files#diff-4730f831ea49f19ef126ffa6d712865c57a477585e4098b74acb6026d3056d5aR46-R47

should be improved. I think it's better if we create two separate groups for lora_A and lora_B. After the loop there, let's also check that both groups have the same length and that the length is > 0. In the optimizer_grouped_parameters, we can set "is_lora_A": True and "is_lora_B": True accordingly.

After making this change, the line here could be simplified to:

# this works because there is exactly one lora_A and one lora_B group
lora_A_group = next(group for group in self.param_groups if group.get("is_lora_A"))
lora_B_group = next(group for group in self.param_groups if group.get("is_lora_B"))
for p1, p2 in zip(lora_A_group["params"], lora_B_group["params"]):
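
A sketch of what the corresponding grouping inside create_riemannian_optimizer could then look like (names assumed; it mirrors the suggestion above):

    # build separate, order-aligned parameter groups for lora_A and lora_B
    lora_A_params, lora_B_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "lora_A" in name:
            lora_A_params.append(param)
        elif "lora_B" in name:
            lora_B_params.append(param)

    if not lora_A_params or len(lora_A_params) != len(lora_B_params):
        raise ValueError("expected matching, non-empty lists of lora_A and lora_B parameters")

    optimizer_grouped_parameters = [
        {"params": lora_A_params, "is_lora_A": True, "is_lora_B": False},
        {"params": lora_B_params, "is_lora_A": False, "is_lora_B": True},
    ]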

    if p2.shape[0] < p2.shape[1]
    else torch.inverse(scaler.T @ scaler + reg_I)
)
assert scaler.shape[0] == min(p2.data.shape), "wrong dimension"
@BenjaminBossan (Member)

Let's not use assert, instead raise a proper ValueError with a helpful message.

    if p1.shape[0] < p1.shape[1]
    else torch.inverse(scaler.T @ scaler + reg_I)
)
assert scaler.shape[0] == min(p1.data.shape), "wrong dimension"
@BenjaminBossan (Member)

Let's not use assert, instead raise a proper ValueError with a helpful message.

    else torch.inverse(scaler.T @ scaler + reg_I)
)
assert scaler.shape[0] == min(p2.data.shape), "wrong dimension"
except RuntimeError:
@BenjaminBossan (Member)

Could you explain why this is needed? Could we instead check the condition and do something like if valid_condition: ... else: scaler = None. Let's completely avoid printing messages.
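
One possible shape for this, assuming the try/except is guarding against a (near-)singular matrix; the rank check is a hypothetical replacement, not code from the draft:

    # guard the inverse explicitly instead of catching RuntimeError; fall back
    # to an unpreconditioned update when the regularized Gram matrix is singular
    gram = scaler @ scaler.T if p2.shape[0] < p2.shape[1] else scaler.T @ scaler
    gram = gram + reg_I
    if torch.linalg.matrix_rank(gram) == gram.shape[0]:
        scaler = torch.inverse(gram)
    else:
        scaler = None  # caller skips preconditioning for this pair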

)
assert scaler.shape[0] == min(p1.data.shape), "wrong dimension"
except RuntimeError:
    print("invalid condition")
@BenjaminBossan (Member)

Could you explain why this is needed? Could we instead check the condition and do something like if valid_condition: ... else: scaler = None. Let's completely avoid printing messages.

@kallewoof (Contributor)

Cool! We should ensure that we add documentation clarifying whether this works together with LoRA+ or whether the two are mutually exclusive for some reason.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@BenjaminBossan (Member)

@fangzhaozhang Do you still plan on finishing this PR?


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
