
Revert "Pin PT version: Fix FPX Inductor error" #843

Merged: 8 commits into main from revert-790-msaroufim-patch-17 on Sep 10, 2024

Conversation

@msaroufim (Member) commented Sep 8, 2024

Reverts #790

The problems are mostly CPU-specific; in particular, this feels like a subclass-on-CPU problem, but I'm not sure - cc @bdhirsh


pytorch-bot bot commented Sep 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/843

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 47acb46 with merge base 1b317f9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 8, 2024
@msaroufim (Member, Author)

So the FPX test is no longer failing, but I see some new failures with BitNet that should probably be skipped @andrewor14 before we make the release. Also, save/load is now failing on nightlies @jerryzh168

@msaroufim msaroufim merged commit e283743 into main Sep 10, 2024
17 checks passed
@msaroufim msaroufim deleted the revert-790-msaroufim-patch-17 branch September 10, 2024 14:32
@bdhirsh (Contributor) commented Sep 10, 2024

@msaroufim do you have a link to the failing tests? I can try to help root cause (and/or at least figure out if it's subclass related)

@msaroufim (Member, Author)

Thank you! <3

They're in test/integration/test_integration.py

  • test_int8_weight_only_quant_subclass_api
  • test_int8_weight_only_quant_with_freeze
  • test_save_load_int8woqtensors

They only fail on CPU.
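For context, here is a rough sketch of what these tests broadly exercise (not the actual test code, and assuming torchao's quantize_ / int8_weight_only API rather than whatever helper the test file calls directly): quantize a small Linear model to int8 weight-only and compare eager against torch.compile outputs on CPU.

```python
import torch
from torchao.quantization import quantize_, int8_weight_only  # assumed API

# Small CPU model; the real tests use their own fixtures.
model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval()
x = torch.randn(2, 64)

# Swap Linear weights for int8 weight-only quantized tensor subclasses.
quantize_(model, int8_weight_only())
eager_out = model(x)

# Compile with the default (Inductor) backend and compare numerics.
compiled_out = torch.compile(model)(x)
print(torch.allclose(eager_out, compiled_out, atol=1e-2, rtol=1e-2))
```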

jainapurva pushed a commit that referenced this pull request Sep 10, 2024
* Revert "Pin PT version: Fix FPX Inductor error (#790)"

This reverts commit 287458c.

* udpates

* yolo

* yolo

* yolo

* yolo

@bdhirsh (Contributor) commented Sep 10, 2024

A couple things:

(1) I confirmed that python test/integration/test_integration.py -k test_int8_weight_only_quant_subclass_api_1_cpu fails for me locally

(2) I tweaked the test to compile with a few different backends: compile(..., backend="inductor") fails, but compile(..., backend="aot_eager_decomp_partition") runs fine (see the sketch after this list)

(3) I checked out @eellison's beautiful inductor bisecting tool from here and ran it. It bisected down to the lowering for prims.convert_element_type as the "problematic" bit of inductor. This is technically useful, but doesn't really give us the full story: that lowering hasn't been changed in months, and avoiding lowering that op probably just prevents inductor from doing some other fusions that are causing the numeric difference (I bet if the bisector could bisect on inductor fusions it could tell us more).

(4) Also confirmed that this only repros with cpu inputs.
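As an illustration of the backend comparison in (2), here is a minimal toy sketch (not the torchao test) of how swapping torch.compile backends helps separate an Inductor lowering/fusion issue from a decomposition issue; the explicit dtype cast is only there because such casts lower through prims.convert_element_type.

```python
import torch

def f(x):
    # The cast lowers through prims.convert_element_type in Inductor.
    return (x.to(torch.float32) * 3).sum()

x = torch.randn(8, dtype=torch.float16)
ref = f(x)  # eager reference

# "aot_eager_decomp_partition" runs AOTAutograd with decompositions but skips
# Inductor codegen, so a mismatch only under "inductor" points at lowering
# or fusion behavior rather than the decompositions themselves.
for backend in ("inductor", "aot_eager_decomp_partition"):
    out = torch.compile(f, backend=backend)(x)
    print(backend, torch.allclose(ref, out))
```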

Idk, @eellison do you know of any suspicious commits coming from the cpu-inductor side in the last week that could affect numerics, especially in relation to casting ops?

@leslie-fang-intel (Contributor)

Hi @bdhirsh

(1) I confirmed that python test/integration/test_integration.py -k test_int8_weight_only_quant_subclass_api_1_cpu fails for me locally

Is it the same failure reported in #890?

@eellison commented Sep 12, 2024

@bdhirsh, I was on PTO last week so I'm not sure. Sounds like @leslie-fang-intel has an idea. It would maybe be easier to bisect.

@bdhirsh (Contributor) commented Sep 12, 2024

@leslie-fang-intel not sure, but it seems unlikely (mainly because the issue you linked is a hard error raised while the subclass runs at compile time, whereas this is a runtime / bad-numerics error)

@leslie-fang-intel (Contributor) commented Sep 13, 2024

Thanks @bdhirsh. It sounds like a different error then. Maybe after we resolve the first one, we will hit the numeric error you saw. Do you have any idea about the hard error in #890, and why it didn't happen in your test environment?

----------------- Update ---------------

  • PyTorch: 86335e913557962bf8d415c80dcb7e615468ba42
  • AO: 8236a87

Comment out these lines to enable the test:

It seems we saw two different errors when running the two different UTs.

@leslie-fang-intel (Contributor)

Hi @bdhirsh, I have created PRs to fix these two failures:

  • PR in Torch: [AO][Inductor] Enable WOQ fusion pattern with permute pytorch#135928 addresses the numerical failures, which were caused by the missing fusion of this WOQ int8 pattern, so it fell back to the reference implementation. By re-enabling the fusions in Inductor for this pattern, the unit tests now pass on my local system.
  • PR in AO: Fix WOQ int8 failures #884. As for the other UT failure, python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze, it seems we still need unwrap_tensor_subclass for Inductor freezing (a sketch of that workaround follows this comment). cc @eellison here.

Please kindly help review these two PRs.
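For illustration, a minimal sketch of the freezing workaround mentioned above, assuming torchao's quantize_ / int8_weight_only and torchao.utils.unwrap_tensor_subclass APIs and the torch._inductor.config.freezing flag (not the exact test setup): flatten the quantized tensor subclasses into plain parameters before compiling with freezing enabled.

```python
import torch
from torch._inductor import config as inductor_config
from torchao.quantization import quantize_, int8_weight_only  # assumed API
from torchao.utils import unwrap_tensor_subclass               # assumed API

# Enable Inductor freezing (the behavior the *_with_freeze test is named after).
inductor_config.freezing = True

model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval()
quantize_(model, int8_weight_only())

# Per the discussion above, flattening the tensor subclasses (via
# parametrizations) still appears to be required before freezing works.
unwrap_tensor_subclass(model)

x = torch.randn(2, 64)
with torch.no_grad():
    out = torch.compile(model)(x)
```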

@bdhirsh (Contributor) commented Sep 17, 2024

@jerryzh168 I vaguely remember it being a problem to re-enable unwrap_tensor_subclass() in the torch.compile path for torchao, since dynamo had a bad interaction with the parametrization code. Does that ring a bell?

Either way, we should fix the freezing interaction (@IvanKobzarev is taking a look)
