
TORCH_CHECK used within torch.compile does not throw legible errors #126691

Open
lezcano opened this issue May 20, 2024 · 6 comments
Labels
oncall: cpu inductor (CPU Inductor issues for Intel team to triage), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@lezcano
Collaborator

lezcano commented May 20, 2024

🐛 Describe the bug

At best, you get a

terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
terminate called recursively
[1]    2964411 IOT instruction (core dumped)

At worst, you just get the "terminate called recursively" message, with no mention of c10::Error.

To repro, you can trigger any out-of-bounds access. I got it with:

import torch

@torch.compile
def copy(a):
    # b runs from 1 down to -a.numel(), so most of its entries are out of
    # bounds for dimension 0 of a
    b = torch.arange(start=1, end=-a.numel() - 1, step=-1, device=a.device)
    return a[b]

x = torch.rand(1024, 128, device="cpu")
copy(x)

when testing #114471 (may not repro in master).

Versions

#114471

@lezcano lezcano added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) and oncall: cpu inductor (CPU Inductor issues for Intel team to triage) labels on May 20, 2024
@lezcano
Collaborator Author

lezcano commented May 20, 2024

cc @jgong5

@lezcano
Collaborator Author

lezcano commented May 21, 2024

It doesn't happen for every input. The following input throws a nice error:

import torch

@torch.compile
def fn(x, y):
    # indices run from 3 to x.size(0) + 1, so the last two are out of bounds
    b = torch.arange(x.size(0) - 1, device=x.device) + 3
    return x[b] + y[b]

x = torch.rand(1024, 128, device="cpu")
y = torch.rand(1024, 128, device="cpu")
fn(x, y)

Note that the indexing error in the OP is much more egregious than the one in this second example.

@jgong5
Collaborator

jgong5 commented May 21, 2024

when testing #114471 (may not repro in master).

How can it be reproduced? It cannot be reproduced on master.

@lezcano
Collaborator Author

lezcano commented May 21, 2024

Apply the patch in that PR. There are quite a few issues in master when it comes to issuing device_asserts. Note that for the repro in the OP we don't even generate a TORCH_CHECK; we just read out of bounds.
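For context, this is the kind of guard being talked about. A minimal illustrative sketch, not Inductor's actual generated code: the function, buffer, and parameter names are made up, and only TORCH_CHECK / c10::Error come from the real codebase.

#include <cstdint>
#include <cstring>
#include <c10/util/Exception.h>

void gather_rows(const float* src, const int64_t* index, float* dst,
                 int64_t num_indices, int64_t num_rows, int64_t row_len) {
  for (int64_t i = 0; i < num_indices; ++i) {
    int64_t row = index[i];
    // Without a guard like this the loop silently reads out of bounds;
    // with it, an invalid index throws a c10::Error with a readable message.
    TORCH_CHECK(row >= -num_rows && row < num_rows,
                "index ", row, " is out of bounds for dimension 0 with size ",
                num_rows);
    if (row < 0) {
      row += num_rows;
    }
    std::memcpy(dst + i * row_len, src + row * row_len,
                row_len * sizeof(float));
  }
}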

@zhuhaozhe
Collaborator

zhuhaozhe commented Jun 4, 2024

I have compared the behavior with ATen ops; we need to change two things to throw errors as legible as the ones ATen throws here:

  • The Inductor path does not translate the C++ exception into a Python exception the way ATen does.
  • The Inductor path has no logic to catch an exception thrown inside an OMP parallel region, while at::parallel_for can catch and rethrow it (see the sketch below). The different behavior for the two inputs is also due to this: one hits an OMP parallel region and the other does not.

I will submit a PR for it.
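For reference, a minimal sketch of the second point (not the actual PR): as in at::parallel_for, the idea is to capture any exception raised inside the OMP parallel region via std::exception_ptr and rethrow it on the calling thread, where it can then be translated into a Python exception instead of reaching std::terminate. parallel_kernel and do_work are hypothetical names.

#include <cstdint>
#include <exception>

void do_work(int64_t i);  // hypothetical body that may throw c10::Error via TORCH_CHECK

void parallel_kernel(int64_t n) {
  std::exception_ptr eptr = nullptr;

  #pragma omp parallel for
  for (int64_t i = 0; i < n; ++i) {
    try {
      do_work(i);
    } catch (...) {
      // An exception escaping an OMP parallel region calls std::terminate,
      // so record the first one here instead of letting it propagate.
      #pragma omp critical
      if (!eptr) {
        eptr = std::current_exception();
      }
    }
  }

  // Rethrow on the calling thread, outside the parallel region, so the
  // binding layer can translate the C++ exception into a Python one.
  if (eptr) {
    std::rethrow_exception(eptr);
  }
}

Letting the exception escape the parallel region directly is what produces the repeated "terminate called" / "terminate called recursively" output in the OP, since each worker thread aborts on its own.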

@zhuhaozhe
Collaborator

Submitted a PR here. #127868
