Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault when running benchmark after rustification #381

Open
tgiani opened this issue May 23, 2024 · 10 comments
Open

Segmentation fault when running benchmark after rustification #381

tgiani opened this issue May 23, 2024 · 10 comments
Labels
bug Something isn't working help wanted Extra attention is needed question Further information is requested rust Rust extension related

Comments

@tgiani
Copy link
Contributor

tgiani commented May 23, 2024

When running benchmarcks after rustification I get a segmentation fault error

To Reproduce

  1. ./rustify.sh
  2. poe compile
  3. poe lha -m "ffns and lo and not sv"

Expected behavior
On my machine the code crashes with Fatal Python error: Segmentation fault

@tgiani tgiani added help wanted Extra attention is needed question Further information is requested labels May 23, 2024
@tgiani tgiani changed the title segmentation fault when running benchmark after rustification Segmentation fault when running benchmark after rustification May 23, 2024
@alecandido
Copy link
Member

alecandido commented May 24, 2024

This is for sure related to #369, but in these cases is better to give the exact commit to make the issue reproducible.

Rust is pretty much aggressive in preventing memory issues, and Python code on its own can not cause Segmentation fault (it should come from an underlying compiled extension).

So, most likely the problem is generated in some unsafe code, and it should be all in a single file:

pub unsafe extern "C" fn rust_quad_ker_qcd(u: f64, rargs: *mut c_void) -> f64 {

type PyQuadKerQCDT = unsafe extern "C" fn(

pub unsafe extern "C" fn my_py(

pub unsafe extern "C" fn empty_qcd_args() -> QuadQCDargs {

(it's the code used to bridge the Rust callables with the SciPy quadrature integration)

@felixhekhorn
Copy link
Contributor

felixhekhorn commented May 24, 2024

@tgiani can you please check on master?


@niclaurenti @andreab1997 @giacomomagni can someone with a Mac please check this on b7d4e8d?


@alecandido we have some hints that this is actually related to this line (which is master)

areas = nb.carray(areas_raw, (areas_x, areas_y))

i.e. it seems related to the areas definition; the workflow is as follows:

  1. we define the areas in python as a 2D array
  2. the array is passed via FFI to Rust
    + areas_ffi = ekors.ffi.new(
    + f"double[{areas_len}]", curareas.flatten().tolist()
    + )
  3. the array is passed untouched back to python
    args.areas,
  4. the array is read back in python with the line above

@felixhekhorn felixhekhorn added bug Something isn't working rust Rust extension related labels May 24, 2024
@tgiani
Copy link
Contributor Author

tgiani commented May 24, 2024

@felixhekhorn I have the same problem on master

@alecandido
Copy link
Member

Here there are a bunch of raw pointers around, due to SciPy interface

However, in principle every structure should have a C-compatible representation, and every function should respect the C calling convention.
So, if everything is compiled in a compatible way, and linked together, it's supposed to work.

However, I fear it could somehow be a Mac-related problem, or even a Conda-related one, so I'm trying to test on both Linux and MacOS (but I'm using Nix, not Conda, so let's see what will be the outcome).

@tgiani
Copy link
Contributor Author

tgiani commented May 24, 2024

@alecandido I m not using conda here, just poetry, so probably that s not the problem

@alecandido
Copy link
Member

alecandido commented May 24, 2024

On 1a43a7d (current #369), I'm actually passing on Linux, while on MacOS I'm getting the following error:

benchmarks/lha_paper_bench.py  ────────────────────────────────────────
  Theories: 1 OCards: 1 PDFs: 1 ext: LHA
 ────────────────────────────────────────
Computing for theory=88d687f, ocard=8beae75 and pdf=ToyLH ...
Exception ignored in: '<numba.core.cpu.CPUContext object at 0x1656c8cd0>'
Traceback (most recent call last):
  File "/Users/alessandro/Projects/pdf/eko/.venv/lib/python3.11/site-packages/numba/cpython/numbers.py", line 1100, in complex_div
    raise ZeroDivisionError("complex division by zero")
ZeroDivisionError: complex division by zero
Exception ignored in: '<numba.core.cpu.CPUContext object at 0x1656c8cd0>'
Traceback (most recent call last):
  File "/Users/alessandro/Projects/pdf/eko/.venv/lib/python3.11/site-packages/numba/cpython/numbers.py", line 1100, in complex_div
    raise ZeroDivisionError("complex division by zero")
ZeroDivisionError: complex division by zero

and it's getting stuck.

But I've not been able to reproduce the segfault yet.

@felixhekhorn
Copy link
Contributor

  • (the so-called 1B$ mistake)

if you have suggestions for improvements, you're welcome 🙃 however, communicating between languages using a third-party library will always be a hassle (by the way: the Rust quad_vec will have to support such a crazy=real-life use case)

I'm actually passing on Linux

good! which makes me confident that the code is not completely crazy and ...

However, I fear it could somehow be a Mac-related problem

I'm afraid so as well - which makes it hard for me to debug 🙈 it is very strange that you get a different error and more over a different error type, i.e. something that hints at a math problem ... (or maybe it is the same and you read some random numbers from somewhere?)

and it's getting stuck.

as I fell myself into that trap: remember to debug with a single core

@alecandido
Copy link
Member

if you have suggestions for improvements, you're welcome 🙃 however, communicating between languages using a third-party library will always be a hassle (by the way: the Rust quad_vec will have to support such a crazy=real-life use case)

Well, consider that JSON is capable to represent all the types that you'd like to have (strings, integers, floats, together with composition as unbounded lists and maps) is perfectly represented in the Rust type system, without the need of escaping to a void* hack. Otherwise, a general JSON loader would be impossible (though you may want to know your type at compile time anyhow, if possible: that's the magic of generics).

@alecandido
Copy link
Member

I'm afraid so as well - which makes it hard for me to debug 🙈 it is very strange that you get a different error and more over a different error type, i.e. something that hints at a math problem ... (or maybe it is the same and you read some random numbers from somewhere?)

Yeah, when it becomes platform-dependent is immediately more complex, not only because of the availability of devices (that's the main motivation behind my laptop, but not sufficient to solve problems in a snap...).
Unfortunately, just trying here might not be enough, and we may have to start diving into the debugger, finding an efficient way to cross the language boundary[*] (possibly, with two different debug sessions at hand). I've never done something like that: I often try to decouple the native library and the Python run with a thin wrapper - but here the wrapper itself can not be thin. But we're not alone (NumPy and Python itself should be debugging in the same conditions), so, there should be a proper way (we could try to read their developers' guides).

[*]: unless we come with a clever intuition (if you know the solution, you can always just apply it)

as I fell myself into that trap: remember to debug with a single core

I had this suspect as well, but unfortunately I tried, changing the value in the lha_paper_bench.py runner:

benchmarks/lha_paper_bench.py --- Python
118 118                     "mugrid": [100],
119 119                     "ev_op_iterations": 10,
120 120                     "interpolation_xgrid": lambertgrid(60).tolist(),
... 121                     "n_integration_cores": 1,
121 122                 }
122 123             ],
123 124             ["ToyLH"],

@felixhekhorn
Copy link
Contributor

Well, consider that JSON is capable to represent all the types that you'd like to have

just to say that it is slightly more complicated 🙃

  1. we are also passing function pointers around which do not fit inside JSON
  2. we are passing through scipy, which - I think - always goes through void * (what else should you do for a general purpose library?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed question Further information is requested rust Rust extension related
Projects
None yet
Development

No branches or pull requests

3 participants