[Discussion] Overhead in MXNet Execution #14883

szha · 2019-05-05T23:39:30Z

Hi,

I'm starting this thread to collect issues that are likely caused by various overhead in MXNet execution so that we can focus the efforts in solving them. The focus is everything other than the operators and kvstores.

If you've seen performance issues that you found were due to the overhead in MXNet execution (e.g. ctypes/FFI call, asynchronous execution, graph construction overhead, memory release and clean-up), please either report it in comments or link relevant issue.

If you're unsure, please create a performance issue report in another github issue instead.

Be sure to include these details:

the performance issue.
what you tried to solve it.
timings you've collected if any.

rvardimon · 2019-06-04T17:05:53Z

#15148

Jerryzcn · 2019-08-30T05:11:18Z

MXNet function call overhead is quite high
NumPy call overhead: 0.7 microsecond
TVM call overhead: 3.4 microsecond
MXNet call overhead: 28.3 microsecond

Code attached for benchmarking the function call overhead.

import numpy as np
from matplotlib import pyplot as plt
from IPython import display

def benchmark(func, n_start, n_end, n_stride=1):
    avg_times, sizes = [], (2**np.arange(n_start, n_end, n_stride))
    np.random.seed(0)
    for size in sizes:
        avg_times.append(func(size))
    return sizes, np.array(avg_times)

def np_copy(size):
    x = np.random.normal(size=size).astype('float32')
    y = np.empty_like(x)
    res = %timeit -o -q np.copyto(y, x)
    return res.average

_, times = benchmark(np_copy, 1, 8)
print('NumPy call overhead: %.1f microsecond'% (times.mean()*1e6,))

import tvm

def tvm_copy(size):
    x = np.random.normal(size=size).astype('float32')
    y = np.empty_like(x)
    x, y = tvm.nd.array(x), tvm.nd.array(y)
    res = %timeit -o -q x.copyto(y)
    return res.average

_, times = benchmark(tvm_copy, 1, 8)
print('TVM call overhead: %.1f microsecond'% (times.mean()*1e6,))

import mxnet as mx

def mx_copy(size):
    x = np.random.normal(size=size).astype('float32')
    y = np.empty_like(x)
    x, y = mx.nd.array(x), mx.nd.array(y)
    res = %timeit -o -q x.copyto(y)
    return res.average

_, times = benchmark(mx_copy, 1, 8)
print('MXNet call overhead: %.1f microsecond'% (times.mean()*1e6,))

Jerryzcn · 2019-08-30T05:26:08Z

issue also reported here: #8112

sxjscience · 2019-09-04T05:33:21Z

Some profiling results:
https://gist.github.com/sxjscience/f069d49b250820be4483debfa9a8d4b5

Here I used NaiveEngine

MXNet: 25.77195848737444
JAX: 58.22208949497768
Numpy: 1.0632106236049108
PyTorch: 13.021196637834823
TensorFlow: 21.734169551304408

sxjscience · 2019-09-09T04:13:15Z

The overhead is 33.87945038931711 after we used the default Engine (comment NaiveEngine line)

sxjscience · 2019-09-11T18:28:26Z

In P3.2x:

With NaiveEngine:

MXNet: 49.990619931902195
JAX: 132.80538150242398
TVM: 4.161664417811803
Numpy: 1.8179416656494143
PyTorch: 20.977394921439036
TensorFlow: 47.540392194475444

With ThreadedEngine

MXNet: 61.272689274379196
JAX: 141.04713712419783
TVM: 4.231657300676618
Numpy: 1.918860844203404
PyTorch: 20.21517072405134
TensorFlow: 44.17456899370466

szha · 2019-11-27T02:10:57Z

One idea that gained some popularity after discussion is to introduce an engine-less mode to MXNet, in which the operators are exposed in API and dispatched in a similar way as pytorch.

Given that Naive Engine option should be quite close to this already, we should verify the overhead in the Naive Engine mode and judge the necessity based on that result. Since the target is to measure the overhead, we will want to control the performance difference in operators.

Kh4L · 2019-11-27T10:58:38Z

@sxjscience what do the differences between Naive and ThreadedEngine reflect in the other frameworks?

eric-haibin-lin · 2019-12-19T07:03:35Z

proposal in #17097

tqchen · 2019-12-19T23:15:44Z

i just would like to comment a bit about the tvm related benchmark. The copyto function was not really a good benchmark for TVM's FFI cost mainly because that function was implemented directly via ctypes because we didn't find it to be a bottleneck. This functionality could have been turned into cython(which is used to support the core functionality), which could speedup the overhead of the call.

The current best way test the FFI is through Packed function calls. PackedFunc is the place where most of the API functions are exposed, here I cross posted a quick benchmark on #17097

Note that it is important to compile TVM with Cython by typing make cython3 in the root folder. TVM by default uses cython if it is available, but we use the TVM_FFI env variable to make sure it is the case.

import timeit

setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop(x, y)')
timer.timeit(1)
num_repeat = 1000
print(timer.timeit(num_repeat) / num_repeat)

Results on my laptop(macbook pro 13inch)

$ TVM_FFI=cython python3 test.py
9.749500000000299e-08
$ TVM_FFI=ctypes python3 test.py
7.226789000000011e-06

tqchen · 2019-12-19T23:45:34Z

To followup this again, the patch apache/tvm#4549 fixed the copyto issue by cythonize it. Here is an updated script to add benchmark of copyto

import timeit

setup = """
import tvm
x = tvm.nd.array([0])
y = tvm.nd.array([1])
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop(1, 2)')
timer.timeit(1)
num_repeat = 1000
print("TVM-nop:", timer.timeit(num_repeat) / num_repeat)


setup = """
import numpy as np
import tvm
x = np.random.normal(size=1).astype('float32')
y = np.empty_like(x)
"""

timer = timeit.Timer(setup=setup,
                     stmt='np.copyto(y, x)')
timer.timeit(1)
print("numpy.copyto:", timer.timeit(num_repeat) / num_repeat)


setup = """
import numpy as np
import tvm
x = np.random.normal(size=1).astype('float32')
y = np.empty_like(x)
x = tvm.nd.array(x)
y = tvm.nd.array(y)
"""

timer = timeit.Timer(setup=setup,
                     stmt='x.copyto(y)')
timer.timeit(1)
print("tvm.copyto:", timer.timeit(num_repeat) / num_repeat)

Results on my laptop(macbook pro 13inch)

$TVM_FFI=ctypes python test.py
TVM-nop: 5.864649e-06
numpy.copyto: 3.7303100000002364e-07
tvm.copyto: 8.198690000000008e-07

$ TVM_FFI=cython python test.py
TVM-nop: 1.1101499999999209e-07
numpy.copyto: 2.519029999999978e-07
tvm.copyto: 2.3881700000000226e-07

This comment has been minimized.

Sign in to view

szha added Discussion Performance labels May 5, 2019

szha mentioned this issue Sep 13, 2019

[RFC] Apache MXNet 2.0 Roadmap #16167

Open

junrushao mentioned this issue Dec 20, 2019

[RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead #17097

Open

anko-intel mentioned this issue May 25, 2020

DOT product too slow on CPU and GPU compared to np and pytorch #17971

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Discussion] Overhead in MXNet Execution #14883

[Discussion] Overhead in MXNet Execution #14883

szha commented May 5, 2019

This comment has been minimized.

rvardimon commented Jun 4, 2019

Jerryzcn commented Aug 30, 2019 •

edited

Loading

Jerryzcn commented Aug 30, 2019 •

edited

Loading

sxjscience commented Sep 4, 2019

sxjscience commented Sep 9, 2019

sxjscience commented Sep 11, 2019 •

edited

Loading

szha commented Nov 27, 2019 •

edited

Loading

Kh4L commented Nov 27, 2019

eric-haibin-lin commented Dec 19, 2019

tqchen commented Dec 19, 2019 •

edited

Loading

tqchen commented Dec 19, 2019 •

edited

Loading

[Discussion] Overhead in MXNet Execution #14883

[Discussion] Overhead in MXNet Execution #14883

Comments

szha commented May 5, 2019

This comment has been minimized.

rvardimon commented Jun 4, 2019

Jerryzcn commented Aug 30, 2019 • edited Loading

Jerryzcn commented Aug 30, 2019 • edited Loading

sxjscience commented Sep 4, 2019

sxjscience commented Sep 9, 2019

sxjscience commented Sep 11, 2019 • edited Loading

szha commented Nov 27, 2019 • edited Loading

Kh4L commented Nov 27, 2019

eric-haibin-lin commented Dec 19, 2019

tqchen commented Dec 19, 2019 • edited Loading

tqchen commented Dec 19, 2019 • edited Loading

Jerryzcn commented Aug 30, 2019 •

edited

Loading

Jerryzcn commented Aug 30, 2019 •

edited

Loading

sxjscience commented Sep 11, 2019 •

edited

Loading

szha commented Nov 27, 2019 •

edited

Loading

tqchen commented Dec 19, 2019 •

edited

Loading

tqchen commented Dec 19, 2019 •

edited

Loading