This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[RFC] Large MXNet source files causing CI build failures #19688

Open
mseth10 opened this issue Dec 17, 2020 · 7 comments
Labels
RFC Post requesting for comments

Comments

@mseth10
Contributor

mseth10 commented Dec 17, 2020

Problem statement

MXNet CI is running out of memory (OOM) [1] while building MXNet binaries for the unix-cpu and unix-gpu stages. This is an intermittent failure, and the workaround is to re-trigger CI a few times. The issue is caused by some of the numpy .cc files being too large, which makes gcc use too much memory. The issue was not pronounced with gcc7, but with the recent update to gcc8 [2] for CI builds, we have started to see this OOM error.

The fix is to refactor the numpy .cc files into smaller files so that compiling each file requires less memory. Here is the list of the largest objects (>10MB in size) currently generated by the Mac CPU build:

 11M	./operator/numpy/linalg/np_norm_backward.cc.o
 11M	./operator/numpy/np_kron.cc.o
 11M	./operator/numpy/random/np_location_scale_op.cc.o
 12M	./operator/numpy/np_insert_op_slice.cc.o
 12M	./operator/numpy/np_insert_op_tensor.cc.o
 13M	./operator/numpy/np_elemwise_broadcast_op_extended_sec.cc.o
 13M	./operator/numpy/np_elemwise_unary_op_basic.cc.o
 13M	./operator/numpy/np_percentile_op.cc.o
 14M	./operator/numpy/np_matrix_op.cc.o
 14M	./operator/numpy/np_moments_op.cc.o
 14M	./operator/numpy/np_where_op.cc.o
 15M	./operator/numpy/np_einsum_op.cc.o
 16M	./operator/numpy/np_elemwise_broadcast_op_extended.cc.o
 21M	./operator/numpy/np_broadcast_reduce_op_value.cc.o
 22M	./operator/numpy/linalg/np_norm_forward.cc.o
 24M	./operator/numpy/np_elemwise_broadcast_op.cc.o
 34M	./operator/numpy/np_elemwise_broadcast_logic_op.cc.o
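For anyone who wants to regenerate a listing like this locally, a small sketch along these lines should work (the build-tree path and the 10 MB threshold are assumptions, not part of MXNet's tooling):

```python
# Sketch: walk a build tree and report object files above a size threshold,
# smallest first, similar to the du-style listing above. The directory
# layout is an assumption; point `root` at your own build output.
import os

def large_objects(root, threshold_mb=10):
    """Return a sorted list of (size_mb, path) for .cc.o files >= threshold."""
    hits = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name.endswith(".cc.o"):
                path = os.path.join(dirpath, name)
                size_mb = os.path.getsize(path) / (1024 * 1024)
                if size_mb >= threshold_mb:
                    hits.append((size_mb, path))
    return sorted(hits)

# Example usage (hypothetical path):
# for size_mb, path in large_objects("build/CMakeFiles/mxnet.dir"):
#     print(f"{size_mb:3.0f}M\t{path}")
```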

The .cc files corresponding to the above objects contain more than 210 operator registrations, and refactoring them into smaller files will require considerable time and effort from the community. At a rate of 5 operators per day, that is more than 40 days of developer effort.

Proposed solutions

Option 1: We keep using gcc8 for CI builds and start refactoring these numpy .cc files. This means the community will have to face the CI failures for about 40 days (possibly less if more community members contribute).

Option 2: We go back to using gcc7 for CI builds, potentially solving the CI problem immediately, while we work on refactoring the numpy files. Reverting to gcc7 would take 2 days, and refactoring would take another 40 days.

I would personally prefer Option 2, because it saves contributors time in getting their PRs merged quickly and it saves CI resources. I would like to request community feedback on this proposal.

Going forward, we also need to add a check to MXNet CI for build-time memory usage. Any ideas would be highly appreciated.
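One possible shape for such a check, sketched in Python (the 6 GB budget and the compiler command are assumptions, not MXNet's actual CI settings; note that `ru_maxrss` is reported in KB on Linux but bytes on macOS):

```python
# Sketch of a CI memory guard: run a build command as a child process and
# compare the peak resident set size of children against a budget. This is
# an illustration, not MXNet's actual CI tooling.
import resource
import subprocess

def run_with_memory_budget(cmd, budget_kb):
    """Run cmd; return (within_budget, peak_kb).

    within_budget is False if the command fails or the peak RSS reported
    for child processes (KB on Linux) exceeds budget_kb.
    """
    proc = subprocess.run(cmd)
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return proc.returncode == 0 and peak_kb <= budget_kb, peak_kb

# Hypothetical usage: fail the build step if one compile needs over 6 GB.
# ok, peak = run_with_memory_budget(
#     ["g++", "-c", "src/operator/numpy/np_kron.cc"], 6 * 1024 * 1024)
```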

References

@mseth10 mseth10 added the RFC Post requesting for comments label Dec 17, 2020
@wkcn
Member

wkcn commented Dec 17, 2020

There is a related issue: #18501
I also prefer Option 2 to solve the CI problem immediately : )

@access2rohit
Contributor

access2rohit commented Dec 17, 2020

Thank you @mseth10 for looking into this !!
I think Option 2 unblocks CI, and thus all open source contributors, immediately, and gives us more time to take the corrective measure of refactoring the files to reduce memory utilization by gcc.

Option 1 may keep us blocked for a long time, and there is uncertainty about how long it will take.

IMHO Option 2 is the way forward :)
@leezu wdyt ?

@Zha0q1
Contributor

Zha0q1 commented Dec 17, 2020

+1 for option 2

@kshitij12345
Contributor

A tangential question:

Should we raise a bug report with the gcc community? It seems that gcc8 uses more resources than gcc7 to handle the same files, which is a regression between those versions.

@leezu
Contributor

leezu commented Dec 17, 2020

How do you arrive at the time estimates? Adding a CC=gcc-7 statement in runtime-functions.sh shouldn't take you 2 days. It should take less than 30 minutes... I'm also a bit worried about a time estimate of 40 days for splitting .cc files into multiple files.

@samskalicky
Contributor

How do you arrive at the time estimates? Adding a CC=gcc-7 statement in runtime-functions.sh shouldn't take you 2 days. It should take less than 30 minutes... I'm also a bit worried about a time estimate of 40 days for splitting .cc files into multiple files.

Effort will certainly vary depending on who does the work, their familiarity with the code, and how long it takes them to ramp up. If we get people familiar with the code, it could be shorter. If we get people unfamiliar with it, 5 per day seems reasonable. If it's someone new to MXNet (and possibly not an expert in C++), it could take longer.

Until we get actual volunteers for the work, and ask them how long they will need to do it, we can't really give hard estimates. So these will have to do for now.

@kpuatamazon
Contributor

kpuatamazon commented Dec 21, 2020

A related problem is excessive code generation. Take np.delete for example.

https://github.com/apache/incubator-mxnet/blob/16e2b15f6e334ca88f29b9c14e55547df2c136fc/src/operator/numpy/np_delete_op-inl.h#L337-L355

That's:

  • MSHADOW_TYPE_SWITCH (input type): 8 types on CPU and 7 types on GPU.
  • MXNET_NDIM_SWITCH: cases 1 through 5.
  • MSHADOW_TYPE_SWITCH (output type): 8 types on CPU and 7 types on GPU.
  • MXNET_ASSIGN_REQ_SWITCH: 2 cases.

That's 8 * 5 * 8 * 2 = 640 ways on CPU and 7 * 5 * 7 * 2 = 490 ways on GPU.
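The arithmetic can be sanity-checked directly (counts taken from the list above; the variable names are just for illustration):

```python
# Number of template instantiations generated by the nested dispatch
# macros in the np.delete path, per the counts listed above.
cpu_types, gpu_types = 8, 7  # MSHADOW_TYPE_SWITCH cases per device
ndims = 5                    # MXNET_NDIM_SWITCH cases 1..5
reqs = 2                     # MXNET_ASSIGN_REQ_SWITCH cases

cpu_cases = cpu_types * ndims * cpu_types * reqs
gpu_cases = gpu_types * ndims * gpu_types * reqs
print(cpu_cases, gpu_cases)  # 640 490
```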

This operator works along a single axis, so the computation reduces to three sizes: the outer loop (the product of dimensions before the axis), the axis in question, and the data after the axis (the product of dimensions after the axis). After this simplification there is no ndim dispatch: arbitrary dimensionality is supported with a factor-of-5 reduction in compilation, down to 128 cases.
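The flattening idea can be sketched like this (names are illustrative, not MXNet code):

```python
# Collapse an arbitrary-ndim shape to (outer, axis_len, inner) so the
# kernel loop structure no longer depends on the number of dimensions.
from functools import reduce
from operator import mul

def flatten_around_axis(shape, axis):
    """Return (outer, axis_len, inner) for an operation along `axis`."""
    outer = reduce(mul, shape[:axis], 1)
    inner = reduce(mul, shape[axis + 1:], 1)
    return outer, shape[axis], inner

print(flatten_around_axis((2, 3, 4, 5), axis=2))  # (6, 4, 5)
```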

In the common case where the input and output types are the same and the output request is kWriteTo, a loop over memory copies is much faster. If we're just copying PODs, the size of the data type can be folded into the size of the data to copy, so all 8 cases of identical input and output types with kWriteTo can be folded into one compilation, reducing 8 cases on CPU (or 7 on GPU) to 1.

On CPU there are 121 cases: one for the plain copy and 120 for some combination of type conversion and/or kAddTo. The same arithmetic on GPU (7 * 7 * 2 = 98 cases, with 7 folded into 1) gives 92 cases.
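The CPU count follows from folding the identical-type kWriteTo cases into one path (a quick check, with illustrative names):

```python
# Residual instantiations after folding the same-type kWriteTo cases
# into a single POD-copy path: types_in * types_out * reqs, minus the
# folded cases, plus one shared copy kernel.
def residual_cases(n_types, n_reqs=2):
    total = n_types * n_types * n_reqs  # full dispatch grid after ndim removal
    folded = n_types                    # identical-type kWriteTo cases
    return total - folded + 1           # one shared copy path remains

print(residual_cases(8))  # 121 cases on CPU
```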

leezu pushed a commit that referenced this issue Sep 20, 2021
To avoid build failures due to large source files.
See #19688