Excessive memory allocation without static_alloc #12116

Closed
safrooze opened this issue Aug 10, 2018 · 11 comments

Comments

@safrooze
Contributor

safrooze commented Aug 10, 2018

Description

The change in #11951 that fixes nested calls on CachedOp causes excessive memory allocation when hybridize() is called with the default static_alloc=False. In my specific case, memory allocation grows from 1.5GB to over 10GB.

Environment info (Required)

----------Python Info----------
Version      : 3.4.5
Compiler     : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
Build        : ('default', 'Jul  2 2016 17:47:47')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 18.0
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p34/lib/python3.4/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.0
Directory    : /home/ec2-user/src/mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.9.93-41.60.amzn1.x86_64-x86_64-with-glibc2.2.5
system       : Linux
node         : ip-172-31-73-235
release      : 4.9.93-41.60.amzn1.x86_64
version      : #1 SMP Fri Apr 13 21:58:27 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2699.945
BogoMIPS:              4600.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7
----------Network Test----------
Setting timeout: 10
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0023 sec, LOAD: 0.0982 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1248 sec, LOAD: 0.4074 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0031 sec, LOAD: 0.1043 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1631 sec, LOAD: 0.4245 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0184 sec, LOAD: 0.5672 sec.
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0014 sec, LOAD: 0.4493 sec.

I'm using the Python package.

Build info (Required if built from source)

This is reproducible with the package from PyPI: pip install --pre -U mxnet-cu90mkl==1.3.0b20180801. The previous version (1.3.0b20180726) doesn't have this problem.

I also confirmed that the exact commit in #11951 caused this regression by building from source both before and at that commit.

Compiler: gcc
MXNet commit hash - BROKEN: ed20304
MXNet commit hash - GOOD: 98a41af

Build config:
(Paste the content of config.mk, or the build command.)

Minimum reproducible example

Don't have one yet. Might be able to come up with one if it's necessary.

Steps to reproduce

  1. Build a hybrid network.
  2. Call hybridize().
  3. Check memory usage using nvidia-smi.
  4. Repeat without calling hybridize(), or by calling hybridize(static_alloc=True), and the problem goes away (see the sketch below).
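No minimal example is given above, but a hedged sketch of these steps might look like the following; the network architecture, input shape, and iteration count are illustrative assumptions rather than the reporter's actual model:

import mxnet as mx
from mxnet.gluon import nn

ctx = mx.gpu(0)

# Illustrative hybrid network; any HybridBlock should work for this test.
net = nn.HybridSequential()
net.add(nn.Conv2D(64, kernel_size=3, padding=1),
        nn.Activation('relu'),
        nn.GlobalAvgPool2D(),
        nn.Dense(10))
net.initialize(ctx=ctx)

# Default dynamic allocation; compare against net.hybridize(static_alloc=True).
net.hybridize()

x = mx.nd.random.uniform(shape=(32, 3, 224, 224), ctx=ctx)
for _ in range(200):        # watch nvidia-smi while this loop runs
    out = net(x)
    out.wait_to_read()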

What have you tried to solve it?

Nothing yet.

@safrooze
Contributor Author

@zheng-da @szha

@KellenSunderland
Contributor

KellenSunderland commented Aug 10, 2018

I ran the following code:

import mxnet as mx
import gluoncv
from time import sleep

# Load a pretrained model onto the GPU and hybridize it with the default settings.
net = gluoncv.model_zoo.get_model('cifar_resnet20_v1', pretrained=True, ctx=mx.gpu(0))
net.hybridize()
# Sleep so the GPU memory usage can be read off nvidia-smi.
sleep(10)

With commit ed20304 I observed a GPU memory usage of 2.5GB. Re-running the same test with commit 98a41af, I observed a memory usage of 2.5GB.

@safrooze
Contributor Author

safrooze commented Aug 10, 2018

@KellenSunderland I should have been more specific. The memory increase happens during inference: it grows with each forward call and eventually stabilizes. I haven't had a chance to create a minimum reproducible example yet.
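A hedged sketch of such a reproduction, extending the snippet above with the forward passes it omits; the batch size, input shape, and loop count are illustrative assumptions (cifar_resnet20_v1 expects CIFAR-style 32x32 inputs):

# Continuing from the snippet above (net already loaded and hybridized on mx.gpu(0)):
x = mx.nd.random.uniform(shape=(64, 3, 32, 32), ctx=mx.gpu(0))
for _ in range(500):        # watch nvidia-smi during this loop
    out = net(x)
    out.wait_to_read()      # block until each forward pass completes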

@safrooze
Contributor Author

@piiswrong Can you take a look?

@lanking520
Member

@mxnet-label-bot please label this as [operator, memory]

@srochel
Contributor

srochel commented Aug 14, 2018

Kellen, can you please provide more details? I don't understand your message:
"With commit ed20304 I observed a GPU memory usage of 2.5GB. Re-running the same test with commit 98a41af, I observed a memory usage of 2.5GB."
What is the difference, and how can one reproduce it?

@KellenSunderland
Contributor

KellenSunderland commented Aug 14, 2018 via email

@piiswrong
Contributor

Can you demonstrate a case where it actually fails with OOM?
We have a memory pool that caches freed memory, so the memory usage you see in nvidia-smi may not be the same as the actual usage.
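A hedged illustration of that point (the array size and sleep duration are arbitrary): even after an NDArray is freed, nvidia-smi typically still reports the memory, because it is returned to MXNet's pool rather than to the driver.

import mxnet as mx
from time import sleep

x = mx.nd.ones((4096, 4096), ctx=mx.gpu(0))   # ~64MB float32 allocation
x.wait_to_read()
del x                     # freed into MXNet's memory pool, not released to the driver
mx.nd.waitall()
sleep(10)                 # nvidia-smi still shows the cached memory during this window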

@szha
Member

szha commented Aug 14, 2018

You can do export MXNET_GPU_MEM_POOL_RESERVE=100 to disable the memory pool.
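For example, a minimal sketch of that suggestion (the variable has to be set before MXNet makes its first GPU allocation, so setting it before the import is safest; the workload below is illustrative):

import os
# Reserving 100% means the pooled storage manager keeps nothing cached,
# effectively disabling the pool, per the suggestion above.
os.environ['MXNET_GPU_MEM_POOL_RESERVE'] = '100'

import mxnet as mx
x = mx.nd.ones((1024, 1024), ctx=mx.gpu(0))   # illustrative allocation
x.wait_to_read()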

@safrooze
Contributor Author

safrooze commented Aug 14, 2018

The network does indeed run out of memory with a larger loop count.

mxnet.base.MXNetError: [23:50:33] src/storage/./pooled_storage_manager.h:119: cudaMalloc failed: out of memory

I also tried setting MXNET_GPU_MEM_POOL_RESERVE=100, and that shows an interesting behavior: the peak memory usage doesn't change, but at the point where memory would typically stabilize, it resets back to ~4GB, climbs back up, resets again, and keeps repeating this pattern. Needless to say, performance is also a lot slower (~4x) because of the continuous memory allocations.

I should mention that inference is composed of two hybridized networks. For each inference instance, the first network is called once and the second network is then called several times with fixed input shapes. The peak memory usage is a function of the number of times the second network is called (i.e. the number of loop iterations). Without setting MXNET_GPU_MEM_POOL_RESERVE, if the network doesn't run out of memory for an inference instance, the memory utilization (i.e. the buffer pool size) stabilizes and stays constant for subsequent inference runs.
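A hedged sketch of the two-network inference pattern described here; the stand-in networks, shapes, and loop counts are hypothetical placeholders rather than the actual models:

import mxnet as mx
from mxnet.gluon import nn

ctx = mx.gpu(0)

# Two stand-in hybrid networks (hypothetical; the actual models were not shared).
net1 = nn.HybridSequential()
net1.add(nn.Dense(256, activation='relu'))
net2 = nn.HybridSequential()
net2.add(nn.Dense(256, activation='relu'))
for net in (net1, net2):
    net.initialize(ctx=ctx)
    net.hybridize()                      # default static_alloc=False

num_instances, num_steps = 20, 50        # illustrative loop counts
for _ in range(num_instances):           # one "inference instance" per iteration
    x = mx.nd.random.uniform(shape=(8, 256), ctx=ctx)
    h = net1(x)                          # first network: called once
    for _ in range(num_steps):           # second network: called repeatedly
        h = net2(h)                      #   with a fixed input shape
    h.wait_to_read()
# Per the report, peak pool size scales with num_steps; once memory covers one
# full instance, it stays roughly constant across subsequent instances.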

zheng-da mentioned this issue Aug 15, 2018
@Roshrini
Member

@safrooze Can we close this issue now that the PR has been merged?
