-
-
Notifications
You must be signed in to change notification settings - Fork 235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Seg Fault on malloc failure/out of memory #117
Comments
Okay, I think I understand what the problem is. The failure happens inside raw_hash_set::resize(). capacity_ is set to new_capacity on phmap.h line 1925. Then initialize_slots() is called. The call to Allocate() on line 1894 throws std::bad_alloc, which means that the line that sets ctrl_ to the new allocated size on line 1895 never gets run. Then when the call stack is unwound due to the std::bad_alloc exception, destroy_slots() is called on this object. The iteration over capacity_ on line 1904 is using the newer, larger capacity_ value. However, ctrl_ has not be re-allocated to a larger size. This iteration indexes off the end of valid memory for ctrl_[i] and reads garbage, which may result in a call to PolicyTraits::destroy() for a position of slots_ that's off the end of the memory allocation. This results in free() being called on invalid memory and the seg fault. It seems like a correct fix is to pass new_capacity into initialize_slots() and only update capacity_ after the allocations have succeeded. Or maybe ctrl_ should be updated and reset_ctrl() called before the call to Allocate(). Or maybe this can still fail if MakeLayout() can throw std::bad_alloc? I'm going to experiment to see if I can find a simple fix. |
Cool, thanks for investigating this @fegennari . Do you have a suggestion to address this issue? If you submit a PR I'll be happy to merge it in. |
Oh just read the rest of your comment. Thanks for looking into it further! |
I tried to set capacity_ inside initialize_slots(), but that still seg faults. I think many of the member functions use that capacity_ field so I'm not sure where this should be set. The goal is to leave all of the member variables in some consistent state if Allocate() throws so that destroy_slots() doesn't access anything invalid or leak memory. I'm not quite sure I understand exactly what all of these different pieces do and what depends on what. I don't think reset_ctrl() can be called after setting ctrl_ but before setting slots_ because it operates on both variables. I'm not sure what else in the code uses these functions so I would prefer not to randomly change them. |
I think my capacity_ change fixed that failure, and I'm seeing a different failure mode. I'm just running this in a loop trying to make it fail because it seems like failures are somewhat rare. The new case throws std::bad_alloc for an insert() call when allocating/copying the std::string in emplace_decomposable() around line 1800. I'm not entirely sure what's going wrong. Maybe find_or_prepare_insert() is making space for the data but it's not getting written because the emplace_at() call fails, and then the data is again garbage inside the destroy_slots() call where it again fails. It seems quite difficult to guarantee this all works correctly. |
I think the new failure is the same type of problem. prepare_insert() calls set_ctrl() on the element to "enable" it before the data has been copied. Then when the allocation fails the slot data is invalid. This is okay for simple POD data but not a shared_ptr or a vector. If destroy_slots() sees that ctrl_ bit has been set then it tries the delete the object even if it's invalid. The general problem here is that it's only legal to set these ctrl_ bits after the object has successfully been copied. I'm not sure how many places in the code this can happen or how to go about fixing them all cleanly. One more thing. In the raw_hash_set constructor that takes a bucket_cnt on phmap.h line 985, it seems like the reset_growth_left() call isn't needed because this is done in initialize_slots(). Removing this makes it easier to pass the return value of NormalizeCapacity(bucket_cnt) directly into initialize_slots() to fix the first issue. Removing this call appears to work, though I'm not sure any of my tests use that constructor. |
Hi @fegennari , please try the new version I just pushed in. You are right that the issue was updating capacity_ before the memory allocation, leaving an inconsistent state if the memory allocation throwed. So I delayed changing it until after the allocation. Thank you so much for the thorough investigation! |
Thanks! That appears to have fixed the original failure, but I still see the failure related to prepare_insert(). I'm less confident I understand that problem. See my previous two comments. |
Yes, I think you understand the problem and explained it well. I see the issue, but I'm not quite sure how to fix it cleanly. Interestingly the google's Abseil-cpp hash maps from which phmap is derived have the same issue. |
Sorry, I reverted my last change which did not seem to work, even though it did when I tested locally. I'll think it over. |
Thanks for trying but you're right, it doesn't work: |
I found one somewhat hacky way of fixing it. I replaced the construct() call in emplace_at() with a try/catch block like this: This catches std::bad_alloc and marks the slot as empty so that the destructor is never called on it. I don't think this works in general due to various possible issues:
But this does appear to fix the problem. I can run this in a loop with various values of ulimit and I've never seen it seg fault. I'm not sure if there are other similar control flow paths that still fail in classes/member functions that I'm not testing. For example, I haven't tried to figure out what lazy_emplace_at() does and if that flow needs some similar fix. If you have a better fix I can try it out. [Background on why I need this: We have a job system that runs various tasks from a higher level script to create in-memory databases for analysis. The user specifies a memory limit for each one to avoid swapping, etc. When the limit is hit, the process returns a special error return code that the job system can use to skip failed jobs or reschedule with different options to reduce memory usage. Seg fault does not result in the expected return code and tends to break the top-level flow.] |
Yes, I did think of doing that, and it is definitely a valid solution, but phmap doesn't throw any exception right now, and I think some users rely on this so I don't want to introduce it. |
Your use case is perfectly valid, and I do need to fix this! |
Yes my solution isn't the best one for all users. Our codebase uses exceptions everywhere already, which is why I'm okay with this. I wanted to put the fix directly in emplace_at() rather than having to touch all of the callers of that function. If you can find a more general solution that works, I'll take your changes instead. Thanks. |
when constructing an object to be inserted throws std::bad_alloc, the slot was mark as used (even though the object was not properly constructed), so eventually the destructor of a non-initialized object was called causing a segfault. Solution: mark the slot used only after the object is successfully constructed.
Better late than never. I think the issue should be fixed now. |
Great, thanks! I haven't run into that problem since I posted the original issue, so it doesn't seem too common. I'll see if I can find my original test case and check to see if the new version of phmap fixes it. |
I would really appreciate if you can find the test case and try it, thanks! Maybe you haven't seen it again because you use your modification of phmap described above to avoid it? |
I'm not sure what I currently have the code doing. I'll check when I'm at work on Monday. Thanks. |
Cool, thanks. You did mention that earlier. I found one somewhat hacky way of fixing it. I replaced the construct() call in emplace_at() with a try/catch block like this:
If you still use that, it should not be necessary anymore with the current repo version. |
Yes, it looks like I still had that try/catch wrapper fix. I updated to the latest parallel_hashmap and overwrote my changes, and it appears to work now. I don't remember exactly what my previous failing test case was, but the simple string insert failure I included in my initial issue report is working now. I also ran some other tests that worked. It looks like the problem is fixed. Thanks! |
Awesome, thanks a lot for checking it out @fegennari! |
I find that flat_hash_map will sometimes cause a seg fault when the memory limit of the process is reached and malloc() returns null. The seg fault is on the free() call made on the nested container in the map called from the flat_hash_map destructor. The following block of code will consistently seg fault when inserted into my larger project if I use "ulimit -v" to set a process memory limit:
phmap::flat_hash_map< size_t, string > hm;
for( size_t i = 0; i < 1000000000; ++i ) { hm.insert( std::make_pair( i, "sfaasdfgasdgdfgsdhfghfgsdgfdgsdfgsdfgdfgfgsfdg" ) ); }
However, this code appears to work fine and throw std::bad_alloc when I put these lines in a standalone binary. So it has something to do with the context or state of the program at the time when this code is run.
A few more notes:
The stack trace starting at the flat_hash_map destructor is:
#0 0x000000000173776d in __exchange_and_add (__val=-1, __mem=0xfffffffffffffff8) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/ext/atomicity.h:82
#1 __exchange_and_add_dispatch (__val=-1, __mem=0xfffffffffffffff8) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/ext/atomicity.h:82
#2 __exchange_and_add_dispatch (__val=-1, __mem=0xfffffffffffffff8) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/ext/atomicity.h:78
#3 _M_dispose (__a=..., this=0xffffffffffffffe8) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/bits/basic_string.h:3311
#4 _M_dispose (__a=..., this=0xffffffffffffffe8) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/bits/basic_string.h:3295
#5 ~basic_string (this=0x7fffef1a0028, __in_chrg=) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/bits/basic_string.h:3706
#6 ~pair (this=0x7fffef1a0020, __in_chrg=) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/bits/stl_pair.h:208
#7 destroy<std::pair<unsigned long, std::basic_string > > (this=0x7fffffffc648, __p=0x7fffef1a0020)
at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/ext/new_allocator.h:153
#8 destroy<std::pair<unsigned long, std::basic_string > > (__a=..., __p=0x7fffef1a0020)
at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/bits/alloc_traits.h:497
#9 destroy_impl<std::allocator<std::pair<unsigned long const, std::basic_string > >, std::pair<unsigned long, std::basic_string > > (a=..., p=0x7fffef1a0020)
at ../src/parallel_hashmap/phmap_base.h:1542
#10 destroy<std::pair<unsigned long, std::basic_string > > (a=..., p=0x7fffef1a0020) at ../src/parallel_hashmap/phmap_base.h:1500
#11 destroy<std::allocator<std::pair<unsigned long const, std::basic_string > > > (alloc=0x7fffffffc648, slot=0x7fffef1a0020) at ../src/parallel_hashmap/phmap_base.h:4695
#12 destroy<std::allocator<std::pair<unsigned long const, std::basic_string > > > (alloc=0x7fffffffc648, slot=0x7fffef1a0020) at ../src/parallel_hashmap/phmap.h:3996
#13 destroy<std::allocator<std::pair<unsigned long const, std::basic_string > > > (alloc=0x7fffffffc648, slot=0x7fffef1a0020) at ../src/parallel_hashmap/phmap_base.h:479
#14 phmap::priv::raw_hash_set<phmap::priv::FlatHashMapPolicy<unsigned long, std::string>, phmap::Hash, phmap::EqualTo, std::allocator<std::pair<unsigned long const, std::string> > >::destroy_slots (this=0x7fffffffc620) at ../src/parallel_hashmap/phmap.h:1906
#15 0x0000000000c34eaf in ~raw_hash_set (this=0x7fffffffc620, __in_chrg=) at ../src/parallel_hashmap/phmap.h:4397
#16 ~raw_hash_map (this=0x7fffffffc620, __in_chrg=) at ../src/parallel_hashmap/phmap.h:2228
#17 ~flat_hash_map (this=0x7fffffffc620, __in_chrg=) at ../src/parallel_hashmap/phmap.h:4397
The text was updated successfully, but these errors were encountered: