Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seg Fault on malloc failure/out of memory #117

Closed
fegennari opened this issue Nov 17, 2021 · 22 comments
Closed

Seg Fault on malloc failure/out of memory #117

fegennari opened this issue Nov 17, 2021 · 22 comments

Comments

@fegennari
Copy link

I find that flat_hash_map will sometimes cause a seg fault when the memory limit of the process is reached and malloc() returns null. The seg fault is on the free() call made on the nested container in the map called from the flat_hash_map destructor. The following block of code will consistently seg fault when inserted into my larger project if I use "ulimit -v" to set a process memory limit:
phmap::flat_hash_map< size_t, string > hm;
for( size_t i = 0; i < 1000000000; ++i ) { hm.insert( std::make_pair( i, "sfaasdfgasdgdfgsdhfghfgsdgfdgsdfgsdfgdfgfgsfdg" ) ); }
However, this code appears to work fine and throw std::bad_alloc when I put these lines in a standalone binary. So it has something to do with the context or state of the program at the time when this code is run.

A few more notes:

  • This only seems to happen when the map key or value are types with internal allocations such as std::vector or std::string.
  • Changing the hash_map type to unordered_map or spp::sparse_hash_map result in the expected std::bad_alloc.
  • Changing the hash_map type to google::sparse_hash_map results in the error "sparsehash FATAL ERROR: failed to allocate 32 groups" and call to exit().
  • Passing in a custom allocator that checks for a null return from malloc() inside the allocate() call and throws std::bad_alloc fixes the problem.

The stack trace starting at the flat_hash_map destructor is:
#0 0x000000000173776d in __exchange_and_add (__val=-1, __mem=0xfffffffffffffff8) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/ext/atomicity.h:82
#1 __exchange_and_add_dispatch (__val=-1, __mem=0xfffffffffffffff8) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/ext/atomicity.h:82
#2 __exchange_and_add_dispatch (__val=-1, __mem=0xfffffffffffffff8) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/ext/atomicity.h:78
#3 _M_dispose (__a=..., this=0xffffffffffffffe8) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/bits/basic_string.h:3311
#4 _M_dispose (__a=..., this=0xffffffffffffffe8) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/bits/basic_string.h:3295
#5 ~basic_string (this=0x7fffef1a0028, __in_chrg=) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/bits/basic_string.h:3706
#6 ~pair (this=0x7fffef1a0020, __in_chrg=) at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/bits/stl_pair.h:208
#7 destroy<std::pair<unsigned long, std::basic_string > > (this=0x7fffffffc648, __p=0x7fffef1a0020)
at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/ext/new_allocator.h:153
#8 destroy<std::pair<unsigned long, std::basic_string > > (__a=..., __p=0x7fffef1a0020)
at /grid/common/pkgsData/gcc-v9.3.0p1/Linux/RHEL6.0-2013-x86_64/include/c++/9.3.0/bits/alloc_traits.h:497
#9 destroy_impl<std::allocator<std::pair<unsigned long const, std::basic_string > >, std::pair<unsigned long, std::basic_string > > (a=..., p=0x7fffef1a0020)
at ../src/parallel_hashmap/phmap_base.h:1542
#10 destroy<std::pair<unsigned long, std::basic_string > > (a=..., p=0x7fffef1a0020) at ../src/parallel_hashmap/phmap_base.h:1500
#11 destroy<std::allocator<std::pair<unsigned long const, std::basic_string > > > (alloc=0x7fffffffc648, slot=0x7fffef1a0020) at ../src/parallel_hashmap/phmap_base.h:4695
#12 destroy<std::allocator<std::pair<unsigned long const, std::basic_string > > > (alloc=0x7fffffffc648, slot=0x7fffef1a0020) at ../src/parallel_hashmap/phmap.h:3996
#13 destroy<std::allocator<std::pair<unsigned long const, std::basic_string > > > (alloc=0x7fffffffc648, slot=0x7fffef1a0020) at ../src/parallel_hashmap/phmap_base.h:479
#14 phmap::priv::raw_hash_set<phmap::priv::FlatHashMapPolicy<unsigned long, std::string>, phmap::Hash, phmap::EqualTo, std::allocator<std::pair<unsigned long const, std::string> > >::destroy_slots (this=0x7fffffffc620) at ../src/parallel_hashmap/phmap.h:1906
#15 0x0000000000c34eaf in ~raw_hash_set (this=0x7fffffffc620, __in_chrg=) at ../src/parallel_hashmap/phmap.h:4397
#16 ~raw_hash_map (this=0x7fffffffc620, __in_chrg=) at ../src/parallel_hashmap/phmap.h:2228
#17 ~flat_hash_map (this=0x7fffffffc620, __in_chrg=) at ../src/parallel_hashmap/phmap.h:4397

@fegennari
Copy link
Author

fegennari commented Nov 18, 2021

Okay, I think I understand what the problem is. The failure happens inside raw_hash_set::resize(). capacity_ is set to new_capacity on phmap.h line 1925. Then initialize_slots() is called. The call to Allocate() on line 1894 throws std::bad_alloc, which means that the line that sets ctrl_ to the new allocated size on line 1895 never gets run. Then when the call stack is unwound due to the std::bad_alloc exception, destroy_slots() is called on this object. The iteration over capacity_ on line 1904 is using the newer, larger capacity_ value. However, ctrl_ has not be re-allocated to a larger size. This iteration indexes off the end of valid memory for ctrl_[i] and reads garbage, which may result in a call to PolicyTraits::destroy() for a position of slots_ that's off the end of the memory allocation. This results in free() being called on invalid memory and the seg fault.

It seems like a correct fix is to pass new_capacity into initialize_slots() and only update capacity_ after the allocations have succeeded. Or maybe ctrl_ should be updated and reset_ctrl() called before the call to Allocate(). Or maybe this can still fail if MakeLayout() can throw std::bad_alloc? I'm going to experiment to see if I can find a simple fix.

@greg7mdp
Copy link
Owner

Cool, thanks for investigating this @fegennari . Do you have a suggestion to address this issue? If you submit a PR I'll be happy to merge it in.

@greg7mdp
Copy link
Owner

Oh just read the rest of your comment. Thanks for looking into it further!

@fegennari
Copy link
Author

I tried to set capacity_ inside initialize_slots(), but that still seg faults. I think many of the member functions use that capacity_ field so I'm not sure where this should be set. The goal is to leave all of the member variables in some consistent state if Allocate() throws so that destroy_slots() doesn't access anything invalid or leak memory. I'm not quite sure I understand exactly what all of these different pieces do and what depends on what.

I don't think reset_ctrl() can be called after setting ctrl_ but before setting slots_ because it operates on both variables. I'm not sure what else in the code uses these functions so I would prefer not to randomly change them.

@fegennari
Copy link
Author

I think my capacity_ change fixed that failure, and I'm seeing a different failure mode. I'm just running this in a loop trying to make it fail because it seems like failures are somewhat rare. The new case throws std::bad_alloc for an insert() call when allocating/copying the std::string in emplace_decomposable() around line 1800. I'm not entirely sure what's going wrong. Maybe find_or_prepare_insert() is making space for the data but it's not getting written because the emplace_at() call fails, and then the data is again garbage inside the destroy_slots() call where it again fails.

It seems quite difficult to guarantee this all works correctly.

@fegennari
Copy link
Author

I think the new failure is the same type of problem. prepare_insert() calls set_ctrl() on the element to "enable" it before the data has been copied. Then when the allocation fails the slot data is invalid. This is okay for simple POD data but not a shared_ptr or a vector. If destroy_slots() sees that ctrl_ bit has been set then it tries the delete the object even if it's invalid. The general problem here is that it's only legal to set these ctrl_ bits after the object has successfully been copied. I'm not sure how many places in the code this can happen or how to go about fixing them all cleanly.

One more thing. In the raw_hash_set constructor that takes a bucket_cnt on phmap.h line 985, it seems like the reset_growth_left() call isn't needed because this is done in initialize_slots(). Removing this makes it easier to pass the return value of NormalizeCapacity(bucket_cnt) directly into initialize_slots() to fix the first issue. Removing this call appears to work, though I'm not sure any of my tests use that constructor.

greg7mdp added a commit that referenced this issue Nov 19, 2021
@greg7mdp
Copy link
Owner

Hi @fegennari , please try the new version I just pushed in. You are right that the issue was updating capacity_ before the memory allocation, leaving an inconsistent state if the memory allocation throwed. So I delayed changing it until after the allocation. Thank you so much for the thorough investigation!

@fegennari
Copy link
Author

Thanks! That appears to have fixed the original failure, but I still see the failure related to prepare_insert(). I'm less confident I understand that problem. See my previous two comments.

@greg7mdp
Copy link
Owner

Yes, I think you understand the problem and explained it well. I see the issue, but I'm not quite sure how to fix it cleanly. Interestingly the google's Abseil-cpp hash maps from which phmap is derived have the same issue.
Actually, I just had an idea to fix it. Please try the latest version.

greg7mdp added a commit that referenced this issue Nov 19, 2021
@greg7mdp
Copy link
Owner

Sorry, I reverted my last change which did not seem to work, even though it did when I tested locally. I'll think it over.

@fegennari
Copy link
Author

Thanks for trying but you're right, it doesn't work:
eclair: ../src/parallel_hashmap/phmap.h:2143: void phmap::priv::raw_hash_set<Policy, Hash, Eq, Alloc>::emplace_at(size_t, Args&& ...) [with Args = {const std::piecewise_construct_t&, std::tuple<long unsigned int&&>, std::tuple<const char*&&>}; Policy = phmap::priv::FlatHashMapPolicy<long unsigned int, std::basic_string >; Hash = phmap::Hash; Eq = phmap::EqualTo; Alloc = std::allocator<std::pair<const long unsigned int, std::basic_string > >; size_t = long unsigned int]: Assertion `PolicyTraits::apply(FindElement{*this}, *iterator_at(i)) == iterator_at(i) && "constructed value does not match the lookup key"' failed.

@fegennari
Copy link
Author

I found one somewhat hacky way of fixing it. I replaced the construct() call in emplace_at() with a try/catch block like this:
try {
PolicyTraits::construct(&alloc_ref(), slots_ + i, std::forward(args)...);
}
catch( std::bad_alloc const &e ) {
// mark this slot as empty so that the dtor doesn't seg fault on invalid data
set_ctrl(i, kEmpty);
throw e;
}

This catches std::bad_alloc and marks the slot as empty so that the destructor is never called on it. I don't think this works in general due to various possible issues:

  • Can this lead to a memory leak of other allocations associated with this slot?
  • What about the constructor calling some other exception? (I use a custom allocator that always throws bad_alloc when malloc() return null.)
  • Is this portable? Maybe some header is needed at the top to define std::bad_alloc.
  • Is there runtime overhead for this approach? As far as I can tell there's no observable runtime difference.

But this does appear to fix the problem. I can run this in a loop with various values of ulimit and I've never seen it seg fault. I'm not sure if there are other similar control flow paths that still fail in classes/member functions that I'm not testing. For example, I haven't tried to figure out what lazy_emplace_at() does and if that flow needs some similar fix.

If you have a better fix I can try it out.

[Background on why I need this: We have a job system that runs various tasks from a higher level script to create in-memory databases for analysis. The user specifies a memory limit for each one to avoid swapping, etc. When the limit is hit, the process returns a special error return code that the job system can use to skip failed jobs or reschedule with different options to reduce memory usage. Seg fault does not result in the expected return code and tends to break the top-level flow.]

@greg7mdp
Copy link
Owner

Yes, I did think of doing that, and it is definitely a valid solution, but phmap doesn't throw any exception right now, and I think some users rely on this so I don't want to introduce it.
What about you keep this change in your version for now, and I'll try to see if I can find a solution which doesn't rely on catching and throwing exceptions.

@greg7mdp
Copy link
Owner

Your use case is perfectly valid, and I do need to fix this!

@fegennari
Copy link
Author

Yes my solution isn't the best one for all users. Our codebase uses exceptions everywhere already, which is why I'm okay with this. I wanted to put the fix directly in emplace_at() rather than having to touch all of the callers of that function. If you can find a more general solution that works, I'll take your changes instead. Thanks.

greg7mdp added a commit that referenced this issue Apr 16, 2023
when constructing an object to be inserted throws std::bad_alloc, the slot was mark as used
(even though the object was not properly constructed), so eventually the destructor of a
non-initialized object was called causing a segfault.

Solution: mark the slot used only after the object is successfully constructed.
@greg7mdp
Copy link
Owner

Better late than never. I think the issue should be fixed now.

@fegennari
Copy link
Author

Great, thanks! I haven't run into that problem since I posted the original issue, so it doesn't seem too common. I'll see if I can find my original test case and check to see if the new version of phmap fixes it.

@greg7mdp
Copy link
Owner

I would really appreciate if you can find the test case and try it, thanks! Maybe you haven't seen it again because you use your modification of phmap described above to avoid it?

@fegennari
Copy link
Author

I'm not sure what I currently have the code doing. I'll check when I'm at work on Monday. Thanks.

@greg7mdp
Copy link
Owner

Cool, thanks.

You did mention that earlier.

I found one somewhat hacky way of fixing it. I replaced the construct() call in emplace_at() with a try/catch block like this:

try {
    PolicyTraits::construct(&alloc_ref(), slots_ + i, std::forward(args)...);
}
catch( std::bad_alloc const &e ) {
    // mark this slot as empty so that the dtor doesn't seg fault on invalid data
    set_ctrl(i, kEmpty);
    throw e;
}

If you still use that, it should not be necessary anymore with the current repo version.

@fegennari
Copy link
Author

Yes, it looks like I still had that try/catch wrapper fix. I updated to the latest parallel_hashmap and overwrote my changes, and it appears to work now. I don't remember exactly what my previous failing test case was, but the simple string insert failure I included in my initial issue report is working now. I also ran some other tests that worked. It looks like the problem is fixed. Thanks!

@greg7mdp
Copy link
Owner

Awesome, thanks a lot for checking it out @fegennari!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants