Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tpetra: incorrect behavior when a non-trivial series of moves are happening #11931

Closed
fnrizzi opened this issue May 31, 2023 · 23 comments
Closed
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests

Comments

@fnrizzi
Copy link

fnrizzi commented May 31, 2023

Bug Report

@csiefer2

Description

i am trying to create a "registry" (a struct) holding instances of tpetra block vectors and then move the registry object to another object Foo that owns that registry and use it inside of it. See below the usecase.
if i create the registry directly and move it to the constructor of Foo, all extents printed make sense.
If i create an instance of Foo by calling createFoo, then things do not work.

#include <Tpetra_Core.hpp>
#include <Tpetra_BlockVector.hpp>
#include <Tpetra_Map.hpp>
#include <Teuchos_CommHelpers.hpp>
#include <Tpetra_Map_decl.hpp>


using tcomm    = Teuchos::Comm<int>;
using map_t    = Tpetra::Map<>;
using vec_type = Tpetra::BlockVector<>;

class MyObject{
private:
  map_t map_;
public:
  using state_type = vec_type;

  MyObject(){
    Teuchos::RCP<const tcomm> comm_ = Teuchos::rcp (new Teuchos::MpiComm<int>(MPI_COMM_WORLD));
    const int numGlobalEntries = 15*comm_->getSize();
    map_ = map_t(numGlobalEntries, 0, comm_);
  }
  state_type createState() const{
    state_type v(map_, 5);
    return v;
  }
};

template<class DT1, class DT2>
struct Registry{
  DT1 d1_;
  DT2 d2_;
  DT1 & getA(){ return d1_; }
  DT2 & getB(){ return d2_; }
};

void print(int rank, const std::string & s1, const std::string & s2, vec_type r){
  std::cout << "\n";
  auto tpmv  = r.getVectorView();
  std::cout << s1 << "ext = " << r.getMap()->getGlobalNumElements() << "\n";
  std::cout << s2 << "ext = " << tpmv.getMap()->getGlobalNumElements() << "\n";
}

template<class RegistryType>
class Foo{
  int rank_;
  RegistryType reg_;
public:
  Foo(RegistryType && reg) : reg_(std::move(reg))
  {
    MPI_Comm_rank(MPI_COMM_WORLD, &rank_);
    auto & a = reg_.getA();
    auto & b = reg_.getB();
    print(rank_, "CONSTR-block-A ", "CONSTR-tpetr-A ", a);
    print(rank_, "CONSTR-block-B ", "CONSTR-tpetr-B ", b);
  }

  void dummy() {
    auto & a = reg_.getA();
    auto & b = reg_.getB();
    print(rank_, "DUMMY-block-A ", "DUMMY-tpetr-A ", a);
    print(rank_, "DUMMY-block-B ", "DUMMY-tpetr-B ", b);
  }
};

template<class T>
auto createFoo(const T& system){
  using state_t    = typename T::state_type;
  using registry_t = Registry<state_t, state_t>;
  registry_t reg{system.createState(), system.createState()};
  return Foo<registry_t>(std::move(reg));
}

template<class T>
auto createRegistry(const T & system){
  using state_t    = typename T::state_type;
  using registry_t = Registry<state_t, state_t>;
  registry_t reg{system.createState(), system.createState()};
  return reg;
}

int main(int argc, char **argv)
{
  Tpetra::ScopeGuard tpetraScope (&argc, &argv);
  {
    {
      std::cout << "starting case that works\n";
      MyObject object;
      auto reg = createRegistry(object);
      Foo<decltype(reg)> f(std::move(reg));
      f.dummy();
    }

    {
      std::cout << "\n";
      std::cout << "\n";
      std::cout << "starting case that does NOT work\n";
      MyObject object;
      auto f = createFoo(object);
      f.dummy();
    }
  }

  return 0;
}

prints:

starting case that works

CONSTR-block-A ext = 15
CONSTR-tpetr-A ext = 75

CONSTR-block-B ext = 15
CONSTR-tpetr-B ext = 75

DUMMY-block-A ext = 15
DUMMY-tpetr-A ext = 75

DUMMY-block-B ext = 15
DUMMY-tpetr-B ext = 75


starting case that does NOT work

CONSTR-block-A ext = 15
CONSTR-tpetr-A ext = 75

CONSTR-block-B ext = 15
CONSTR-tpetr-B ext = 75

DUMMY-block-A ext = 15
DUMMY-tpetr-A ext = 1

DUMMY-block-B ext = 15
DUMMY-tpetr-B ext = 75

Notice how in the second part we get DUMMY-tpetr-A ext = 1 instead of 75.

Steps to Reproduce

  1. SHA1: 3cf925b but I found out this happens even with the ef73d14
  2. Configure script:
cmake -D CMAKE_BUILD_TYPE:STRING=Debug -D BUILD_SHARED_LIBS:BOOL=ON -D TPL_FIND_SHARED_LIBS=ON -D Trilinos_LINK_SEARCH_START_STATIC=OFF -D CMAKE_VERBOSE_MAKEFILE:BOOL=TRUE -D TPL_ENABLE_MPI:BOOL=ON -D MPI_C_COMPILER:FILEPATH=/Users/fnrizzi/Desktop/work/tpls/openmpi/install-gcc11/bin/mpicc -D MPI_CXX_COMPILER:FILEPATH=/Users/fnrizzi/Desktop/work/tpls/openmpi/install-gcc11/bin/mpic++ -D MPI_EXEC:FILEPATH= -D MPI_USE_COMPILER_WRAPPERS:BOOL=ON -D Trilinos_ENABLE_Fortran:BOOL=ON -D MPI_Fortran_COMPILER:FILEPATH=/Users/fnrizzi/Desktop/work/tpls/openmpi/install-gcc11/bin/mpifort -D Trilinos_ENABLE_TESTS:BOOL=OFF -D Trilinos_ENABLE_EXAMPLES:BOOL=OFF -D TPL_ENABLE_BLAS=ON -D TPL_ENABLE_LAPACK=ON -D Kokkos_ENABLE_SERIAL:BOOL=ON -D Kokkos_ENABLE_THREADS:BOOL=OFF -D Kokkos_ENABLE_OPENMP:BOOL=OFF -D Kokkos_ENABLE_DEPRECATED_CODE=OFF -D Trilinos_ENABLE_ALL_PACKAGES:BOOL=OFF -D Trilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=OFF -D Trilinos_ENABLE_Teuchos:BOOL=ON -D Trilinos_ENABLE_Epetra:BOOL=ON -D Trilinos_ENABLE_Tpetra:BOOL=ON -D Tpetra_ENABLE_DEPRECATED_CODE:BOOL=OFF -D Tpetra_ENABLE_TSQR:BOOL=ON -D Trilinos_ENABLE_EpetraExt:BOOL=ON -D Trilinos_ENABLE_AztecOO:BOOL=ON -D Trilinos_ENABLE_Ifpack:BOOL=ON -D Trilinos_ENABLE_Ifpack2:BOOL=ON -D Trilinos_ENABLE_ROL:BOOL=ON -D CMAKE_INSTALL_PREFIX:PATH=../install ../Trilinos
  1. Configure log: log.txt
  2. Compiler: gcc11-3.0 and openmpi 4.1.3 (however the code is run with just 1 rank)



Smaller reproducer and a solution that works

I was able to do this:

  • smaller reproducer: taking all the relevant classes from source code and strip out everything that is not needed
    tpetra_moving_bug_reproducer.txt
    This code prints the same as above.

  • one solution to this: i was able to make things work if I change the way the map is stored inside the DistObject class.
    if i store it by value instead of using the rcp, things work:
    tpetra_moving_bug_fix.txt and this prints:

starting case that works

CONSTR-block-A ext = 15
CONSTR-tpetr-A ext = 75

CONSTR-block-B ext = 15
CONSTR-tpetr-B ext = 75

DUMMY-block-A ext = 15
DUMMY-tpetr-A ext = 75

DUMMY-block-B ext = 15
DUMMY-tpetr-B ext = 75


starting case that does NOT work

CONSTR-block-A ext = 15
CONSTR-tpetr-A ext = 75

CONSTR-block-B ext = 15
CONSTR-tpetr-B ext = 75

DUMMY-block-A ext = 15
DUMMY-tpetr-A ext = 75

DUMMY-block-B ext = 15
DUMMY-tpetr-B ext = 75
@fnrizzi fnrizzi added the type: bug The primary issue is a bug in Trilinos code or tests label May 31, 2023
@masterleinad
Copy link
Contributor

Related to #11921.

@jhux2
Copy link
Member

jhux2 commented May 31, 2023

@trilinos/tpetra

@csiefer2
Copy link
Member

csiefer2 commented Jun 6, 2023

@fnrizzi I'm hoping #11946 will address this.

@fnrizzi
Copy link
Author

fnrizzi commented Jun 6, 2023

@csiefer2 thanks for the comment!
my issue above was with the serial backend on, how is that related to the threads backend?

@csiefer2
Copy link
Member

csiefer2 commented Jun 7, 2023

@fnrizzi Apologies for misreading your issue. I'll take a closer look.

@fnrizzi
Copy link
Author

fnrizzi commented Jun 7, 2023

@csiefer2 no worries! as written above, this issue pops up from a very recent sha and also a quite older one so it doees not look like anything new

@csiefer2
Copy link
Member

csiefer2 commented Jun 9, 2023

@fnrizzi My first attempt to translate this into a unit test produced the correct answer (75 instead of 1). I'm going to go try your example against an installed trilinos to see if I can reproduce the error.

@fnrizzi
Copy link
Author

fnrizzi commented Jun 9, 2023

Can you paste here the unit test? Just curious

@csiefer2
Copy link
Member

csiefer2 commented Jun 9, 2023

@fnrizzi When I your exact code against installed Trilinos, I get:

starting case that works

CONSTR-block-A ext = 15
CONSTR-tpetr-A ext = 75

CONSTR-block-B ext = 15
CONSTR-tpetr-B ext = 75

DUMMY-block-A ext = 15
DUMMY-tpetr-A ext = 75

DUMMY-block-B ext = 15
DUMMY-tpetr-B ext = 75


starting case that does NOT work

CONSTR-block-A ext = 15
CONSTR-tpetr-A ext = 75

CONSTR-block-B ext = 15
CONSTR-tpetr-B ext = 75

DUMMY-block-A ext = 15
DUMMY-tpetr-A ext = 9223372039002259455

DUMMY-block-B ext = 15
DUMMY-tpetr-B ext = 1

@fnrizzi
Copy link
Author

fnrizzi commented Jun 9, 2023

Yea that outrageous number appeared to me too , I don't remember when exactly but I saw that as well.

Also, please note that above there are 2 shas that give me the wrong behavior. And one sha is pretty old , I think a year ago or so

@csiefer2
Copy link
Member

csiefer2 commented Jun 9, 2023

I got the above using my VOTD checkout of develop.

And now my unit test code, cut and paste into an external example, generates the results I see above. Weird.

@fnrizzi
Copy link
Author

fnrizzi commented Jun 9, 2023

Yes weird indeed... btw the tentative solution I posted above worked and that seems to indicate it has something to do with the teuchos rcp inside the DistObject base class. Hopefully that helps as smaller reproducer

@brian-kelley
Copy link
Contributor

The bad MultiVector is probably off in uninitialized memory, maybe valgrind will give some useful info about what's going on

@csiefer2
Copy link
Member

csiefer2 commented Jun 9, 2023

As per my discussions with @cwpearson, it appears that BlockVector does not std::move correctly and that this may be the root of your issue.

@csiefer2
Copy link
Member

csiefer2 commented Jun 9, 2023

@brian-kelley The test basically points to old stack variables, so valgrind isn't helpful

@cwpearson
Copy link
Contributor

cwpearson commented Jun 9, 2023

To expand a bit, When you move your registry, you're moving a BlockMultiVector.

A BlockVector is a BlockMultiVector, which has a member MultiVector mv_.
The BlockMultiVector ctor takes a reference to a map, uses that to construct a member map, gets an RCP from that member map with rcpFromRef:

/** \brief Return a non-owning weak RCP object from a raw object reference for
* a defined type.
*
* NOTE: When debug mode is turned on, in general, the type must be defined.
* If the type is undefined, then the function <tt>rcpFromUndefRef()</tt>
* should be called instead.
*
* \relates RCP
*/
template<class T> inline
RCP<T> rcpFromRef(T& r);

and then passes that NON-OWNING rcp to the mv_:

pointMap_ (makePointMap (meshMap, blockSize)),
mv_ (Teuchos::rcpFromRef (pointMap_), numVecs), // nonowning RCP is OK, since pointMap_ won't go away

I think the comment on line 91 is wrong, because pointMap_ may in fact "go away", if the BMV is moved (pointMap_ is a map, not an RCP<map>)!

A fair number of the BlockMultiVector copy- and move- operations are defaulted:

//! Copy constructor (shallow copy).
BlockMultiVector (const BlockMultiVector<Scalar, LO, GO, Node>&) = default;
//! Move constructor (shallow move).
BlockMultiVector (BlockMultiVector<Scalar, LO, GO, Node>&&) = default;
//! Copy assigment (shallow copy).
BlockMultiVector<Scalar, LO, GO, Node>&
operator= (const BlockMultiVector<Scalar, LO, GO, Node>&) = default;
//! Move assigment (shallow move).
BlockMultiVector<Scalar, LO, GO, Node>&
operator= (BlockMultiVector<Scalar, LO, GO, Node>&&) = default;

So when BMV gets moved, the MV will also get moved, but that non-owning RCP inside mv_ will still point back to the map in the moved-from BMV, which may cause an error if the moved-from BMV is overwritten or otherwise invalid (which is permitted after a move).

I'm not sure exactly why this hasn't cropped up before, except that if people have been using RCP<BMV>, it's conceivable they won't have this problem because the heap-allocated BMV is not ever moved.

A big-picture solution might be be to make Teuchos::RCP always owning (a la std::shared_ptr), and have a non-owning pointer be a separate type (std::weak_ptr). Then misusing a non-owning pointer as an owning pointer (as in the multivector map) would be a compile-time error.

@csiefer2
Copy link
Member

csiefer2 commented Jun 9, 2023

The long-term solution I prefer is replacing the constructors with something that takes an RCP of a map and always storing internal shared data with RCPs.

We could band-aid the whole thing with something simpler, however. What's the real end goal for this registery thing?

@fnrizzi
Copy link
Author

fnrizzi commented Jun 10, 2023

@csiefer2

What's the real end goal for this registery thing?

The use case I reported above reproduces exactly what i need: an instance of the registry is something that stores the data and then this registry instance is moved to another object (a Foo instance) who would then own it permanently and manager its lifetime.

Specifically, the real use case is for implementing nonlinear solvers:

https://github.com/Pressio/pressio/tree/9d59e3235d743fdac29ec91a4dab76a3e471efae/include/pressio/solvers_nonlinear

The registry stores a bunch of operators and things we need. This registry is then moved to the actual nonlinear solver object that would own it. Here is a snippet:
https://github.com/Pressio/pressio/blob/9d59e3235d743fdac29ec91a4dab76a3e471efae/include/pressio/solvers_nonlinear/solvers_create_newton.hpp#L107-L150

does this clarify?

@csiefer2
Copy link
Member

@fnrizzi Yeah, it does, though if if I were implementing it I would use some kind of reference-counted pointer rather than rely on std::move doing what you hope for (std::move does not play nicely with pointers). Like I said, I can band-aid this (and will try to do so today). But I want to warn you that you're doing something very few people do (and that we don't test), so there's a very real chance this won't be the only bug like this you see.

@fnrizzi
Copy link
Author

fnrizzi commented Jun 12, 2023

@csiefer2 thanks for the reply!

std::move does not play nicely with pointers

Why std::move should not play nice with pointers? You means with Teuchos RCPs specifically or in general?

there's a very real chance this won't be the only bug like this you see.

yeah that is fine, if I can help finding them, am happy to :)
if the move operations are not well tested in tpetra, that is fine too

@csiefer2
Copy link
Member

csiefer2 commented Jun 12, 2023

Why std::move should not play nice with pointers? You means with Teuchos RCPs specifically or in general?

In general. If you std::move something that a pointer is pointing to without also changing the pointer, then your pointer is now pointing to who knows what.

if the move operations are not well tested in tpetra, that is fine too

Yeah, they're not well tested at all (to some degree because nobody uses them). The Block stuff is the least likely to work correctly, since that keeps a lot of references in addition to smart pointers.

@fnrizzi

Copy link

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Jun 12, 2024
Copy link

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Jul 13, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests
Projects
Status: Done
Development

No branches or pull requests

6 participants