Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement lightweight RTTI for (not only) IR::Node class hierarchies #4377

Merged
merged 12 commits into from
Feb 13, 2024

Conversation

asl
Copy link
Contributor

@asl asl commented Jan 30, 2024

C++ RTTI is heavily used in p4c for type checking and downcasting from IR::Node to a particular descendant. However, generic RTTI implementation as provided by C++ runtime is very heavy and results in many overheads. Typically RTTI support routines (e.g. dynamic_cast and corresponding typeid checks) are among top 5 in p4c execution profile.

Unfortunately, p4c cannot re-use static RTTI as implemented by e.g. LLVM as IR::Node class hierarchy contains multiple inheritance with abstract and virtual bases and cross-casting to virtual base is normal in p4c.

This PR implements a bit more sophisticated (as compared by LLVM) RTTI implementation for semi-open class hierarchies. It requires some boilerplate code, but the majority of it is autogenerated by ir-generator. Still, manual implementation is also possible as shown by other class hierarchies.

The major features of this implementation are:

  • typeid for a given class could be either specified manually or auto-generated by a compiler
  • typeid for IR::Node and its descendants are generated by ir-generator and therefore it is possible (in theory) to switch over different node types in addition to other type of polymorphism. Switching over node types in Vector<T> and IndexedVector<T> is possible as well.
  • ->typeId() method is fast, it's just a single virtual function call
  • ->is<T>() method is usually compiled down to several typeid checks by a compiler and is fast as well
  • ->to<T>() is slightly more elaborated, but still quite fast as we rely on compiler to generate necessary this-adjustment thunks, the implementation fast-paths the case when ->to<T>() returns nullptr.

Overall the implementation provides a solid 20-25% compile time speedup on various inputs. For example, compare gtestp4c runtime (benchmarking using hyperfine, 10 iterations plus a warmup run) on my laptop:

Command Mean [s] Min [s] Max [s] Relative
gtestp4c-main 22.671 ± 0.191 22.367 22.988 1.31 ± 0.01
gtestp4c-rtti 17.245 ± 0.095 17.132 17.417 1.00

The bulk implementation is in lib/rtti.h and relies a bit on constexpr trickery. I checked that everything is resolved down to compile-time constants for gcc >= 9 and recent versions of clang.

Overall, nothing should be changed for the users provided that they use methods from ICastable. Direct use of C++ RTTI is supported as well, but should be discouraged due to slowness and overheads.

I also made few cleanups while there:

  • Unchecked ->to<T>() are converted either to ->checkedTo<T>() or ->as<T>() depending on the usage
  • dynamic_cast was systematically replaced by methods of ICastable
  • Some missed override and const on methods were added to silence compiler warnings
  • json loader was refreshed a bit while porting to Json hierarchy to new RTTI

@asl asl force-pushed the custom-rtti branch 6 times, most recently from e12f829 to 24c02d0 Compare January 30, 2024 08:28
@fruffy fruffy added core Topics concerning the core segments of the compiler (frontend, midend, parser) run-ubuntu18 Use this tag to trigger a Ubuntu-18 CI run. run-validation Use this tag to trigger a Validation CI run. run-sanitizer Use this tag to run a Clang+Sanitzers CI run. labels Jan 30, 2024
@vlstill
Copy link
Contributor

vlstill commented Jan 30, 2024

Speeding up RTTI/to/as/is is definitely a good idea (and it would also be a good idea to go over the code and replace very common is+to combo by just to). That said, I am wondering if this is just too complex, especially for the node hierarchy (BTW. I did not yet decode how this works, this will definitely need a documentation if we want to take this approach).

I am wondering whether it would be possible to make use of existing functions in the Node hierarchy, namely the mechanism used for visitors. After all, visitors do dynamic type resolution and it is much cheaper there. I think something like this should be doable (take this as a pseudo code, it is not tested):

template<typename T>
const T *to() const {
    struct CastVisitor : Inspector {
        const T *out = nullptr;
        bool preorder(const T *n) override {
            out = n;
            return false;
        }
        bool preorder(const IR::Node *n) overrinde { return false; } // or whatever is root of the hierarchy
    };
    CastVisitor vis;
    this->apply_visitor_preorder(vis);
    return vis.out;
}

The idea is to let the generated IR hierarchy/visitor pattern resolved the types for us. With a bit of optimization (a specific light-weight apply for this) it should need just two virtual calls -- this optimization would probably be needed if especially if we wanted to make the non-const to fast too. A care must be taken this works for non-generated classes as well, ideally with minimal boilerplate, a fallback to dynamic_cast for them might go 99.9% the way we need.

I also made few cleanups while there:

  • Unchecked ->to() are converted either to ->checkedTo() or ->as() depending on the usage
  • dynamic_cast was systematically replaced by methods of ICastable
  • Some missed override and const on methods were added to silence compiler warnings
  • json loader was refreshed a bit while porting to Json hierarchy to new RTTI

I must say I'm not very happy with such a big change containing not-directly-related changes.

Copy link
Collaborator

@fruffy fruffy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, thanks for these contributions, great to see some performance work being done.

Two questions/requests:

  1. How does this affect compile time? The compiler nowadays has slow compile times, partially because of macro usage that got out of hand.

Overall the implementation provides a solid 20-25% compile time speedup on various inputs. For example, compare gtestp4c runtime (benchmarking using hyperfine, 10 iterations plus a warmup run) on my laptop:

You mean runtime here?

  1. It is possible to split out the replacements of dynamic_cast with ICastable into a separate PR? That seems like a safe change and is easier to review. Ditto for all the smallish fixes you implemented on the way.

docs/C++.md Outdated Show resolved Hide resolved
@fruffy fruffy requested a review from ChrisDodd January 30, 2024 14:32
@asl
Copy link
Contributor Author

asl commented Jan 30, 2024

I am wondering whether it would be possible to make use of existing functions in the Node hierarchy, namely the mechanism used for visitors. After all, visitors do dynamic type resolution and it is much cheaper there. I think something like this should be doable (take this as a pseudo code, it is not tested):

Please don't. The existing Visitors have very large overhead and GC traffic. I would probably say that they have even more overhead than existing C++ RTTI usage.

it should need just two virtual calls -- this optimization would probably be needed if especially if we wanted to make the non-const to fast too.

The RTTI implementation is just a single virtual function call (using this for this adjustment due to virtual bases use).

I must say I'm not very happy with such a big change containing not-directly-related changes.

The changes are directly related, sadly. Sorry, I should make it clearer:

  • The code used a mixture of ICastable and dynamic_cast calls. So, I had to unify it
  • The code contained 3 or 4 duplicates of ICastable functionality, again this should be cleaned
  • In few places dynamic_cast was used to cast away constness. With ICastable interface this was a compiler error, so had to be fixed

@asl
Copy link
Contributor Author

asl commented Jan 30, 2024

You mean runtime here?

Compiler runtime, yes. We are improving the compilation time here which is quite.. long even for small inputs. Here is the example profile of P4CParserUnroll.switch_20160512 from unit test:

Screenshot 2024-01-30 at 08 58 48

Red arrow indicates C++ RTTI calls that totals to 31% of overall runtime (most of GC traffic is from Visitors boilerplate, ignore it for now).

Overall, as you can see, there is no real code executed here. The runtime is dominated by boilerplate ;)

@asl
Copy link
Contributor Author

asl commented Jan 30, 2024

  1. How does this affect compile time? The compiler nowadays has slow compile times, partially because of macro usage that got out of hand.

I checked the compiler trace profile and I do not see preprocessor being an issue here. Part of the problem is that ir.h is just one huge file that is included everywhere, we can probably discuss this in the separate issue. So far I do not see anywhere where rtti.h would pop up in the compiler profile. Note that typeid's are statically generated by ir-generator for the whole IR::Node hierarchy and therefore the possible compile impact of constexpr fallbacks are negligible there.

@vlstill
Copy link
Contributor

vlstill commented Jan 30, 2024

I am wondering whether it would be possible to make use of existing functions in the Node hierarchy, namely the mechanism used for visitors. After all, visitors do dynamic type resolution and it is much cheaper there. I think something like this should be doable (take this as a pseudo code, it is not tested):

Please don't. The existing Visitors have very large overhead and GC traffic. I would probably say that they have even more overhead than existing C++ RTTI usage.

The existing visitors have, but the visitor pattern in general should not. That is what I had in mind with the additional apply overload.


I must say I'm not very happy with such a big change containing not-directly-related changes.

The changes are directly related, sadly. Sorry, I should make it clearer:

  • The code used a mixture of ICastable and dynamic_cast calls. So, I had to unify it

  • The code contained 3 or 4 duplicates of ICastable functionality, again this should be cleaned

  • In few places dynamic_cast was used to cast away constness. With ICastable interface this was a compiler error, so had to be fixed

Hmm, I see. These fixes could still be done before the RTTI but I can definitely see why they ended in one PR.


I will try to have a look at your RTTI mechanism in more detail.

@asl
Copy link
Contributor Author

asl commented Jan 30, 2024

Hmm, I see. These fixes could still be done before the RTTI but I can definitely see why they ended in one PR.

I can factor out this into a separate PR just in case, if this would make review easier.

@fruffy
Copy link
Collaborator

fruffy commented Jan 30, 2024

I can factor out this into a separate PR just in case, if this would make review easier.

Yes please! Also because it might take longer to get RTTI things in.

@asl
Copy link
Contributor Author

asl commented Jan 30, 2024

I will try to have a look at your RTTI mechanism in more detail.

Thanks! It's just lib/rtti.h.

Copy link
Contributor

@vlstill vlstill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken a first look at rtti.h, I did not play with it to check if some of my suggestions work as I don't have time that today. I have a few points where I could see some simplifications, but I am not sure if they can be all done.

I believe overall this would be a meaningful change with a good promise. Most of the boilerplate will be generated, so I am not overly concerned about that.

docs/C++.md Outdated Show resolved Hide resolved
lib/rtti.h Outdated Show resolved Hide resolved
lib/rtti.h Show resolved Hide resolved
lib/rtti.h Show resolved Hide resolved
lib/rtti.h Show resolved Hide resolved
ir/indexed_vector.h Outdated Show resolved Hide resolved
lib/rtti.h Show resolved Hide resolved
@asl asl force-pushed the custom-rtti branch 2 times, most recently from 4197440 to 7ab6d33 Compare February 1, 2024 22:06
@asl
Copy link
Contributor Author

asl commented Feb 1, 2024

I rebased the PR, so now it includes only RTTI-related changes as ICastable pre-requisite was merged in.

Copy link
Contributor

@vlstill vlstill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I now understand how this works and I think this is good. Thank you both for the implementation and the explanations.

While I understand that the idea behind p4c was to make the compiler conceptually relatively simple at some performance cost, I believe we should strive to make the compiler faster if the cost of doing so is reasonable. If we want P4 to be popular and the P4 compilers to be based on p4c it needs to be usable and that requires speed too. As @asl said, custom RTTI is used in other places, LLVM being one of the prime compiler examples.

I have few more comment & questions.

(I did not go over all the changes through the code base, but I've glanced at parts of them).

lib/rtti.h Show resolved Hide resolved
lib/rtti.h Show resolved Hide resolved
lib/rtti.h Show resolved Hide resolved
tools/ir-generator/irclass.cpp Outdated Show resolved Hide resolved
lib/rtti.h Outdated Show resolved Hide resolved
ir/indexed_vector.h Outdated Show resolved Hide resolved
lib/rtti.h Show resolved Hide resolved
lib/rtti.h Outdated Show resolved Hide resolved
lib/rtti.h Show resolved Hide resolved
backends/bmv2/common/controlFlowGraph.h Show resolved Hide resolved
@asl asl requested a review from vlstill February 8, 2024 20:03
@asl
Copy link
Contributor Author

asl commented Feb 8, 2024

@vlstill @fruffy @grg Any further comments? Thanks!

@fruffy
Copy link
Collaborator

fruffy commented Feb 9, 2024

@vlstill @fruffy @grg Any further comments? Thanks!

Will try to give a final look tomorrow.

Copy link
Collaborator

@fruffy fruffy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Let's simmer this until Monday before merging?
@ChrisDodd any comments?


static constexpr uint64_t FNV1a(TypeNameHolder str) { return FNV1a(str.str, str.length); }

// TypeIdResolver provides a unified way to getting typeid of the type `T`. Typeid
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Inconsistent use of // and /// here and in other headers. Or I can't tell the system used here. From what I can tell, we use /// nowadays for function/class documentation and // for inline comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was intentional. TypeIdResolver is an internal class, so its description should not be a part of documentation. Still, I wanted to keep some kind of comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between /// and // is how it affects Doxygen -- which I can never keep straight

@ChrisDodd
Copy link
Contributor

It is disturbing to me that this is more efficient than dyamic_cast -- dynamic_cast should be faster than a virtual function call. I would have thought that C++ compilers would have fixed their poor implementations by now.

I'll try to look this over in more detail, buy overall it seems acceptable.

@asl
Copy link
Contributor Author

asl commented Feb 11, 2024

It is disturbing to me that this is more efficient than dyamic_cast -- dynamic_cast should be faster than a virtual function call. I would have thought that C++ compilers would have fixed their poor implementations by now.

Well... not quite given all the requirements standard imposes and given that p4c uses the most complex case (multiple inheritance with virtual bases). It's in the same way as C+ standard essentially demands for one particular implementation of std::unordered_map banning open addressing, etc. See https://github.com/llvm/llvm-project/blob/main/libcxxabi/src/private_typeinfo.cpp#L897 as an example of current implementation

I'll try to look this over in more detail, buy overall it seems acceptable.

Thanks!

Copy link
Contributor

@ChrisDodd ChrisDodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable overall.

It still bothers me that the C++ compiler uses more than a single indirection through the vtable plus a conditional move or branch and a single add instruction for dynamic_cast

Parents::TypeInfo::dyn_cast(typeId, ptr)...};

auto it =
std::find_if(ptrs.begin(), ptrs.end(), [](const void *ptr) { return ptr != nullptr; });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this effectively call the parent dyn_cast for all the parents recursively before looking to see if any are non-null? As soon as a single parent returns a non-null value, you can return that -- no need to check other parents.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this and it seems to yield much worse performance. What happens here is that compiler is happily inlines everything (through parents) and we end with quite nice and tight implementation. Adding lazy resolution / early returns here prevents this – too much conditions are added.


auto it =
std::find_if(ptrs.begin(), ptrs.end(), [](const void *ptr) { return ptr != nullptr; });
return (it != ptrs.end()) ? *it : nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to find a good way of adding a cache here to avoid the recursive parent calls at least some time, making this more in line of what the cost of a dynamic_cast should be (one indirection through the vtable + a conditional branch or move + an add), but it would require a bunch of profiling to see if it would be effective. Something like:

static std::pair<TypeId, intptr_t> cache[kCacheSize];
if (cache[typeId % kCacheSize].first == typeId])
    return reinterpret_cast<const This *>(reinterpret_cast<intptr_t>(ptr) + cache[typeId % kCacheSize].second);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above: since we're in constexpr context the code is decently inlined into a tight linear code, so there is no recursion left. Adding some additional stuff like caching or early returns seem to yield worse recursive code in the end :)

protected:
[[nodiscard]] virtual const void *toImpl(TypeId typeId) const noexcept = 0;
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be useful to add

template <typename T>
static inline const T *to(const Base *ptr) noexcept {
    return ptr ? ptr->to<T>() : nullptr;
}

That way, most dynamic_cast calls (eg, those in all the Visitor::visit functions) could be replaced with RTTT::to

Copy link
Contributor Author

@asl asl Feb 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already replaced dynamic_cast in Visitor, maybe I missed something?

/// };
#define DECLARE_TYPEINFO(T, ...) \
private: \
static constexpr RTTI::TypeId static_typeId() { return RTTI::InvalidTypeId; } \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than doing all the complex hasing stuff, why not just something like

static constexpr RTTI::TypeId static_typeId() { return reinterpret_cast<uintptr_t>(&static_typeId); }

That way, the linker will provide a distinct value (address) for each class automatically.

Copy link
Contributor Author

@asl asl Feb 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This inhibits many optimizations for is<> or where typeid is involved. Having typeid as a compile-time constant (instead of link-time) allows for some nice simplifications in the set of comparisons. Another things is that I'd like to have more or less stable value, not something that changes with the compiler, here we're having different values after every re-compile potentially...

@asl
Copy link
Contributor Author

asl commented Feb 11, 2024

It still bothers me that the C++ compiler uses more than a single indirection through the vtable plus a conditional move or branch and a single add instruction for dynamic_cast

It is indeed so. The key issue is that we need to compute the offset dynamically. And here is the problem as RTTI only holds information about immediate bases. Therefore we need to traverse RTTI for class hierarchy to compute the offset given source and target classes. Virtual bases makes whole picture more complicated.

@asl
Copy link
Contributor Author

asl commented Feb 12, 2024

I am going to land this provided that there will be no last minute objections :)

@fruffy
Copy link
Collaborator

fruffy commented Feb 12, 2024

I am going to land this provided that there will be no last minute objections :)

Go for it! I will disable merging for #4374 because it introduces another ICastable object. That will break.

@asl
Copy link
Contributor Author

asl commented Feb 12, 2024

@fruffy Thanks!

@asl asl enabled auto-merge (squash) February 12, 2024 22:42
@asl asl merged commit 0fa9087 into p4lang:main Feb 13, 2024
16 checks passed
@asl asl deleted the custom-rtti branch February 27, 2024 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Topics concerning the core segments of the compiler (frontend, midend, parser) run-sanitizer Use this tag to run a Clang+Sanitzers CI run. run-ubuntu18 Use this tag to trigger a Ubuntu-18 CI run. run-validation Use this tag to trigger a Validation CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants