Implement lightweight RTTI for (not only) IR::Node class hierarchies #4377
Speeding up RTTI/ I am wondering whether it would be possible to make use of existing functions in the
    template<typename T>
    const T *to() const {
        struct CastVisitor : Inspector {
            const T *out = nullptr;
            bool preorder(const T *n) override {
                out = n;
                return false;
            }
            bool preorder(const IR::Node *n) override { return false; }  // or whatever is root of the hierarchy
        };
        CastVisitor vis;
        this->apply_visitor_preorder(vis);
        return vis.out;
    }
The idea is to let the generated IR hierarchy/visitor pattern resolve the types for us. With a bit of optimization (a specific lightweight apply for this) it should need just two virtual calls -- this optimization would probably be needed, especially if we wanted to make the non-const
I must say I'm not very happy with such a big change containing not-directly-related changes.
Again, thanks for these contributions, great to see some performance work being done.
Two questions/requests:
- How does this affect compile time? The compiler nowadays has slow compile times, partially because of macro usage that got out of hand.
Overall the implementation provides a solid 20-25% compile time speedup on various inputs. For example, compare gtestp4c runtime (benchmarking using hyperfine, 10 iterations plus a warmup run) on my laptop:
You mean runtime here?
- Is it possible to split out the replacements of dynamic_cast with ICastable into a separate PR? That seems like a safe change and is easier to review. Ditto for all the smallish fixes you implemented on the way.
Please don't. The existing Visitors have very large overhead and GC traffic. I would probably say that they have even more overhead than existing C++ RTTI usage.
The RTTI implementation is just a single virtual function call (using this for
The changes are directly related, sadly. Sorry, I should make it clearer:
I checked the compiler trace profile and I do not see preprocessor being an issue here. Part of the problem is that
The existing visitors have, but the visitor pattern in general should not. That is what I had in mind with the additional apply overload.
Hmm, I see. These fixes could still be done before the RTTI but I can definitely see why they ended in one PR. I will try to have a look at your RTTI mechanism in more detail.
I can factor out this into a separate PR just in case, if this would make review easier.
Yes please! Also because it might take longer to get RTTI things in.
Thanks! It's just
I've taken a first look at rtti.h. I did not play with it to check whether my suggestions work, as I don't have time for that today. I have a few points where I could see some simplifications, but I am not sure if they can all be done.
I believe overall this would be a meaningful change with a good promise. Most of the boilerplate will be generated, so I am not overly concerned about that.
I rebased the PR, so now it includes only RTTI-related changes, as the ICastable prerequisite was merged in.
I think I now understand how this works and I think this is good. Thank you both for the implementation and the explanations.
While I understand that the idea behind p4c was to make the compiler conceptually relatively simple at some performance cost, I believe we should strive to make the compiler faster if the cost of doing so is reasonable. If we want P4 to be popular and the P4 compilers to be based on p4c it needs to be usable and that requires speed too. As @asl said, custom RTTI is used in other places, LLVM being one of the prime compiler examples.
I have a few more comments & questions.
(I did not go over all the changes through the code base, but I've glanced at parts of them).
Thanks! Let's simmer this until Monday before merging?
@ChrisDodd any comments?
static constexpr uint64_t FNV1a(TypeNameHolder str) { return FNV1a(str.str, str.length); }

// TypeIdResolver provides a unified way to getting typeid of the type `T`. Typeid
Nit: Inconsistent use of // and /// here and in other headers. Or I can't tell the system used here. From what I can tell, we use /// nowadays for function/class documentation and // for inline comments.
This was intentional. TypeIdResolver is an internal class, so its description should not be a part of documentation. Still, I wanted to keep some kind of comments.
The difference between /// and // is how it affects Doxygen -- which I can never keep straight
It is disturbing to me that this is more efficient than dynamic_cast -- dynamic_cast should be faster than a virtual function call. I would have thought that C++ compilers would have fixed their poor implementations by now. I'll try to look this over in more detail, but overall it seems acceptable.
Well... not quite, given all the requirements the standard imposes and given that p4c uses the most complex case (multiple inheritance with virtual bases). It's in the same way as the C++ standard essentially demands one particular implementation of
Thanks!
Looks reasonable overall.
It still bothers me that the C++ compiler uses more than a single indirection through the vtable plus a conditional move or branch and a single add instruction for dynamic_cast
        Parents::TypeInfo::dyn_cast(typeId, ptr)...};

    auto it =
        std::find_if(ptrs.begin(), ptrs.end(), [](const void *ptr) { return ptr != nullptr; });
Won't this effectively call the parent dyn_cast for all the parents recursively before looking to see if any are non-null? As soon as a single parent returns a non-null value, you can return that -- no need to check other parents.
I tried this and it seems to yield much worse performance. What happens here is that the compiler happily inlines everything (through parents) and we end up with a quite nice and tight implementation. Adding lazy resolution / early returns here prevents this – too many conditions are added.
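For readers following along, here is a minimal sketch of the eager "ask every parent, then pick the first non-null result" scheme being discussed. It is not the code from lib/rtti.h: the names `TypeInfoOf`, `Node`, `Named`, `Typed`, `Decl` and `Other`, and the hand-assigned numeric ids, are assumptions made up for illustration. The point it shows is that the implicit derived-to-base pointer conversion at each call does the this-adjustment, including across virtual bases.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

namespace sketch {
using TypeId = std::uint64_t;

// Hypothetical helper: describes class This and its direct bases Parents...
template <typename This, typename... Parents>
struct TypeInfoOf {
    static const void *dyn_cast(TypeId typeId, const This *ptr) noexcept {
        if (typeId == This::static_typeId()) return ptr;
        // Eagerly ask every direct base. Passing `ptr` converts it to
        // `const Parent *`, which applies the needed pointer adjustment.
        std::array<const void *, sizeof...(Parents)> ptrs = {
            Parents::TypeInfo::dyn_cast(typeId, ptr)...};
        auto it = std::find_if(ptrs.begin(), ptrs.end(),
                               [](const void *p) { return p != nullptr; });
        return (it != ptrs.end()) ? *it : nullptr;
    }
};

struct Node {
    using TypeInfo = TypeInfoOf<Node>;
    static constexpr TypeId static_typeId() { return 1; }  // made-up id
    virtual ~Node() = default;
    template <typename T>
    const T *to() const noexcept {
        // The void* came from a `const T *`, so static_cast recovers it.
        return static_cast<const T *>(toImpl(T::static_typeId()));
    }

 protected:
    virtual const void *toImpl(TypeId id) const noexcept = 0;
};

struct Named : virtual Node {
    using TypeInfo = TypeInfoOf<Named, Node>;
    static constexpr TypeId static_typeId() { return 2; }
    int nameId = 0;

 protected:
    const void *toImpl(TypeId id) const noexcept override {
        return TypeInfo::dyn_cast(id, this);
    }
};

struct Typed : virtual Node {
    using TypeInfo = TypeInfoOf<Typed, Node>;
    static constexpr TypeId static_typeId() { return 3; }
    int typeIndex = 0;

 protected:
    const void *toImpl(TypeId id) const noexcept override {
        return TypeInfo::dyn_cast(id, this);
    }
};

struct Decl : Named, Typed {
    using TypeInfo = TypeInfoOf<Decl, Named, Typed>;
    static constexpr TypeId static_typeId() { return 4; }

 protected:
    const void *toImpl(TypeId id) const noexcept override {
        return TypeInfo::dyn_cast(id, this);
    }
};

struct Other : virtual Node {
    using TypeInfo = TypeInfoOf<Other, Node>;
    static constexpr TypeId static_typeId() { return 5; }

 protected:
    const void *toImpl(TypeId id) const noexcept override {
        return TypeInfo::dyn_cast(id, this);
    }
};
}  // namespace sketch
```

With a `Decl` object viewed through a `const Named *`, calling `to<Typed>()` performs a cross-cast using one virtual call plus a chain of id comparisons; whether the compiler flattens that chain as described above depends on the optimizer and would need to be checked in a profile.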
    auto it =
        std::find_if(ptrs.begin(), ptrs.end(), [](const void *ptr) { return ptr != nullptr; });
    return (it != ptrs.end()) ? *it : nullptr;
I'm struggling to find a good way of adding a cache here to avoid the recursive parent calls at least some of the time, making this more in line with what the cost of a dynamic_cast should be (one indirection through the vtable + a conditional branch or move + an add), but it would require a bunch of profiling to see if it would be effective. Something like:
    static std::pair<TypeId, intptr_t> cache[kCacheSize];
    if (cache[typeId % kCacheSize].first == typeId)
        return reinterpret_cast<const This *>(reinterpret_cast<intptr_t>(ptr) + cache[typeId % kCacheSize].second);
See above: since we're in constexpr context, the code is decently inlined into tight linear code, so there is no recursion left. Adding some additional stuff like caching or early returns seems to yield worse recursive code in the end :)
 protected:
    [[nodiscard]] virtual const void *toImpl(TypeId typeId) const noexcept = 0;
};
Might be useful to add
    template <typename T>
    static inline const T *to(const Base *ptr) noexcept {
        return ptr ? ptr->to<T>() : nullptr;
    }
That way, most dynamic_cast calls (e.g., those in all the Visitor::visit functions) could be replaced with RTTI::to
We already replaced dynamic_cast in Visitor, maybe I missed something?
/// };
#define DECLARE_TYPEINFO(T, ...)                                                  \
 private:                                                                         \
    static constexpr RTTI::TypeId static_typeId() { return RTTI::InvalidTypeId; } \
Rather than doing all the complex hashing stuff, why not just something like
    static constexpr RTTI::TypeId static_typeId() { return reinterpret_cast<uintptr_t>(&static_typeId); }
That way, the linker will provide a distinct value (address) for each class automatically.
This inhibits many optimizations for is<> or wherever typeid is involved. Having typeid as a compile-time constant (instead of a link-time one) allows for some nice simplifications in the set of comparisons. Another thing is that I'd like to have a more or less stable value, not something that changes with the compiler; with addresses we would potentially have different values after every re-compile...
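To illustrate the compile-time-constant argument, here is a small sketch of a name-hashed type id. It assumes a 64-bit FNV-1a hash over a class name; the helper `fnv1a`, the classes `Foo` and `Bar`, and `nameOf` are hypothetical and only loosely modeled on the `FNV1a` function visible in the diff, not copied from lib/rtti.h.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

using TypeId = std::uint64_t;

// 64-bit FNV-1a over the characters of a name, usable in constant expressions.
constexpr TypeId fnv1a(std::string_view s) {
    TypeId hash = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    for (char c : s) {
        hash ^= static_cast<unsigned char>(c);
        hash *= 0x100000001b3ULL;  // FNV-1a 64-bit prime
    }
    return hash;
}

// Hypothetical classes whose ids are derived from their names.
struct Foo {
    static constexpr TypeId static_typeId() { return fnv1a("Foo"); }
};
struct Bar {
    static constexpr TypeId static_typeId() { return fnv1a("Bar"); }
};

// Because the ids are compile-time constants, they can appear in
// static_assert and as case labels; an address-based id could not.
static_assert(Foo::static_typeId() != Bar::static_typeId());

constexpr std::string_view nameOf(TypeId id) {
    switch (id) {
        case Foo::static_typeId(): return "Foo";
        case Bar::static_typeId(): return "Bar";
        default: return "unknown";
    }
}
```

A hash of the name also stays the same across rebuilds and compilers, which matches the stability concern raised above. A collision between two names would show up as a duplicate case label at compile time in code like `nameOf`.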
It is indeed so. The key issue is that we need to compute the offset dynamically. And here is the problem: C++ RTTI only holds information about immediate bases. Therefore we need to traverse the RTTI for the class hierarchy to compute the offset given source and target classes. Virtual bases make the whole picture more complicated.
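A small sketch of why the offset cannot be a fixed constant per (source, target) pair: the same cross-cast from one static type to another needs a different pointer adjustment depending on the most-derived class. All class names here are made up, and the concrete offsets are implementation-specific; the example only relies on them differing between the two layouts.

```cpp
#include <cassert>
#include <cstddef>

struct Node {
    virtual ~Node() = default;
    int n = 0;
};
struct Named : virtual Node {
    int a = 0;
};
struct Typed : virtual Node {
    int b = 0;
};
// Same bases, opposite declaration order, hence different object layouts.
struct DeclA : Named, Typed {};
struct DeclB : Typed, Named {};

// Byte distance from the Named subobject to the Typed subobject of the
// same complete object, as found by dynamic_cast at run time.
inline std::ptrdiff_t offsetToTyped(const Named *p) {
    const Typed *t = dynamic_cast<const Typed *>(p);
    return reinterpret_cast<const char *>(t) - reinterpret_cast<const char *>(p);
}
```

Calling `offsetToTyped` on a `DeclA` and on a `DeclB`, both viewed as `const Named *`, gives different results, so the runtime has to consult the dynamic type before it can adjust the pointer.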
I am going to land this provided that there will be no last minute objections :)
Go for it! I will disable merging for #4374 because it introduces another ICastable object. That will break.
@fruffy Thanks!
C++ RTTI is heavily used in p4c for type checking and downcasting from IR::Node to a particular descendant. However, the generic RTTI implementation provided by the C++ runtime is very heavy and results in many overheads. Typically RTTI support routines (e.g. dynamic_cast and corresponding typeid checks) are among the top 5 in the p4c execution profile.

Unfortunately, p4c cannot re-use static RTTI as implemented by e.g. LLVM, as the IR::Node class hierarchy contains multiple inheritance with abstract and virtual bases, and cross-casting to a virtual base is normal in p4c.

This PR implements a bit more sophisticated (as compared to LLVM) RTTI implementation for semi-open class hierarchies. It requires some boilerplate code, but the majority of it is autogenerated by ir-generator. Still, manual implementation is also possible, as shown by other class hierarchies.

The major features of this implementation are:
- IR::Node and its descendants are generated by ir-generator and therefore it is possible (in theory) to switch over different node types in addition to other types of polymorphism. Switching over node types in Vector<T> and IndexedVector<T> is possible as well.
- The ->typeId() method is fast; it's just a single virtual function call.
- The ->is<T>() method is usually compiled down to several typeid checks by the compiler and is fast as well.
- ->to<T>() is slightly more elaborate, but still quite fast, as we rely on the compiler to generate the necessary this-adjustment thunks. The implementation fast-paths the case when ->to<T>() returns nullptr.

Overall the implementation provides a solid 20-25% compile time speedup on various inputs. For example, compare gtestp4c runtime (benchmarking using hyperfine, 10 iterations plus a warmup run) on my laptop:
gtestp4c-main
gtestp4c-rtti

The bulk of the implementation is in lib/rtti.h and relies a bit on constexpr trickery. I checked that everything is resolved down to compile-time constants for gcc >= 9 and recent versions of clang.

Overall, nothing should change for users provided that they use methods from ICastable. Direct use of C++ RTTI is supported as well, but should be discouraged due to slowness and overheads.

I also made a few cleanups while there:
- ->to<T>() are converted either to ->checkedTo<T>() or ->as<T>() depending on the usage
- dynamic_cast was systematically replaced by methods of ICastable
- override and const on methods were added to silence compiler warnings
- Json hierarchy was switched to the new RTTI
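As a rough illustration of the first three features listed in the description, here is a hand-written sketch of a tiny hierarchy with a virtual `typeId()`, an `is<T>()` that reduces to a short chain of id comparisons, and a switch over node types. Everything in it (the `Shape`, `Circle`, `Square` classes, the `isA` helper, the numeric ids) is hypothetical and much simpler than what ir-generator emits for IR::Node.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

using TypeId = std::uint64_t;

struct Shape {
    static constexpr TypeId static_typeId() { return 10; }  // made-up id
    virtual ~Shape() = default;
    // One virtual call yields the dynamic type's id.
    virtual TypeId typeId() const noexcept { return static_typeId(); }
    // Each class compares against its own id, then defers to its base with a
    // qualified (non-virtual) call, so the chain can be inlined.
    virtual bool isA(TypeId id) const noexcept { return id == static_typeId(); }
    template <typename T>
    bool is() const noexcept {
        return isA(T::static_typeId());
    }
};

struct Circle : Shape {
    static constexpr TypeId static_typeId() { return 11; }
    TypeId typeId() const noexcept override { return static_typeId(); }
    bool isA(TypeId id) const noexcept override {
        return id == static_typeId() || Shape::isA(id);
    }
};

struct Square : Shape {
    static constexpr TypeId static_typeId() { return 12; }
    TypeId typeId() const noexcept override { return static_typeId(); }
    bool isA(TypeId id) const noexcept override {
        return id == static_typeId() || Shape::isA(id);
    }
};

// Constant ids make it possible to switch over the dynamic type.
inline std::string_view describe(const Shape &s) {
    switch (s.typeId()) {
        case Circle::static_typeId(): return "circle";
        case Square::static_typeId(): return "square";
        default: return "shape";
    }
}
```

For example, a `Circle` passed as `const Shape &` reports `is<Shape>()` and `is<Circle>()` as true and `is<Square>()` as false, and `describe` selects the `"circle"` branch.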