Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make SlotImpl detachable from its owner, and add a new runOnAllThreads interface to Slot. #8135

Merged
merged 15 commits into from
Sep 11, 2019

Conversation

stevenzzzz
Copy link
Contributor

@stevenzzzz stevenzzzz commented Sep 3, 2019

Description:
See the issue in #7902, this PR is to make the SlotImpl detachable from its owner, by introducing a Booker object wraps around a SlotImpl, which bookkeeps all the on-the-fly update callbacks. And on its destruction, if there are still on-the-fly callbacks, move the SlotImpl to an deferred-delete queue, instead of destructing the SlotImpl which may cause an SEGV error.

More importantly, introduce a new runOnAllThreads(ThreadLocal::UpdateCb cb) API to Slot, which requests a Slot Owner to not assume that the Slot or its owner will out-live (in Main thread) the fired on-the-fly update callbacks, and should not capture the Slot or its owner in the update_cb.

  • Introducing an new runOnAllThreads(UpdateCb) api to Slot interface, for which a update callback is
    expected to receive the currently stored ThreadLocal object, the callback may update or return a new ThreadLocal object that will be stored in the slot.

  • A Bookkeeper wraps around SlotImpl, which bookkeeps how many outgoing callbacks are on-the-fly. On its destruction been called in main thread, if the number doesn't go to zero, it will put the wrapped SlotImpl and its use count into a deffered delete queue.

  • In TLS InstanceImpl, add a deferred delete queue, a scheduleCleanup method will post a callback to main dispatcher. The callback deletes SlotImpl from the deferred delete queue, and reschedule another cleanup if the queue is not empty.

  • Picked RDS and config-providers-framework as examples to demonstrate that this change works. {i.e., changed from the runOnAllThreads(Event::PostCb) to the new runOnAllThreads(TLS::UpdateCb) interface. }

If the PR looks good, I will migrate existing (probably not all?) call-sites of runOnAllThreads to the new form in a follow up PR.

Risk Level: Medium
Testing: unit test
Docs Changes: N/A
Release Notes: N/A
[Optional Fixes #Issue] #7902

… to it.

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@stevenzzzz
Copy link
Contributor Author

/assign @mattklein123

@stevenzzzz
Copy link
Contributor Author

@htuch FYI. :)

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is an interesting solution. I've flushed out some comments assuming we go with this solution.

Stepping back though, can you refresh my memory on why this complexity is needed vs. just cancelling any potential updates before we start listener shutdown which I think would be serialized correctly? Thank you!

/wait

include/envoy/thread_local/thread_local.h Outdated Show resolved Hide resolved
include/envoy/thread_local/thread_local.h Outdated Show resolved Hide resolved
source/common/thread_local/thread_local_impl.h Outdated Show resolved Hide resolved
source/common/config/config_provider_impl.h Outdated Show resolved Hide resolved
void ConfigSubscriptionCommonBase::applyConfigUpdate(const ConfigUpdateCb& update_fn,
const Event::PostCb& complete_cb) {
tls_->runOnAllThreads(
[update_fn](ThreadLocal::ThreadLocalObjectSharedPtr previous)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: const ThreadLocal::ThreadLocalObjectSharedPtr&

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to express the idea that: ThreadLocal::Slot::UpdateCb is expected to receive data stored in the slot, and do-whatever-necessary, then return the new/updated version data to be saved in the slot. A non-const ThreadLocalObjectSharedPtr seems better under this notion, WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By passing by value you are essentially just forcing a copy in most cases. Personally, I think it's more common to make a new thing than use the previous, so I would pass by const-ref, and then if a copy needs to be made it can be done inside the function to put the cost there.

source/common/router/rds_impl.cc Outdated Show resolved Hide resolved
main_thread_dispatcher_->post([this]() {
deferred_deletes_.remove_if(
[](std::unique_ptr<SlotHolder>& holder) -> bool { return holder->isRecycleable(); });
if (!deferred_deletes_.empty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer a loop vs. recursion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me update the misleading comment.
This method will reschedule another cleanup task in the main thread dispatcher Q. Not exactly recursive.
We can't use a loop here as it basically means a lock/wait mechanism (on the ongoing update callbacks been finished).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this makes more sense now. Can you add a bunch more comments inside recycle and this function to describe what is going on and the various interleavings?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping on adding more comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, missed this.
I added some more comments around the function header. and also recycle.

holder.reset();
return;
}
deferred_deletes_.emplace_back(std::move(holder));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing tracks when this holder is eventually recyclable, right? Meaning, in this case it will stick around the deferred delete list until something else gets scheduled for cleanup? (Since presumably this schedule cleanup can race with the thing actually being recyclable). Maybe this is fine? Or maybe you should be checking for use count on the worker threads and scheduling cleanup if its detached and ready to be recycled (the last one)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's tracked:

  1. A cleanup task is scheduled when we put a new item into the deferred delete queue.
  2. The cleanup task will cleanup recyclable items from the Q, and reschedule another cleanup task(the recursion) if there are any non-recyclable ones.

So once an item is put into the queue, the above event-loop (1->2->2->2->....) will guarantee the items been cleaned up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stepping back though, can you refresh my memory on why this complexity is needed vs. just cancelling any potential updates before we start listener shutdown which I think would be serialized correctly? Thank you!

/wait

I think the canceling/waiting approach requires more changes to the code base. Basically, if we want to be lock free, we'd have to split the LDS update (or Server shutdown) into 2 phases:

  1. set an flag or broadcast canceling signals to workers.
  2. Wait to be signaled that the 'shutdown' message has been received by everyone.

In the meantime, if it's a LDS update caused ListenerImpl teardown, we have to bookkeep all the TLS resources owned by "this" ListenerImpl, and only wait on them to be cleaned up, as a RDS subscription (or any other subscription is shared between Listeners).

On the other hand, this change limit changes to tls module only, least disturbance to the code base.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, if we want to be lock free, we'd have to split the LDS update (or Server shutdown) into 2 phases

To be clear, this is how it already works. See ListenerManagerImpl::drainListener which already handles draining and stopping a listener, including posting a completion back and forth. Assuming all xDS updates are stopped before removal (or drain) by definition there can be no destruction race condition. What I'm trying to get at is how hard is it really to just shut down LDS/RDS/etc. updates for a listener on the main thread before doing this? Naively it doesn't seem that hard.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked to @stevenzzzz offline and we are going to time box trying to implement explicit shutdown of resources prior to listener shutdown. This might be accomplished via the init manager or some similar structure. If it ends up being too hard or has complications we can come back to this.

/wait

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be in favor of not repeating something like init manager. Init manager is has turned out tobe hard to reason about, with a maze of callbacks, watchers, registration conditions, lifetime considerations and so on. History speaks for itself with all the bugs and races on this one.

I think there is a good argument to be made that the solution in this PR has benefits in that it simplifies the contract around slot ownership, e.g. the runOnAllThreads() gets the threads out of the business of worrying about tls_ slow ownership and lifetime. It also centralizes all the complexity inside of the SlotImpl, whereas init manager style solutions diffuse this complexity.

So, my $0.02 is that something like the solution in this PR is a nice direction; basically getting away from having threads worrying about TLS slots directly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throwing out another idea that has come up when chatting with @stevenzzzz; we could have a centralized (owned by Server::Instance) TLS slot manager. This will be the actual owner of the memory backing all slots and it lives forever. It would hand out slot handles and then have them returned via RAII.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, my $0.02 is that something like the solution in this PR is a nice direction; basically getting away from having threads worrying about TLS slots directly.

I think with this solution there are going to be potentially hard to reason about thread interleavings, and I have no doubt we will have a different set of issues from this direction, though I agree it's a nice solution and it will work in the end. My reason for proposing the use of the existing init manager system is that it already exists. We must register things into the init manager, and I think it would be relatively trivial to also register a shutdown hook of some type at the same time.

With that said, I don't feel that strongly about it, so if you both want to continue with this solution we can do that.

we could have a centralized (owned by Server::Instance) TLS slot manager.

This is the way it already works?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good discussion here.
Personally I am not a fan of the init-manager mechanism. It's essentially a barrier that's passed around and hard to track what's been barriered and what's(should) not.

The other problem I want to call out here is that no matter which way we go, we need to let contributors know about this problem, and we should not distributed (by capturing some shared_ptr) ownership of a tls slot or its owner to worker threads.
That has the side effect of destructing the delicate dependency graph of objects in ServerImpl and cause crash during server shutdown (IIUC, per my experience though we drain the libevent message queue when shutdown a worker, we don't prevent new callbacks been posted to Dispatcher's callback queue). An example would be post a
a callback which captures a shared_ptr to workers. On destruction, The singletonManager is destructed first, and then the RdsConfigSubscription owned by RdsRouteConfigProvider will try to remove it self from the destructed RdsConfigProviderManager.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other problem I want to call out here is that no matter which way we go, we need to let contributors know about this problem, and we should not distributed (by capturing some shared_ptr) ownership of a tls slot or its owner to worker threads.

This is my point exactly. Although I agree Init/ShutdownManager is problematic in certain ways, its extremely explicit on what needs to happen. Basically, don't do any more updates. We are trading that for rules around how this can be used which are hard to understand and enforce. With that, I will stop there. @htuch I assume you want to move forward with this solution?

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@mattklein123
Copy link
Member

@stevenzzzz thanks for the fixes, but before we continue the review can we discuss my comment here:

Stepping back though, can you refresh my memory on why this complexity is needed vs. just cancelling any potential updates before we start listener shutdown which I think would be serialized correctly? Thank you!

I want to check and make sure we are picking the correct high level solution.

*/
void applyConfigUpdate(
const ConfigUpdateCb& update_fn, const Event::PostCb& complete_cb = []() {}) {
tls_->runOnAllThreads(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove tls_ as a member?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I follow your question, may I ask why?
tls_ is per subscription in config-provider framework, depends on config type (delta/non-delta), it creates per-worker/x-worker-shared ConfigImpl, which is shared among providers.

Copy link
Contributor Author

@stevenzzzz stevenzzzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx!

*/
void applyConfigUpdate(
const ConfigUpdateCb& update_fn, const Event::PostCb& complete_cb = []() {}) {
tls_->runOnAllThreads(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I follow your question, may I ask why?
tls_ is per subscription in config-provider framework, depends on config type (delta/non-delta), it creates per-worker/x-worker-shared ConfigImpl, which is shared among providers.

holder.reset();
return;
}
deferred_deletes_.emplace_back(std::move(holder));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good discussion here.
Personally I am not a fan of the init-manager mechanism. It's essentially a barrier that's passed around and hard to track what's been barriered and what's(should) not.

The other problem I want to call out here is that no matter which way we go, we need to let contributors know about this problem, and we should not distributed (by capturing some shared_ptr) ownership of a tls slot or its owner to worker threads.
That has the side effect of destructing the delicate dependency graph of objects in ServerImpl and cause crash during server shutdown (IIUC, per my experience though we drain the libevent message queue when shutdown a worker, we don't prevent new callbacks been posted to Dispatcher's callback queue). An example would be post a
a callback which captures a shared_ptr to workers. On destruction, The singletonManager is destructed first, and then the RdsConfigSubscription owned by RdsRouteConfigProvider will try to remove it self from the destructed RdsConfigProviderManager.

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks. Flushing out a few more comments.

/wait

source/common/router/rds_impl.cc Outdated Show resolved Hide resolved
source/common/router/rds_impl.cc Outdated Show resolved Hide resolved
main_thread_dispatcher_->post([this]() {
deferred_deletes_.remove_if(
[](std::unique_ptr<SlotHolder>& holder) -> bool { return holder->isRecycleable(); });
if (!deferred_deletes_.empty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this makes more sense now. Can you add a bunch more comments inside recycle and this function to describe what is going on and the various interleavings?

source/common/thread_local/thread_local_impl.h Outdated Show resolved Hide resolved
source/common/thread_local/thread_local_impl.h Outdated Show resolved Hide resolved
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@stevenzzzz
Copy link
Contributor Author

/assign alyssawilk

@alyssawilk PTAL, this PR is inspired by you suggested options.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM with some remaining comments.

Also, is it possible for you to add an integration test that crashes without this fix? That would be optimal. Thank you!

/wait

source/common/thread_local/thread_local_impl.h Outdated Show resolved Hide resolved
main_thread_dispatcher_->post([this]() {
deferred_deletes_.remove_if(
[](std::unique_ptr<SlotHolder>& holder) -> bool { return holder->isRecycleable(); });
if (!deferred_deletes_.empty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping on adding more comments.

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM modulo clang-tidy fixes and in integration test for the original problem. Thank you!

/wait

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@stevenzzzz
Copy link
Contributor Author

Thanks, LGTM with some remaining comments.

Also, is it possible for you to add an integration test that crashes without this fix? That would be optimal. Thank you!

/wait

So previously the scoped_rds_integration_test is actually flaky, it fails around 1/3000 as there was deletion of a RDS resource immediately after a update to it, but the race window is really small.
I can add a regression test that will reproduce the race, but I don't a good way to reproduce the crash in the fixed code.

mattklein123
mattklein123 previously approved these changes Sep 7, 2019
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for iterating here and fully discussing the solution. Will defer to @alyssawilk for further review.

Copy link
Contributor

@lambdai lambdai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Bookeeper is inspiring WRT extending the ability of Slot.

I am not sure if the SlotHolder is too complex since it is kind of re-inventing shared_ptr. What's more the Bookeeper combining with SlotHolder is duplicating the deleter of shared_ptr.

My understanding of the bookkeeper is to track

  • how many runOnAllThread on the fly
  • The last of Bookkeeper destruction and runOnallThread should destruct the real slot

I propose to replace SlotHolder by shared_ptr, and bookeeper should
1 ref the shared_ptr until destruction (shared_ptr as a member)
2 increase the ref of shared_ptr at runOnAllThread (no more complex than capture it)
3. choose one from 3.1 and 3.2
3.1 BookKeeper set the deleter of the shared_ptr to moving to defer delete queue
3.2 rewrite the BookKeeper so that runOnAllThread enforce a oncomplete_cb to decrease the shared_ptr on main thread. See https://github.com/envoyproxy/envoy/pull/7011/files . 3.2 doesn't even need a defer delete queue.

My proposal not decrease the complexity of SlotHolder.
With 3.2 Bookkeeper::runOnAllThread could cut the overhead and the complexity of operation on deferred deletion.

WDYT?

bool isRecycleable() { return ref_count_.use_count() == 1; }

const std::unique_ptr<SlotImpl> slot_;
const std::shared_ptr<int> ref_count_{new int(0)};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shared_ptr as ref_count reminds me of the pattern I eliminated in envoy before. See #7011

if (deferred_deletes_.empty()) {
return;
}
main_thread_dispatcher_->post([this]() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this ought to be avoidable, but perhaps I'm missing something. We're using ref_count_ not as an integer, but just to have an object with a use count right?

If we made refcount_ a class, and had Bookkeeper have a reference to it (along with runOnAllThreads calls), and the refcount destructor did a dispatcher post(delete)) couldn't we avoid this extra layer of clean up complexity? We'd still delete when both the Bookkeeper and all thread references were gone, and we'd still delete in the original thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the problem with "posting back to main-thread when the propagation is done", there again is this race-window, where the posted callback in main-thread may gets run first, and then the "copy of lambda" in the worker thread may held the last reference of the refcount_.

IMHO, We should also avoid to move the ownership of a slotImpl or its owner, by posting a callback/complete_cb which shares a portion of the object as well. That may cause complication when ServerImpl tears down, I went into a crash with that, where I posted shared_ptr back to main thread dispatcher queue, during shutdown, the SingletonManager is teared down ahead of the DispatcherImpl in ServerImpl, which includes the RouteConfigProviderManager that's referenced by the RdsRouteCnofigProvider in the dispatcher-msg-queue.

The Bookeeper is inspiring WRT extending the ability of Slot.

I am not sure if the SlotHolder is too complex since it is kind of re-inventing shared_ptr. What's more the Bookeeper combining with SlotHolder is duplicating the deleter of shared_ptr.

My understanding of the bookkeeper is to track

  • how many runOnAllThread on the fly
  • The last of Bookkeeper destruction and runOnallThread should destruct the real slot

I propose to replace SlotHolder by shared_ptr, and bookeeper should
1 ref the shared_ptr until destruction (shared_ptr as a member)
2 increase the ref of shared_ptr at runOnAllThread (no more complex than capture it)
3. choose one from 3.1 and 3.2
3.1 BookKeeper set the deleter of the shared_ptr to moving to defer delete queue
3.2 rewrite the BookKeeper so that runOnAllThread enforce a oncomplete_cb to decrease the shared_ptr on main thread. See https://github.com/envoyproxy/envoy/pull/7011/files . 3.2 doesn't even need a defer delete queue.

My proposal not decrease the complexity of SlotHolder.
With 3.2 Bookkeeper::runOnAllThread could cut the overhead and the complexity of operation on deferred deletion.

WDYT?

The Bookeeper is inspiring WRT extending the ability of Slot.

I am not sure if the SlotHolder is too complex since it is kind of re-inventing shared_ptr. What's more the Bookeeper combining with SlotHolder is duplicating the deleter of shared_ptr.

My understanding of the bookkeeper is to track

  • how many runOnAllThread on the fly
  • The last of Bookkeeper destruction and runOnallThread should destruct the real slot

I propose to replace SlotHolder by shared_ptr, and bookeeper should
1 ref the shared_ptr until destruction (shared_ptr as a member)
2 increase the ref of shared_ptr at runOnAllThread (no more complex than capture it)
3. choose one from 3.1 and 3.2
3.1 BookKeeper set the deleter of the shared_ptr to moving to defer delete queue
3.2 rewrite the BookKeeper so that runOnAllThread enforce a oncomplete_cb to decrease the shared_ptr on main thread. See https://github.com/envoyproxy/envoy/pull/7011/files . 3.2 doesn't even need a defer delete queue.

My proposal not decrease the complexity of SlotHolder.
With 3.2 Bookkeeper::runOnAllThread could cut the overhead and the complexity of operation on deferred deletion.

WDYT?

Yeah, SlotHolder is basically the obj that holds a slot and its outgoing ref-count. I made a class instead using a shared pointer as I thought it may be more "readable". the ferre

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, SingletonManager early destruction. With this restriction I don't have a good opinion...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked w/ Alyssa offline, get the idea now. I believe this matches what lamndai suggested as well.
The "schedule a cleanup callback" logic is removed and each Slot Bookkeeper fires a cleanup callback on the ref-count's destruction.

…t callback on destruction of the ref-count shared_ptr

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
ASSERT(std::this_thread::get_id() == main_thread_id_);
ASSERT(slot != nullptr);
auto* slot_addr = slot.get();
deferred_deletes_.insert({slot_addr, std::move(slot)});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear on why we need this. Can't we have both Bookkeeper and threads have pointers to ref_count_, and have scheduleCleanup either straight up delete slot or schedule deletion via post? Why do we need the intermediate map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we destruct the Bookkeeper, we can't delete the SlotImpl yet if the ref_count > 1, we need to detach the SlotImpl instance from its owner, and wait for the ref_count to go to 0 and make the post-call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked offline, Alyssa suggested to remove the intermediate storage (deferred_deletes_ ) as well, IIUC, by: releasing the SlotImpl from the unique_ptr, having the ref_count shared_ptr post a callback on destruction to the Main dispatcher. There is a risk of leaking at shutdown time though: in Server::InstanceImpl::terminate(), workers are stopped after main dispatcher loop is exited, if a worker posts a cleanup callback, the callback will not be run, the SlotImpl instance pointed to by the raw pointer will be leaked. So the intermediate container is kinda unavoidable.

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
…achable-slot

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo the spelling error.

As an aside, I think we need to do some infrastructure work to make the cross-thread deletion need less rigmarole, but I think that's outside of the scope of this PR. Thanks for all the cleanup you were willing to do!

…achable-slot

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@stevenzzzz
Copy link
Contributor Author

/wait

@alyssawilk alyssawilk merged commit c5ffdda into envoyproxy:master Sep 11, 2019
danzh2010 pushed a commit to danzh2010/envoy that referenced this pull request Sep 24, 2019
…llThreads interface to Slot. (envoyproxy#8135)

See the issue in envoyproxy#7902, this PR is to make the SlotImpl detachable from its owner, by introducing a Booker object wraps around a SlotImpl, which bookkeeps all the on-the-fly update callbacks. And on its destruction, if there are still on-the-fly callbacks, move the SlotImpl to an deferred-delete queue, instead of destructing the SlotImpl which may cause an SEGV error.

More importantly, introduce a new runOnAllThreads(ThreadLocal::UpdateCb cb) API to Slot, which requests a Slot Owner to not assume that the Slot or its owner will out-live (in Main thread) the fired on-the-fly update callbacks, and should not capture the Slot or its owner in the update_cb.

Picked RDS and config-providers-framework as examples to demonstrate that this change works. {i.e., changed from the runOnAllThreads(Event::PostCb) to the new runOnAllThreads(TLS::UpdateCb) interface. }

Risk Level: Medium
Testing: unit test
Docs Changes: N/A
Release Notes: N/A
[Optional Fixes #Issue] envoyproxy#7902

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
danzh2010 pushed a commit to danzh2010/envoy that referenced this pull request Oct 4, 2019
…llThreads interface to Slot. (envoyproxy#8135)

See the issue in envoyproxy#7902, this PR is to make the SlotImpl detachable from its owner, by introducing a Booker object wraps around a SlotImpl, which bookkeeps all the on-the-fly update callbacks. And on its destruction, if there are still on-the-fly callbacks, move the SlotImpl to an deferred-delete queue, instead of destructing the SlotImpl which may cause an SEGV error.

More importantly, introduce a new runOnAllThreads(ThreadLocal::UpdateCb cb) API to Slot, which requests a Slot Owner to not assume that the Slot or its owner will out-live (in Main thread) the fired on-the-fly update callbacks, and should not capture the Slot or its owner in the update_cb.

Picked RDS and config-providers-framework as examples to demonstrate that this change works. {i.e., changed from the runOnAllThreads(Event::PostCb) to the new runOnAllThreads(TLS::UpdateCb) interface. }

Risk Level: Medium
Testing: unit test
Docs Changes: N/A
Release Notes: N/A
[Optional Fixes #Issue] envoyproxy#7902

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
danzh2010 pushed a commit to danzh2010/envoy that referenced this pull request Oct 4, 2019
…llThreads interface to Slot. (envoyproxy#8135)

See the issue in envoyproxy#7902, this PR is to make the SlotImpl detachable from its owner, by introducing a Booker object wraps around a SlotImpl, which bookkeeps all the on-the-fly update callbacks. And on its destruction, if there are still on-the-fly callbacks, move the SlotImpl to an deferred-delete queue, instead of destructing the SlotImpl which may cause an SEGV error.

More importantly, introduce a new runOnAllThreads(ThreadLocal::UpdateCb cb) API to Slot, which requests a Slot Owner to not assume that the Slot or its owner will out-live (in Main thread) the fired on-the-fly update callbacks, and should not capture the Slot or its owner in the update_cb.

Picked RDS and config-providers-framework as examples to demonstrate that this change works. {i.e., changed from the runOnAllThreads(Event::PostCb) to the new runOnAllThreads(TLS::UpdateCb) interface. }

Risk Level: Medium
Testing: unit test
Docs Changes: N/A
Release Notes: N/A
[Optional Fixes #Issue] envoyproxy#7902

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
jplevyak pushed a commit to jplevyak/envoy that referenced this pull request Oct 10, 2019
…llThreads interface to Slot. (envoyproxy#8135)

See the issue in envoyproxy#7902, this PR is to make the SlotImpl detachable from its owner, by introducing a Booker object wraps around a SlotImpl, which bookkeeps all the on-the-fly update callbacks. And on its destruction, if there are still on-the-fly callbacks, move the SlotImpl to an deferred-delete queue, instead of destructing the SlotImpl which may cause an SEGV error.

More importantly, introduce a new runOnAllThreads(ThreadLocal::UpdateCb cb) API to Slot, which requests a Slot Owner to not assume that the Slot or its owner will out-live (in Main thread) the fired on-the-fly update callbacks, and should not capture the Slot or its owner in the update_cb.

Picked RDS and config-providers-framework as examples to demonstrate that this change works. {i.e., changed from the runOnAllThreads(Event::PostCb) to the new runOnAllThreads(TLS::UpdateCb) interface. }

Risk Level: Medium
Testing: unit test
Docs Changes: N/A
Release Notes: N/A
[Optional Fixes #Issue] envoyproxy#7902

Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: John Plevyak <jplevyak@gmail.com>
PiotrSikora pushed a commit to PiotrSikora/envoy that referenced this pull request Oct 10, 2019
* ext_authz: add metadata_context to ext_authz filter (envoyproxy#7818)

This adds the ability to specify dynamic metadata (by namespace) to
send with the ext_authz check request. This allows one filter to
specify information that can be then used in evaluating an
authorization decision.

Risk Level: Medium. Optional feature/extension of existing filter
Testing: Unit testing
Docs Changes: Inline in attribute_context.proto and ext_authz.proto

Fixes envoyproxy#7699

Signed-off-by: Ben Plotnick <plotnick@yelp.com>

* fuzz: codec impl timeout fix + speed ups (envoyproxy#7963)

Some speed-ups and validations for codec impl fuzz test:

* validate actions aren't empty (another approach would be to scrub / clean these)
* limit actions to 1024
* require oneofs

Fixes OSS-Fuzz Issue:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=16481
Testing: local asan/libfuzzer exec/sec go from 25 to 50

Signed-off-by: Asra Ali <asraa@google.com>

* docs: more detail about tracking down deprecated features (envoyproxy#7972)

Risk Level: n/a (docs only)
Testing: n/a
Docs Changes: yes
Release Notes: no
envoyproxy#7945

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>

* Fix the alignement in optval of setsockopt when compiled with libc++. (envoyproxy#7958)

Description:
libc++ std::string may inline the data which results the memory is not
aligned to `void*`. Use vector instead to store the optval.

Detected by UBSAN with libc++ config. Preparation for envoyproxy#4251

Risk Level: Low
Testing: unittest locally
Docs Changes: N/A
Release Notes: N/A
Fixes envoyproxy#7968 

Signed-off-by: Lizan Zhou <lizan@tetrate.io>

* security: some intra-entity and 3rd party embargo clarifications. (envoyproxy#7977)

* security: some intra-entity and 3rd party embargo clarifications.

These came up in the last set of CVEs.

Signed-off-by: Harvey Tuch <htuch@google.com>

* protobuf: IWYU (envoyproxy#7989)

Include What You Use fix for source/common/protobuf/message_validator_impl.h.

Signed-off-by: Andres Guedez <aguedez@google.com>

* api: add name into filter chain (envoyproxy#7966)

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* rds: validate config in depth before update config dump (envoyproxy#7956)

Route config need deep validation for virtual host duplication check, regex check, per filter config validation etc, which PGV wasn't enough.

Risk Level: Low
Testing: regression test
Docs Changes: N/A
Release Notes: N/A

Fixes envoyproxy#7939

Signed-off-by: Lizan Zhou <lizan@tetrate.io>

* tls: maintain a free slot index set in TLS InstanceImpl to allocate in O(1… (envoyproxy#7979)

Signed-off-by: Xin Zhuang <stevenzzz@google.com>

* redis: handle invalid ip address from cluster slots and added tests (envoyproxy#7984)

Signed-off-by: Henry Yang <hyang@lyft.com>

* protobuf: report field numbers for unknown fields. (envoyproxy#7978)

Since binary proto won't have field names, report at least the field
numbers, as per
https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.unknown_field_set#UnknownField.

Also fix minor typo encountered while doing this work.

Risk level: Low
Testing: Unit tests added/updated.

Fixes envoyproxy#7937

Signed-off-by: Harvey Tuch <htuch@google.com>

* Content in envoy docs does not cover whole page (envoyproxy#7993)

Signed-off-by: Manish Kumar <manishjpiet@gmail.com>

* stats: Add option to switch between fake and real symbol-tables on the command-line. (envoyproxy#7882)

* Add option to switch between fake and real symbol-tables on the command-line.

Signed-off-by: Joshua Marantz <jmarantz@google.com>

* api config: add build rules for go protos (envoyproxy#7987)

Some BUILD files are missing build rules to generate go protos. envoyproxy/go-control-plane depends on these protos, so they should be exposed publicly. Added build rules to generate *.pb.go files.

Risk Level: Low
Testing: These rules were copied to google3 and tested internally. Unfortunately, I am having a bit of trouble with bazel build directly on these targets ("Package is considered deleted due to --deleted_packages"). Please let me know if there is a better way to test this change.

Signed-off-by: Teju Nareddy <nareddyt@google.com>

* test: don't use <experimental/filesystem> on macOS. (envoyproxy#8000)

Xcode 11 requires at least macOS 10.15 (upcoming) in order to use
either <experimental/filesystem> or C++17 <filesystem>.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>

*  event: adding the capability of creating an alarm with a given scope (envoyproxy#7920)

Precursor to envoyproxy#7782
Adding scope tracking functionality to the basic alarm functions.

Risk Level: Medium (should be a no-op but is a large enough refactor)
Testing: new unit tests
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>

* ext authz: add dns san support for ext authz service (envoyproxy#7948)

Adds support for DNS SAN in ext authz peer validation

Risk Level: Low
Testing: Added
Docs Changes: Added
Release Notes: N/A

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* accesslog: don't open log file with read flag (envoyproxy#7998)

Description:
File access log shouldn't need read access for a file.

Risk Level: Low
Testing: local in mac, CI
Docs Changes:
Release Notes:
Fixes envoyproxy#7997

Signed-off-by: Lizan Zhou <lizan@tetrate.io>

* protobuf: towards unifying PGV, deprecated and unknown field validation. (envoyproxy#8002)

This is part of envoyproxy#7980; basically, we want to leverage the recursive pass
that already exists for the deprecated check. This PR does not implement
the recursive behavior yet for unknown fields though, because there is a
ton of churn, so this PR just has the mechanical bits. We switch
plumbing of validation visitor into places such as anyConvert() and
instead pass this to MessageUtil::validate.

There are a bunch of future followups planned in additional PRs:
* Combine the recursive pass for unknown/deprecated check in
  MessageUtil::validate().
* Add mitigation for envoyproxy#5965 by copying to a temporary before recursive
  expansion.
* [Future] consider moving deprecated reporting into a message
  validation visitor handler.

Risk level: Low
Testing: Some new //test/common/protobuf::utility_test unit test.

Signed-off-by: Harvey Tuch <htuch@google.com>

* http: forwarding x-forwarded-proto from trusted proxies (envoyproxy#7995)

Trusting the x-forwarded-proto header from trusted proxies.
If Envoy is operating as an edge proxy but has a trusted hop in front, the trusted proxy should be allowed to set x-forwarded-proto and its x-forwarded-proto should be preserved.
Guarded by envoy.reloadable_features.trusted_forwarded_proto, default on.

Risk Level: Medium (L7 header changes) but guarded
Testing: new unit tests
Docs Changes: n/a
Release Notes: inline
Fixes envoyproxy#4496

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>

* build: adding an option to hard-fail when deprecated config is used. (envoyproxy#7962)

Adding a build option to default all deprecated protos off, and using it on the debug build.

Risk Level: Low
Testing: new UT
Docs Changes: inline
Release Notes: n/a
Fixes envoyproxy#7548

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>

* envoy_cc_library: add export of foo_with_external_headers (envoyproxy#8005)

Add a parallel native.cc_library to envoy_cc_library
for external projects that consume Envoy's libraries. This allows the consuming
project to disambiguate overlapping include paths when repository overlaying is used,
as it can now include envoy headers via external/envoy/...

Risk Level: Low
Testing: N/A

Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>

* ci: add fuzz test targets to ci (envoyproxy#7949)

Builds fuzz targets with asan+libfuzzer and runs them against their corpora. Our native bazel builds work, this PR integrates the asan+libfuzzer builds in to CI. The fuzz target binaries will be in your envoy docker build directory.

Invoke with the following for all fuzz targets, or a specified one.
./ci/run_envoy_docker.sh './ci/do_ci.sh bazel.fuzz'
./ci/run_envoy_docker.sh './ci/do_ci.sh bazel.fuzz //test/common/common:utility_fuzz_test'

Risk level: low
Signed-off-by: Asra Ali asraa@google.com

Signed-off-by: Asra Ali <asraa@google.com>

* tls: support BoringSSL private key async functionality (envoyproxy#6326)

This PR adds BoringSSL private key API abstraction, as discussed in envoyproxy#6248. All comments and discussion is welcomed to get the API sufficient for most private key API tasks.

The PR contains the proposed API and the way how it can be used from ssl_socket.h. Also there is some code showing how the PrivateKeyMethodProvider is coming from TLS certificate config. Two example private key method providers are included in the tests.

Description: tls: support BoringSSL private key async functionality
Risk Level: medium
Testing: two basic private key provider implementation
Docs Changes: TLS arch doc, cert.proto doc

Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com>

* use SymbolTableCreator rather than fakes in a few stray places. (envoyproxy#8006)

stats: use SymbolTableCreator rather than fakes in a few stray places. (envoyproxy#8006)

Signed-off-by: Joshua Marantz <jmarantz@google.com>

* [router] Add SRDS configUpdate impl (envoyproxy#7451)

This PR contains changes on the xRDS side for SRDS impl, cribbed from http://go/gh/stevenzzzz/pull/8/files#diff-2071ab0887162eac1fd177e89d83175a

* Add onConfigUpdate impl for SRDS subscription
* Remove scoped_config_manager as it's not used now.
* Move ScopedConfigInfo to scoped_config_impl.h/cc
* Add a hash to scopeKey and scopeKeyFragment, so we can look up scopekey by hash value in constant time when SRDS has many scopes.
* Add a initManager parameter to RDS createRdsRouteConfigProvider API interface, when creating RouteConfigProvider after listener/server warmed up, we need to specify a different initManager than the one from factoryContext to avoid an assertion failure. see related:envoyproxy#7617

This PR only latches a SRDS provider into the connection manager, the "conn manager using SRDS to make route decision" plus integration tests will be covered in a following PR.

Risk Level: LOW [not fully implemented].
Testing: unit tests

Signed-off-by: Xin Zhuang <stevenzzz@google.com>

* Fix version history (envoyproxy#8021)

Follow-up for envoyproxy#7995.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>

* tools: sync tool for envoyproxy/assignable team. (envoyproxy#8015)

Bulk update of team to match envoyproxy organization. While at it, cleaned up some venv stuff in
shell_utils.sh.

Risk level: Low
Testing: Synced 157 members from envoyproxy to envoyproxy/assignable.

Signed-off-by: Harvey Tuch <htuch@google.com>

* redis: fix onHostHealthUpdate got called before the cluster is resolved. (envoyproxy#8018)

Signed-off-by: Henry Yang <hyang@lyft.com>

* api/build: migrate UDPA proto tree to external cncf/udpa repository. (envoyproxy#8017)

This is a one-time movement of all UDPA content from envoyproxy/envoy to
cncf/udpa. The permanent home of UDPA will be
https://github.com/cncf/udpa.

Risk level: Low
Testing: Added UDPA service entry to build_test.

Signed-off-by: Harvey Tuch <htuch@google.com>

* http: tracking active session under L7 timers (envoyproxy#7782)

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>

* upstream: remove thread local cluster after triggering call backs (envoyproxy#8004)

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* upstream: Introducing close_connections_on_host_set_change property (envoyproxy#7675)

Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>

* upstream: delete stale TODO (envoyproxy#8028)

This was fixed in envoyproxy#7877

Signed-off-by: Matt Klein <mklein@lyft.com>

* Enhance comment about MonotonicTime (envoyproxy#8011)

Depending on the execution environment in which envoy is being run, it
is possible that some of the assumption on the clock are maybe not
holding as previously commented. With some sandboxing technologies the
clock does not reference the machine boot time but the sandbox boot
time. This invalidates the assumtpion that the first update in the
cluster_manager will most likely fall out of the windows and ends up
showing a non intuitive behavior difficult to catch.
This PR simply adds a comment that will allow the reader to consider
this option while reading to the code.

Signed-off-by: Flavio Crisciani <f.crisciani@gmail.com>

* build: some missing dep fixups for Google import. (envoyproxy#8026)

Signed-off-by: Harvey Tuch <htuch@google.com>

* introduce safe regex matcher based on re2 engine (envoyproxy#7878)

The libstdc++ std::regex implementation is not safe in all cases
for user provided input. This change deprecates the used of std::regex
in all user facing paths and introduces a new safe regex matcher with
an explicitly configurable engine, right now limited to Google's re2
regex engine. This is not a drop in replacement for std::regex as all
language features are not supported. As such we will go through a
deprecation period for the old regex engine.

Fixes envoyproxy#7728

Signed-off-by: Matt Klein <mklein@lyft.com>

* docs: reorganize configuration tree (envoyproxy#8027)

This is similar to what I did for the arch overview a while ago as
this section is also getting out of control.

Signed-off-by: Matt Klein <mklein@lyft.com>

* build: missing regex include. (envoyproxy#8032)

Signed-off-by: Harvey Tuch <htuch@google.com>

* [headermap] speedup for appending data (envoyproxy#8029)

For debug builds, performance testing and fuzzers reveal that when appending to a header, we scan both the existing value and the data to append for invalid characters. This PR moves the validation check to just the data that is appended, to avoid hangups on re-scanning long header values multiple times.

Testing: Added corpus entry that reveals time spent in validHeaderString

Signed-off-by: Asra Ali <asraa@google.com>

* eds: avoid send too many ClusterLoadAssignment requests  (envoyproxy#7976)

During initializing secondary clusters, for each initialized cluster, a ClusterLoadAssignment
request is sent to istio pilot with the cluster's name appended into request's resource_names
list. With a huge number of clusters(e.g 10k clusters), this behavior slows down Envoy's
initialization and consumes ton of memory. This change pauses ADS mux for ClusterLoadAssignment to avoid that.

Risk Level: Medium
Testing: tiny change, no test case added

Fixes envoyproxy#7955

Signed-off-by: lhuang8 <lhuang8@ebay.com>

* Set the bazel verison to 0.28.1 explicitly (envoyproxy#8037)

In theopenlab/openlab-zuul-jobs#622 , the OpenLab add the ability to set the bazel to specific version explicitly. This patch add the bazel role for the envoy job.

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

* Read_policy is not set correctly. (envoyproxy#8034)

Add more integration test and additional checks in the unit tests.

Signed-off-by: Henry Yang <hyang@lyft.com>

* admin: fix /server_info hot restart version (envoyproxy#8022)

Signed-off-by: Matt Klein <mklein@lyft.com>

* test: adding debug hints for integration test config failures (envoyproxy#8038)

Risk Level: n/a (test only)
Testing: manual
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>

* udp_listener: refactor ActiveUdpListener creation (envoyproxy#7884)

Signed-off-by: Dan Zhang <danzh@google.com>

* accesslog: implement TCP gRPC access logger (envoyproxy#7941)

Description:
Initial implementation for TCP gRPC access logger.

Risk Level: Low (extension only)
Testing: integration test
Docs Changes: Added
Release Notes: Added

Signed-off-by: Lizan Zhou <lizan@tetrate.io>

* tracing: add OpenCensus agent exporter support to OpenCensus driver. (envoyproxy#8023)

Signed-off-by: Emil Mikulic <g-easy@users.noreply.github.com>

* Exporting platform_impl_lib headers (envoyproxy#8045)

This allows consuming projects using repository overlaying to disambiguate overlapping include paths when it comes to platform_impl.h by going through envoy/external/...

Addendum to envoyproxy#8005

Risk Level: Low
Testing: N/A

Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com>

* access_log: minimal log file error handling (envoyproxy#7938)

Rather than ASSERT for a reasonably common error condition
(e.g. disk full) record a stat that indicates log file writing
failed. Also fixes a test race condition.

Risk Level: low
Testing: added stats checks
Docs Changes: documented new stat
Release Notes: updated

Signed-off-by: Stephan Zuercher <zuercher@gmail.com>

* tracing: add grpc-status and grpc-message to spans (envoyproxy#7996)

Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com>

* fuzz: add bounds to statsh flush interval (envoyproxy#8043)

Add PGV bounds to the stats flush interval (greater than 1ms and less than 5000ms) to prevent Envoy from hanging from too small of a flush time.

Risk Level: Low
Testing: Corpus Entry added
Fixes OSS-Fuzz issue
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=16300

Signed-off-by: Asra Ali <asraa@google.com>

* Improve tools/stack_decode.py (envoyproxy#8041)

Adjust tools/stack_decode.py to more obviously be Python 2 (not 3), and to work on stack traces that don't include the symbol names.

Risk Level: Low
Testing: Manually tested on a stack trace that one of our users sent us

Signed-off-by: Luke Shumaker <lukeshu@datawire.io>

* build: tell googletest to use absl stacktrace (envoyproxy#8047)

Description:
https://github.com/google/googletest/blob/d7003576dd133856432e2e07340f45926242cc3a/BUILD.bazel#L42

Risk Level: Low (test only)
Testing: CI
Docs Changes:
Release Notes:

Signed-off-by: Lizan Zhou <lizan@tetrate.io>

* Update references to local scripts to enable using build container for filter repos (envoyproxy#7907)

Description: This change enables using run_envoy_docker.sh to build envoy-filter-example
Risk Level: Low
Testing: Manually tested building envoy-filter-example using: envoy/ci/run_envoy_docker.sh './ci/do_ci.sh build'
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Santosh Kumar Cheler <scheler@arubanetworks.com>

* bazel: patch gRPC to fix Envoy builds with glibc v2.30 (envoyproxy#7971)

Description: the latest glibc (v2.30) declares its own `gettid()` function (see [0]) and this creates a naming conflict in gRPC which has a function with the same name.

Apply to gRPC [a patch](grpc/grpc#18950) which renames `gettid()` to `sys_gettid()`.

[0] https://sourceware.org/git/?p=glibc.git;a=commit;h=1d0fc213824eaa2a8f8c4385daaa698ee8fb7c92

Risk Level: low
Testing: unit tests
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Dmitry Rozhkov <dmitry.rozhkov@linux.intel.com>

* build: link C++ stdlib dynamically in sanitizer runs (envoyproxy#8019)

Description:
Sanitizers doesn't support static link, reverts envoyproxy#7929 and link lib(std)c++ dynamically in sanitizer runs. Addresses test issue for envoyproxy#4251. Added workaround in ASAN for envoyproxy#7647.

Risk Level: Low (test only)
Testing: CI, local libc++ runs
Docs Changes: N/A
Release Notes: N/A
Fixes envoyproxy#7928

* test: cleaning up test runtime (envoyproxy#8012)

Using the new runtime utility to clean up a bunch of test gorp. Yay utils!

Risk Level: n/a (test only)
Testing: tests pass
Docs Changes: n/a
Release Notes: n/a
Signed-off-by: Alyssa Wilk <alyssar@chromium.org>

*  test: improved coverage and handling of deprecated config (envoyproxy#8057)

Making ENVOY_DISABLE_DEPRECATED_FEATURES work for unit tests without runtime configured.
Fixing up a handful of unit tests to remove legacy code or use the handy
DEPRECATED_FEATURE_TEST macro
Adding back coverage of cors.enabled() and redis.catch_all_route()

Risk Level: Low (test only)
Testing: new unit tests
Docs Changes: n/a
Release Notes: n/a
Fixes envoyproxy#8013
Fixes envoyproxy#7548

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>

* [Docs typo] Remote Executioon -> Remote Execution (envoyproxy#8061)

Fixes mispelling of `Executioon` -> `Execution`

Signed-off-by: Colin Schoen <schoen@yelp.com>

* api: Fix duplicate java_outer_classname declarations (envoyproxy#8059)

The java_outer_classname is unintentionally duplicated in the new
udp_listener_config and regex proto files. This changes them to unique
names that match the predominant naming scheme.

Signed-off-by: Bryce Anderson <banderson@twitter.com>

* http: making the behavior of the response Server header configurable (envoyproxy#8014)

Default behavior remains unchanged, but now Envoy can override, override iff there's no server header from upstream, or always leave the server header (or lack thereof) unmodified.

Risk Level: low (config guarded change)
Testing: new unit tests
Docs Changes: n/a
Release Notes: inline
Fixes envoyproxy#6716

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>

* use bazelversion for filter-example too (envoyproxy#8069)

Signed-off-by: Lizan Zhou <lizan@tetrate.io>

* grpc-httpjson-transcode: Update for RFC2045 support (envoyproxy#8065)

RFC2045 (MIME) Base64 decoding support has been fixed upstream

Description: The grpc transcoding filter has been updated to support RFC2045 (MIME) based inputs for protobuf type "Bytes". This is important since Base64 is often using the RFC2045 format for inputs.
Also see: grpc-ecosystem/grpc-httpjson-transcoding#34

Risk Level: Low
Testing: Integration / Manual Tests
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Hans Viken Duedal <hans.duedal@visma.com>

* stats: Clean up all calls to Scope::counter() et al in production code. (envoyproxy#7842)

* Convert a few more counter() references to use the StatName interface.

Signed-off-by: Joshua Marantz <jmarantz@google.com>

* tls_inspector: inline the recv in the onAccept (envoyproxy#7951)

Description:
As discussed in envoyproxy#7864 this PR is the attempt to peek the socket at the invoke of onAccept.
Usually client_hello packet should be in the buffer when tls_inspector is peeking, we could save a poll cycle for this connection.

Once we agree on the solution I can apply to http_inspector as well.

The expecting latency improvement especially when poll cycle is large.

Benchmark:
Env:
hardware Intel(R) Xeon(R) CPU @ 2.20GHz
envoy: concurrency = 1, tls_inspector as listener filter. One tls filter chain, and one plain text filter chain.
load background: a [sniper](https://github.com/lubia/sniper) client with concurrency = 5 hitting the server with tls handshake, aiming to hit using the tls_filter chain. The qps is about 170/s
Another load client hitting the plain text filter chain but would go through tls_inspector with concurrency = 1

This PR: 
TransactionTime:              10.3 - 11.0 ms(mean)
Master                
TransactionTime:              12.3 - 12.8 ms(mean)

Risk Level: Med (ActiveSocket code is affected to adopt the side effect of onAccept)
Testing: 
Docs Changes:
Release Notes:
Fixes envoyproxy#7864

Signed-off-by: Yuchen Dai <silentdai@gmail.com>

* Fixes gcc 8.3.1 build failure due to FilterChainBenchmarkFixture::SetUp hiding base-class virtual functions (envoyproxy#8071)

Description: I'm seeing "bazel-out/k8-fastbuild/bin/external/com_github_google_benchmark/_virtual_includes/benchmark/benchmark/benchmark.h:1071:16: error: 'virtual void benchmark::Fixture::SetUp(benchmark::State&)' was hidden" when running tests. This resolves the issue with hiding of the base-class functions.
Risk Level: low
Testing:
Docs Changes:
Release Notes:

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>

* test: fix ups for various deprecated fields (envoyproxy#8068)

Takeaways: we've lost the ability to do empty regex (which was covered in router tests and is proto constraint validated on the new safe regex) as well as negative lookahead (also covered in tests) along with a host of other things conveniently documented as not supported here: https://github.com/google/re2/wiki/Syntax

Otherwise split up a bunch of tests, duplicated and tagged a bunch of tests, and cleaning up after we finally can remove deprecated fields again will be an order of magnitude easier.

Also fixing a dup relnote from envoyproxy#8014

Risk Level: n/a (test only)
Testing: yes. yes there is.
Docs Changes: no
Release Notes: no

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>

* include: add log dependency header to connection_handler.h (envoyproxy#8072)

Signed-off-by: Teju Nareddy <nareddyt@google.com>

* quiche: Update QUICHE dep (envoyproxy#8044)

Update QUICHE tar ball to 4abb566fbbc63df8fe7c1ac30b21632b9eb18d0c.
Add some new impl's for newly added api.

Risk Level: low
Testing: using quiche build in tests.
Part of envoyproxy#2557

Signed-off-by: Dan Zhang <danzh@google.com>

* tools: deprecated field check in Route Checker tool (envoyproxy#8058)

We need a way to run the deprecated field check on the RouteConfiguration. Today the schema check tool validates the bootstrap config. This change will help achieve similar functionality on routes served from rds.
Risk Level: Low
Testing: Manual testing
Docs Changes: included
Release Notes: included

Signed-off-by: Jyoti Mahapatra <jmahapatra@lyft.com>

* tracing: Add support for sending data in Zipkin v2 format (envoyproxy#6985)

Description: This patch supports sending a list of spans as JSON v2 and protobuf message over HTTP to Zipkin collector. [Sending protobuf](https://github.com/openzipkin/zipkin-api/blob/0.2.1/zipkin.proto) is considered to be more efficient than JSON, even compared to the v2's JSON (openzipkin/zipkin#2589 (comment)). This is an effort to rework envoyproxy#6798.

The approach is by serializing the v1 model to both v2 JSON and protobuf.

Risk Level: Low, since the default is still HTTP-JSON v1 based on https://github.com/openzipkin/zipkin-api/blob/0.2.2/zipkin-api.yaml.
Testing: Unit testing, manual integration test with real Zipkin collector server.
Docs Changes: Added
Release Notes: Added
Fixes: envoyproxy#4839

Signed-off-by: Dhi Aurrahman <dio@tetrate.io>
Signed-off-by: José Carlos Chávez <jcchavezs@gmail.com>

* Route Checker tool Fix code coverage bug in proto based schema (envoyproxy#8101)

Signed-off-by: Jyoti Mahapatra <jmahapatra@lyft.com>

* [hcm] Add scoped RDS routing into HCM (envoyproxy#7762)

Description: add Scoped RDS routing logic into HCM. Changes include:

* in ActiveStream constructor latch a ScopedConfig impl to the activeStream if SRDS is enabled
* in the beginning of ActiveStream::decodeHeaders(headers, end_stream), get routeConfig from latched ScopedConfig impl.

This PR is the 3rd in the srds impl PR chain: [envoyproxy#7704, envoyproxy#7451, this].

Risk Level: Medium
Testing: unit test and integration tests.
Release Notes: Add scoped RDS routing support into HCM.

Signed-off-by: Xin Zhuang <stevenzzz@google.com>

* owners: add @asraa and @lambdai to OWNERS. (envoyproxy#8110)

* @asraa is joining Envoy OSS security team.

* @lambdai is joining Friends of Envoy as v2 xDS point.

Signed-off-by: Harvey Tuch <htuch@google.com>

* protobuf: recursively validate unknown fields. (envoyproxy#8094)

This PR unifies the recursive traversal of deprecated fields with that of unknown fields. It doesn't
deal with moving to a validator visitor model for deprecation; this would be a nice cleanup that we
track at envoyproxy#8092.

Risk level: Low
Testing: New nested unknown field test added.

Fixes envoyproxy#7980

Signed-off-by: Harvey Tuch <htuch@google.com>

* Fuzz reuse (envoyproxy#8119)

This PR allows the envoy_cc_fuzz_test rule to be used when pulling in envoy. which can be useful when you're writing filters for envoy, and want to reuse the fuzzing architecture envoy has already built. other rules already allow for this (see envoy_cc_test in this same file for example).

Risk Level: Low
Testing:

Testing the Old Rule Still Works

It is possible to test the old rules still work (even without specifying a repository), by simply choosing your favorite fuzz test, and choosing to run bazel test on it. For example: bazel test //test/common/router:header_parser_fuzz_test. Any envoy_cc_fuzz_test rule should do.

Testing New Rules Work

I've done testing inside my own repository, but if you want to create your own test rule you can probably do the following in envoy-filter-example:

Checkout envoy-filter-example, and update the envoy submodule to this pr.
Follow the directions in: test/fuzz/README.md to define a envoy_cc_fuzz_test rule. Make sure to add a line for: repository = "@envoy" which is the new argument being added.
You should be able to run the fuzz test.

Signed-off-by: Cynthia Coan <ccoan@instructure.com>

* Set INCLUDE_DIRECTORIES so libcurl can find local urlapi.h (envoyproxy#8113)

Fixes envoyproxy#8112

Signed-off-by: John Millikin <jmillikin@stripe.com>

* cleanup: move test utility methods in ScopedRdsIntegrationTest to base class HttpIntegrationTest (envoyproxy#8108)

Fixes envoyproxy#8050
Risk Level: LOW [refactor only]

Signed-off-by: Xin Zhuang <stevenzzz@google.com>

* upstream: fix invalid access of ClusterMap iterator during warming cluster modification (envoyproxy#8106)

Risk Level: Medium
Testing: New unit test added. Fix verified via --config=asan.

Signed-off-by: Andres Guedez <aguedez@google.com>

* api:Add a flag to disable overprovisioning in ClusterLoadAssignment (envoyproxy#8080)

* api:Add a flag to disable overprovisioning in ClusterLoadAssignment

Signed-off-by: Jie Chen <jiechen@google.com>

* api:Add [#next-major-version and [#not-implemented-hide to the comment
for field of disable_overprovisioning in ClusterLoadAssignment
Signed-off-by: Jie Chen <jiechen@google.com>

* api:Refine comments for the new added bool flag as suggested.
Signed-off-by: Jie Chen <jiechen@google.com>

* api: clone v2[alpha] to v3alpha. (envoyproxy#8125)

This patch establishes a v3alpha baseline API, by doing a simple copy of
v2[alpha] dirs and some sed-style heuristic fixups of BUILD dependencies
and proto package namespaces.

The objective is provide a baseline which we can compare the output from
tooling described in envoyproxy#8083 in later PRs, providing smaller visual diffs.

The core philosophy of the API migration is that every step will be
captured in a script (at least until the last manual steps),
api/migration/v3alpha.sh. This script will capture deterministic
migration steps, allowing v2[alpha] to continue to be updated until we
finalize v3.

There is likely to be significant changes, e.g. in addition to the work
scoped for v3, we might want to reduce the amount of API churn by
referring back to v2 protos where it makes sense. This will be done via
tooling in later PRs.

Part of envoyproxy#8083.

Risk level: Low
Testing: build @envoy_api//...

Signed-off-by: Harvey Tuch <htuch@google.com>

* dubbo: Fix heartbeat packet parsing error (envoyproxy#8103)

Description: 
The heartbeat packet may carry data, and it is treated as null data when processing the heartbeat packet, causing some data to remain in the buffer.

Risk Level: low 
Testing: Existing unit test
Docs Changes: N/A
Release Notes: N/A
Fixes envoyproxy#7970 

Signed-off-by: tianqian.zyf <tianqian.zyf@alibaba-inc.com>

* stats: Shared cluster isolated stats (envoyproxy#8118)

* shared the main symbol-table with the isolated stats used for cluster info.

Signed-off-by: Joshua Marantz <jmarantz@google.com>

* protodoc: upgrade to Python 3. (envoyproxy#8129)

Risk level: Low
Testing: Rebuilt docs, manual inspection of some example generated files.

Signed-off-by: Harvey Tuch <htuch@google.com>

* protodoc: single source-of-truth for doc protos. (envoyproxy#8132)

This avoids having to list new docs protos in both docs/build.sh and
api/docs/BUILD. This technical debt cleanup is helpful in v3 proto work
to simplify collecting proto artifacts from a Bazel aspect.

Risk level: Low
Testing: docs/build.sh, visual inspection of docs.

Signed-off-by: Harvey Tuch <htuch@google.com>

* api: organize go_proto_libraries (envoyproxy#8003)

Fixes envoyproxy#7982

Defines a package level proto library and its associated internal go_proto_library.

Deletes all existing api_go_proto_library, api_go_grpc_library, and go_package annotations in protos (they are not required and pollute the sources).

I deliberately avoided touching anything under udpa since it's being moved to another repository.

Risk Level: low
Testing: build completes

Signed-off-by: Kuat Yessenov <kuat@google.com>

* api: straggler v2alpha1 -> v3alpha clone. (envoyproxy#8133)

These were missed in envoyproxy#8125.

Signed-off-by: Harvey Tuch <htuch@google.com>

* docs: remove extraneous escape (envoyproxy#8150)

Old versions of bash (e.g. on macOS) don't handle ${P/:/\/} the same way as modern versions. In particular, the expanded parameter on macOS includes a backslash, causing subsequent use of the string as a filename to include a slash (/) instead of treating the slash as a directory separator. Both versions of bash accept ${P/://} as a way to substitute : with /. Verified that this change does not alter the generated docs when running under Linux.

Risk Level: low
Testing: generated docs under linux & macOS

Signed-off-by: Stephan Zuercher <zuercher@gmail.com>

* Do not 503 on Upgrade: h2c instead remove the header and ignore. (envoyproxy#7981)

Description: When a request comes in on http1 with "upgrade: h2c", the current behavior is to 503.  Instead we should ignore the upgrade and remove the header and continue with the request as http1.
Risk Level: Medium
Testing: Unit test. Hand test with ithub.com/rdsubhas/java-istio client server locally.
Docs Changes: N/A
Release Notes:  http1: ignore and remove Upgrade: h2c.
Fixes istio/istio#16391

Signed-off-by: John Plevyak <jplevyak@gmail.com>

* docs: add line on installing xcode for macOS build flow (envoyproxy#8139)

Because of rules_foreign_cc in bazelbuild, Envoy will not compile successfully when following the instructions in the build docs due to how the tools are referenced. One fix for this is installing Xcode from the App Store (see bazel-contrib/rules_foreign_cc#185).

Risk Level: Low
Testing: N/A (docs change)
Docs Changes: see Description
Release Notes: N/A

Signed-off-by: Lisa Lu <lisalu@lyft.com>

* docs: note which header expressions cannot be used for request headers (envoyproxy#8138)

As discussed in envoyproxy#8127, some custom header expressions evaluate as
empty when used in request headers.

Risk Level: low, docs only
Testing: n/a
Docs Changes: updated
Release Notes: n/a

Signed-off-by: Stephan Zuercher <zuercher@gmail.com>

* api: use traffic_direction over operation_name if specified (envoyproxy#7999)

Use the listener-level field for the tracing direction over the per-filter field. Unfortunately, the per filter did not provide an "unspecified" default, so this appears to be the right approach to deprecate the per-filter field with minimal impact.

Risk Level: low (uses a newly introduce field traffic_direction)
Testing: unit test
Docs Changes: proto docs

Signed-off-by: Kuat Yessenov <kuat@google.com>

* add more diagnostic logs (envoyproxy#8153)

Istio sets listener filter timeout to 10ms by default but requests fail from time to tome. It's very difficult to debug. Even though downstream_pre_cx_timeout_ is exposed to track the number of timeouts, it would be better to have some debug logs.

Description: add more diagnostic logs
Risk Level: low

Signed-off-by: crazyxy <yxyan@google.com>

* http conn man: add tracing config for path length in tag (envoyproxy#8095)

This PR adds a configuration option for controlling the length of the request path that is included in the HttpUrl span tag. Currently, this length is hard-coded to 256. With this PR, that length will be configurable (defaulting to the old value).

Risk Level: Low
Testing: Unit
Docs Changes: Inline with the API proto. Documented new field.
Release Notes: Added in the PR.

Related issue: istio/istio#14563

Signed-off-by: Douglas Reid <douglas-reid@users.noreply.github.com>

* cds: Add general-purpose LB policy configuration (envoyproxy#7744)

This PR adds fields to CDS that allow for general-purpose LB policy configuration.

Risk Level: Low
Testing: None (but if anything is needed, please let me know)
Docs Changes: Inline with API protos
Release Notes: N/A

Signed-off-by: Mark D. Roth <roth@google.com>

* thrift_proxy: fix crash on invalid transport/protocol (envoyproxy#8143)

Transport/protocol decoder errors that occur before the connection manager
initializes an ActiveRPC to track the request caused a crash. Modifies the
connection manager to handle this case, terminating the downstream the
connection.

Risk Level: low
Testing: test case that triggers crash
Docs Changes: n/a
Release Notes: added

Signed-off-by: Stephan Zuercher <zuercher@gmail.com>

* api: strip gogoproto annotations (envoyproxy#8163)

Remove gogoproto annotations. They can be replaced with a custom gogoproto compiler (e.g. something like https://github.com/gogo/googleapis/tree/master/protoc-gen-gogogoogleapis). I have an experimental version of it to validate that it's possible to re-apply important annotations in the compiler.

Risk Level: low
Testing: builds

Signed-off-by: Kuat Yessenov <kuat@google.com>

* hotrestart: remove dynamic_resources from server config used by hotrestart_test (envoyproxy#8162)

In the server config file `test/config/integration/server.yaml` used by
//test/integration:hotrestart_test, `dynamic_resources` includes `lds_config`
and `cds_config` definitions, which use HTTP API to fetch config, but CDS and
LDS service do not exist, so the initial fetch will be failed with a
connection failure, then Envoy server continue startup.

Envoy server shouldn't continue startup because connection failure, see
issue envoyproxy#8046.

For this test, `dynamic_resources` is not needed, this change clean it up.

Signed-off-by: lhuang8 <lhuang8@ebay.com>

* clang-tidy: misc-unused-using-decls (envoyproxy#8159)

Description: clang-tidy check to flag unused using statements. There's a lot in test code that's just copy pasta, and it's hard to manually review whether it's being used, especially for things like using testing::_;
Risk Level: low
Testing: existing
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Derek Argueta <dereka@pinterest.com>

* build: curl with c-ares, nghttp2 and zlib (envoyproxy#8154)

Build curl dependency with async DNS resolver c-ares avoiding potential
crashes due to longjmp on modern kernels. Add zlib and nghttp2.
Use Envoy's version of all of the above libraries.

Signed-off-by: Taras Roshko <troshko@netflix.com>

* log: add upstream TLS info (envoyproxy#7911)

Description: add upstream TLS info for logging purposes

Refactor SSL connection info to be a shared pointer.
Use read-only interface.
Cache computed values in the SSL info object (this allows transition to remove the underlying SSL object if necessary).

Risk Level: medium due to use of bssl::SSL to back ConnectionInfo
Testing: unit
Docs Changes: none
Release Notes: add upstream TLS info

Signed-off-by: Kuat Yessenov <kuat@google.com>

* fix windows implementation of PlatformImpl (envoyproxy#8169)

Add missing destructor to class declaration.
Fix copy/paste errors.
These errors were apparently introduced in e1cd4cc.

Risk Level: Low
Testing: Passed Windows testing locally
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: William Rowe wrowe@pivotal.io
Signed-off-by: Yechiel Kalmenson <ykalmenson@pivotal.io>

* Update Opencensus SHA (envoyproxy#8173)

Signed-off-by: Pengyuan Bian <bianpengyuan@google.com>

* Outlier Detection: use gRPC status code for detecting failures (envoyproxy#7942)

Signed-off-by: ZhouyihaiDing <ddyihai@google.com>

* fix build (envoyproxy#8177)

Signed-off-by: Derek Argueta <dereka@pinterest.com>

* docs: improving websocket docs (envoyproxy#8156)

Making it clear H2 websockets don't work by default

Risk Level: n/a
Testing: n/a
Docs Changes: yes
Release Notes: no
envoyproxy#8147

Signed-off-by: Alyssa Wilk <alyssar@chromium.org>

* Upstream WebAssembly VM and Null VM from envoyproxy/envoy-wasm. (envoyproxy#8020)

Description: Upstream from envoyproxy/envoy-wasm the WebAssembly VM support along with the Null VM support and tests. This is the first PR dealing with WebAssembly filter support in envoy.  See https://github.com/envoyproxy/envoy-wasm/blob/master/WASM.md and https://github.com/envoyproxy/envoy-wasm/blob/master/docs/root/api-v2/config/wasm/wasm.rst for details.
Risk Level: Medium
Testing: Unit tests.
Docs Changes: N/A
Release Notes: N/A
Part of envoyproxy#4272 

Signed-off-by: John Plevyak <jplevyak@gmail.com>

* quiche: implement Envoy Quic stream and connection (envoyproxy#7721)

Implement QuicStream|Session|Disptacher in Envoy. Weir up QUIC stream and connection with HCM callbacks.

Risk Level: low, not in use
Testing: Added unit tests for all new classes
Part of envoyproxy#2557
Signed-off-by: Dan Zhang <danzh@google.com>

* protodoc/api_proto_plugin: generic API protoc plugin framework. (envoyproxy#8157)

Split out the generic plugin and FileDescriptorProto traversal bits from
protodoc. This is in aid of the work in envoyproxy#8082 ad envoyproxy#8083, where additional
protoc plugins will be responsible for v2 -> v3alpha API migrations and
translation code generation.

This is only the start really of the api_proto_plugin framework. I
anticipate additional bits of protodoc will move here later, including
field type analysis and oneof handling.

In some respects, this is a re-implementation of some of
https://github.com/lyft/protoc-gen-star in Python. The advantage is that
this is super lightweight, has few dependencies and can be easily
hacked. We also embed various bits of API business logic, e.g.
annotations, in the framework (for now).

Risk level: Low
Testing: diff -ru against previous protodoc.py RST output, identical modulo some
  trivial whitespace that doesn't appear in generated HTML. There are no
  real tests yet, I anticipate adding some golden proto style tests.

Signed-off-by: Harvey Tuch <htuch@google.com>

* adaptive concurrency: Gradient algorithm implementation (envoyproxy#7908)

Signed-off-by: Tony Allen <tallen@lyft.com>

* ext_authz: Check for cluster before sending HTTP request (envoyproxy#8144)

Signed-off-by: Dhi Aurrahman <dio@tetrate.io>

* make getters const-ref (envoyproxy#8192)

Description:
Follow-up to envoyproxy#7911 to make cached values be exposed as const-references, saving on a copy of a string during retrieval.

Risk Level: low
Testing: updated mocks to return references
Docs Changes: none
Release Notes: none

Signed-off-by: Kuat Yessenov <kuat@google.com>

* test: add curl features check (envoyproxy#8194)

Add a test ensuring curl was built with the expected features.

Description: Add a test ensuring curl was built with the expected features.
Risk Level: Low.
Testing: n/a.
Docs Changes: n/a.
Release Notes: n/a.

Signed-off-by: Taras Roshko <troshko@netflix.com>

* subset lb: allow ring hash/maglev LB to work with subsets (envoyproxy#8030)

* subset lb: allow ring hash/maglev LB to work with subsets

Skip initializing the thread aware LB for a cluster when the subset
load balancer is enabled. Also adds some extra checks for LB policies
that are incompatible with the subset load balancer.

Risk Level: low
Testing: test additional checks
Docs Changes: updated docs w.r.t subset lb compatibility
Release Notes: n/a
Fixes: envoyproxy#7651

Signed-off-by: Stephan Zuercher <zuercher@gmail.com>

* redis: add a request time metric to redis upstream (envoyproxy#7890)

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>

* bazel: update bazel to 0.29.1 (envoyproxy#8198)

Description:
Upgrade bazel to 0.29.1 and bazel-toolchains to corresponding version.

Risk Level: Low
Testing: CI
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: Lizan Zhou <lizan@tetrate.io>

* upstream: Add ability to disable host selection during panic (envoyproxy#8024)

Previously, when in a panic state, requests would be routed to all
hosts. In some cases it is instead preferable to not route any requests.
Add a configuration option for zone-aware load balancers which switches
from routing to all hosts to no hosts.

Closes envoyproxy#7550.

Signed-off-by: James Forcier jforcier@grubhub.com

Risk Level: Low
Testing: 2 new unit tests written; manual testing
Docs Changes: Note about new configuration option added
Release Notes: added

Signed-off-by: James Forcier <jforcier@grubhub.com>

* metrics service: flush histogram buckets (envoyproxy#8180)

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* tracing: fix random sample fraction percent (envoyproxy#8205)

Signed-off-by: Pengyuan Bian <bianpengyuan@google.com>

* stats: Add per-host memory usage test case to stats_integration_test (envoyproxy#8189)

Signed-off-by: Antonio Vicente <avd@google.com>

* router check tool: add flag for only printing failed tests (envoyproxy#8160)

Signed-off-by: Lisa Lu <lisalu@lyft.com>

* fix link to runtime docs (envoyproxy#8204)

Description: Looks like the runtime docs moved under operations/. The PR fixes the link.
Risk Level: low
Testing: existing
Docs Changes: this
Release Notes: n/a

Signed-off-by: Derek Argueta <dereka@pinterest.com>

* config: make SlotImpl detachable from its owner, and add a new runOnAllThreads interface to Slot. (envoyproxy#8135)

See the issue in envoyproxy#7902, this PR is to make the SlotImpl detachable from its owner, by introducing a Booker object wraps around a SlotImpl, which bookkeeps all the on-the-fly update callbacks. And on its destruction, if there are still on-the-fly callbacks, move the SlotImpl to an deferred-delete queue, instead of destructing the SlotImpl which may cause an SEGV error.

More importantly, introduce a new runOnAllThreads(ThreadLocal::UpdateCb cb) API to Slot, which requests a Slot Owner to not assume that the Slot or its owner will out-live (in Main thread) the fired on-the-fly update callbacks, and should not capture the Slot or its owner in the update_cb.

Picked RDS and config-providers-framework as examples to demonstrate that this change works. {i.e., changed from the runOnAllThreads(Event::PostCb) to the new runOnAllThreads(TLS::UpdateCb) interface. }

Risk Level: Medium
Testing: unit test
Docs Changes: N/A
Release Notes: N/A
[Optional Fixes #Issue] envoyproxy#7902

Signed-off-by: Xin Zhuang <stevenzzz@google.com>

* test: remove static config from subset lb integration test (envoyproxy#8203)

Build the config programmatically to make future API changes less
onerous.

Risk Level: low (test change only)
Testing: n/a
Doc Changes: n/a
Release Notes: n/a

Signed-off-by: Stephan Zuercher <zuercher@gmail.com>

* cleanup: clarify Cluster.filters and Dispatcher::createClientConnection (envoyproxy#8186)

Signed-off-by: Fred Douglas <fredlas@google.com>

* redis: health check is not sending the auth command on its connection (envoyproxy#8166)

Signed-off-by: Henry Yang <hyang@lyft.com>

* redis: mirroring should work when default value is zero, not just greater than zero (envoyproxy#8089)

Signed-off-by: Nicolas Flacco <nflacco@lyft.com>

* tools: regularize pip/venv for format_python_tools.py. (envoyproxy#8176)

As well as being a nice cleanup, this fixes some issues I had with local
Docker use of fix_format as a non-root user.

Signed-off-by: Harvey Tuch <htuch@google.com>

* absl: Absl hash hook in a couple of places rather than hash functors (envoyproxy#8179)

Signed-off-by: Joshua Marantz <jmarantz@google.com>

* Update dependency: jwt_verify_lib (envoyproxy#8212)

Signed-off-by: Daniel Grimm <dgrimm@redhat.com>

* upstream: add failure percentage-based outlier detection (envoyproxy#8130)

Description: Add a new outlier detection mode which compares each host's rate of request failure to a configured fixed threshold.

Risk Level: Low
Testing: 2 new unit tests added.
Docs Changes: New mode and config options described.
Release Notes: white_check_mark
Fixes envoyproxy#8105

Signed-off-by: James Forcier <jforcier@grubhub.com>

* Replace deprecated thread annotations macros. (envoyproxy#8237)

Abseil thread annotation macros are now prefixed by ABSL_.

There is no semantic change; this is just a rename.

Signed-off-by: Yan Avlasov <yavlasov@google.com>

* Update protoc-gen-validate (PGV) (envoyproxy#8234)

This picks up fixes for the Windows build and a C preprocessor defect

Signed-off-by: Yechiel Kalmenson <ykalmenson@pivotal.io>
Signed-off-by: William Rowe <wrowe@pivotal.io>

* upstream: use named constants for outlier detection config defaults (envoyproxy#8221)

Signed-off-by: James Forcier <jforcier@grubhub.com>

* server: add a post init lifecycle stage (envoyproxy#8217)

Signed-off-by: Jose Nino <jnino@lyft.com>

* docs: document access control conditions and attributes (envoyproxy#8230)

Signed-off-by: Kuat Yessenov <kuat@google.com>

* server: return processContext as optional reference (envoyproxy#8238)

Signed-off-by: Elisha Ziskind <eziskind@google.com>

* Update envoy.yaml in Redis proxy example (envoyproxy#8220)

Description: Make Redis example use catch_all_route.
Risk Level: Low.
Testing: Done. docker-compose up --build brings up envoy proxy and I was able to run Redis commands using redis-cli.

Signed-off-by: Raju Kadam <rkadam@atlassian.com>

* quiche: implement ActiveQuicListener (envoyproxy#7896)

Signed-off-by: Dan Zhang <danzh@google.com>

* srds: allow SRDS pass on scope-not-found queries to filter-chain (issue envoyproxy#8236).  (envoyproxy#8239)

Description: Allow a no-scope request to pass through the filter chain, so that some special queries (e.g., data plane health-check ) can be processed by the customized filter-chain. By default, the behavior is the same (404).
Risk Level: LOW
Testing: unit test and integration test.
Docs Changes: N/A
Release Notes: N/A
Fixes envoyproxy#8236
Signed-off-by: Xin Zhuang <stevenzzz@google.com>

* Updated to new envoyproxy master branch.

Signed-off-by: John Plevyak <jplevyak@gmail.com>

* Remove offending go proto option.

Signed-off-by: John Plevyak <jplevyak@gmail.com>

* Fix format/tidy issues.

Signed-off-by: John Plevyak <jplevyak@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants