Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always use file based heartbeat #19666

Merged
merged 8 commits into from
Jan 14, 2020

Conversation

carbonin
Copy link
Member

Based on #19665

@Fryguy
Copy link
Member

Fryguy commented Jan 2, 2020

#19665 is merged

@Fryguy Fryguy self-assigned this Jan 2, 2020
@Fryguy
Copy link
Member

Fryguy commented Jan 2, 2020

My only concern with the heartbeat file is that since we've only used it in a container so far, it had no chance for file collisions, since it's only one worker per container. So, I'm not sure if lines like this matter now -

guid ||= "miq_worker"

@carbonin carbonin force-pushed the always_use_file_based_heartbeat branch from ea28379 to 928af24 Compare January 6, 2020 19:38
@carbonin
Copy link
Member Author

carbonin commented Jan 6, 2020

@Fryguy That one is a bit misleading ... MiqWorker defines its own heartbeat_file method here

The MiqDefaults version is really just responsible for defining the location of the file and happens to deal with the env var as well.

@carbonin carbonin force-pushed the always_use_file_based_heartbeat branch from 928af24 to efb4593 Compare January 6, 2020 20:28
This separates getting messages from the server over drb from the
actual heartbeat which happens to the file.
The config change timestamp is stored in memcached and is checked
after each worker heartbeat. Sending this message over drb isn't
necessary anymore
…ectly

The indirection here was put in place to match the existing drb message
processing as closely as possible, now that that has been removed
this can be simplified.
Additionally remove spec for #process_message which was really
a spec for #sync_config
@carbonin
Copy link
Member Author

carbonin commented Jan 7, 2020

Waiting on ManageIQ/vmware_web_service#70 to be released so that the broker doesn't fail with:

[----] E, [2020-01-07T11:52:58.338666 #21305:2ae59a9565f0] ERROR -- : MIQ(MiqVimBrokerWorker::Runner) ID [463] PID [21305] GUID [36f41131-9bcb-4605-a314-fc3f5b419009] Error heartbeating because NameError: uninitialized constant Timeout::ExitException
/home/ncarboni/.gem/ruby/2.5.5/gems/vmware_web_service-0.4.7/lib/VMwareWebService/broker_timeout.rb:39:in `timeout'
/home/ncarboni/.gem/ruby/2.5.5/gems/dalli-2.7.6/lib/dalli/socket.rb:131:in `open'
/home/ncarboni/.gem/ruby/2.5.5/gems/dalli-2.7.6/lib/dalli/server.rb:586:in `connect'
/home/ncarboni/.gem/ruby/2.5.5/gems/dalli-2.7.6/lib/dalli/server.rb:95:in `alive?'
/home/ncarboni/.gem/ruby/2.5.5/gems/dalli-2.7.6/lib/dalli/options.rb:24:in `block in alive?'
/home/ncarboni/.rubies/ruby-2.5.5/lib/ruby/2.5.0/monitor.rb:226:in `mon_synchronize'
/home/ncarboni/.gem/ruby/2.5.5/gems/dalli-2.7.6/lib/dalli/options.rb:23:in `alive?'
/home/ncarboni/.gem/ruby/2.5.5/gems/dalli-2.7.6/lib/dalli/ring.rb:42:in `server_for_key'
/home/ncarboni/.gem/ruby/2.5.5/gems/dalli-2.7.6/lib/dalli/client.rb:361:in `perform'
/home/ncarboni/.gem/ruby/2.5.5/gems/dalli-2.7.6/lib/dalli/client.rb:58:in `get'
/home/ncarboni/Source/manageiq/app/models/miq_worker/runner.rb:378:in `server_last_change'
/home/ncarboni/Source/manageiq/app/models/miq_worker/runner.rb:362:in `config_out_of_date?'
/home/ncarboni/Source/manageiq/app/models/miq_worker/runner.rb:325:in `heartbeat'
/home/ncarboni/Source/manageiq/app/models/miq_worker/runner.rb:295:in `block in do_work_loop'
/home/ncarboni/Source/manageiq/app/models/miq_worker/runner.rb:293:in `loop'
/home/ncarboni/Source/manageiq/app/models/miq_worker/runner.rb:293:in `do_work_loop'
/home/ncarboni/Source/manageiq/app/models/miq_worker/runner.rb:137:in `run'
/home/ncarboni/Source/manageiq/app/models/miq_worker/runner.rb:118:in `start'
/home/ncarboni/Source/manageiq/lib/workers/bin/run_single_worker.rb:113:in `<main>' Worker exiting.

This was a problem for the EventCatcher when sync_config was called
as a part of worker initialization. The EventCatcher uses its ems
hostname as a part of its log prefix, but the ems isn't set up until
after the first call to sync_config.

This was leading to errors when the worker was starting:
[----] I, [2019-12-19T17:51:48.407958 ManageIQ#11054:2ab1a6d9a5f8]  INFO -- : Starting ManageIQ::Providers::Vmware::InfraManager::EventCatcher with runner options {:ems_id=>"2", :guid=>"43a315ba-39ff-4326-99a6-89b3af48b1ee"}
[----] I, [2019-12-19T17:51:48.451836 ManageIQ#11054:2ab1a6d9a5f8]  INFO -- : Deleting worker record for ManageIQ::Providers::Vmware::InfraManager::EventCatcher, id 310
/home/ncarboni/Source/manageiq/app/models/manageiq/providers/base_manager/event_catcher/runner.rb:63:in `log_prefix': undefined method `hostname' for nil:NilClass (NoMethodError)
        from /home/ncarboni/Source/manageiq/app/models/miq_worker/runner.rb:242:in `sync_config'
        from /home/ncarboni/Source/manageiq/app/models/miq_worker/runner.rb:52:in `worker_initialization'
        from /home/ncarboni/Source/manageiq/app/models/miq_worker/runner.rb:42:in `initialize'
        from /home/ncarboni/Source/manageiq/lib/workers/bin/run_single_worker.rb:113:in `new'
        from /home/ncarboni/Source/manageiq/lib/workers/bin/run_single_worker.rb:113:in `<main>'
@carbonin carbonin force-pushed the always_use_file_based_heartbeat branch from 17076f9 to 010055a Compare January 7, 2020 20:11
@carbonin carbonin changed the title [WIP] Always use file based heartbeat Always use file based heartbeat Jan 7, 2020
@miq-bot miq-bot removed the wip label Jan 7, 2020
@carbonin
Copy link
Member Author

carbonin commented Jan 7, 2020

@Fryguy this should be ready for real review.

@carbonin
Copy link
Member Author

carbonin commented Jan 7, 2020

Also, this is related to the pods and zones epic ManageIQ/manageiq-pods#353

@@ -32,8 +32,6 @@ def monitor_workers
persist_last_heartbeat(worker)
# Check the worker record for heartbeat timeouts
next unless validate_worker(worker)
# Tell the valid workers to sync config if needed
worker_set_message(worker, "sync_config") if resync_needed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😮 Awesome

Copy link
Member

@jrafanie jrafanie Jan 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though, how do the worker know when config is changed now?


if config_out_of_date?
_log.info("#{log_prefix} Synchronizing configuration...")
sync_config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, there it is. Ignore my above question.

worker_monitor_drb.worker_get_messages(@worker.pid).each do |msg, *args|
process_message(msg, *args)
end
rescue DRb::DRbError => err
do_exit("Error heartbeating to MiqServer because #{err.class.name}: #{err.message}", 1)
do_exit("Error processing messages from MiqServer because #{err.class.name}: #{err.message}", 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this PR is about removing DRb-based heartbeating, why do we have rescue DRb::DRbError here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method in particular is not for heartbeat. This fetches messages set for the worker in the @workers hash over drb.

The vim broker is the only worker that still uses this functionality (previously it was also used for config sync and stopping workers), so this can all be removed once the vim broker is gone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, processing messages and heartbeating are two different operations that we happen to execute at the same time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the clarification @carbonin

…ymore

The side-effects of this method are still needed
@miq-bot
Copy link
Member

miq-bot commented Jan 8, 2020

Checked commits carbonin/manageiq@849eec9~...fe52aad with ruby 2.5.5, rubocop 0.69.0, haml-lint 0.20.0, and yamllint 1.10.0
8 files checked, 0 offenses detected
Everything looks fine. ⭐

@Fryguy Fryguy merged commit 776b041 into ManageIQ:master Jan 14, 2020
@Fryguy Fryguy added this to the Sprint 128 Ending Jan 20, 2020 milestone Jan 14, 2020
@carbonin carbonin deleted the always_use_file_based_heartbeat branch April 23, 2020 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants