Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

networking: option to ignore static port collisions #23956

Merged
merged 1 commit into from
Sep 17, 2024
Merged

Conversation

gulducat
Copy link
Member

@gulducat gulducat commented Sep 13, 2024

tl;dr - Add an option to ignore port collisions for static ports during scheduling, if the group.network.mode is "host" (the default). This allows more than one copy of a program that uses SO_REUSEPORT to bind to the same port on a single node.

Note: some task drivers (like docker) may require additional config to set network_mode = "host", or you may encounter runtime issues after placement.

Closes #15487

Problem

Currently, static ports may be used only once on a single address on a network interface.

This is good, almost all of the time. Most programs bind exclusively to an addr:port, so other copies of the same program (or other programs) cannot multi-bind to it. The Nomad scheduler prevents placing allocations that would cause such port collisions to avoid a predictable program error at runtime.

Some programs, however, do allow addr:port reuse, via the SO_REUSEPORT socket option (on linux). For example, this little toy web server:

reuseport.py
#!/usr/bin/env python3

import http.server
import socket
import socketserver
from http import HTTPStatus
from os import environ

ADDR = environ.get('ADDR', '')
PORT = int(environ.get('PORT', 8000))
MESSAGE = environ.get('MESSAGE', 'hello')


class Handler(http.server.BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.head()
    def do_GET(self):
        self.head()
        self.wfile.write(bytes(f'{MESSAGE}!\n', 'utf-8'))
    def head(self):
        self.send_response(HTTPStatus.OK)
        self.send_header('Content-Type', 'text/html')
        self.end_headers()


class ReusableServer(socketserver.TCPServer):
    def server_bind(self):
        # socket.SO_REUSEPORT is the magic ingredient.
        self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        super().server_bind()


with ReusableServer((ADDR, PORT), Handler) as httpd:
    print(f"serving at {ADDR}:{PORT}", flush=True)
    httpd.serve_forever()

A more practical example is traefik, which also (optionally) supports it, and while I haven't taken a full inventory, I suspect other proxies do, too.

Nomad's network.port prevents this use case. To work around it, one may merely neglect to specify the port, and then still bind to it anyway, but this completely removes the guardrails that prevent other allocs from being mistakenly placed with no hope of success.

Solution

This PR adds a new ignore_collision field to job.group.network.port (only this scope, i.e. not task networks):

job "web" {
    group "g" {
        count = 2
        network {
            mode = "host" # the default, and required for this feature
            port "http" {
                static = 8000
                to     = 8000

                ignore_collision = true
            }
        }
        task "t" {
            driver = "docker"
            config {
                ...
                network_mode = "host" # required for docker or podman drivers
            ...
full example: reuseport.nomad.hcl

(uses reuseport.py from above)

job "reuseport" {
  group "docker" {
    count = 2
    network {
      mode = "host"
      port "http" {
        static = 8000
        to     = 8000

        ignore_collision = true
      }
    }
    task "t" {
      driver = "docker"
      config {
        image   = "python:slim"
        command = "python3"
        args    = ["/local/reuseport.py"]

        network_mode = "host"
      }
      template {
        destination = "local/reuseport.py"
        data        = file("reuseport.py")
      }
      env {
        MESSAGE = "hello, ${NOMAD_ALLOC_ID}"
      }
    }
  }
}

A job run like that can have multiple copies placed on the same host, then another job that requests the same static port without the new option will be placed elsewhere (or block pending the port becoming available).

This just gets the scheduler out of the way. It's up to the user (job author and/or cluster operator) to ensure placement on a node that does not already have some other thing listening on the port.

For example, I can imagine a client config like this:

client {
    node_pool = "web-proxy"
    reserved {
        reserved_ports = "8000"
    }
    ...

reserved_ports would prevent placement of any group that does not allow port reuse, ahead of the desired job, then node_pool (or some other constraint method) ensures the desired job gets placed on such a node.

I do wonder, though, if port.allow_reuse actually adds anything beyond merely excluding the port, if reserved_ports already prevents collisions from other jobs... Perhaps a system job that runs everywhere, all the time, so could grab the port without needing to reserve it in client config? I am a little hazy on the real value-add here.

Security-wise, this does not increase risk of some other program trying to squat on ports any more than a host-network'd job currently can, if it simply does not declare that it will use a port at all.

Questions:

  1. What specific use cases are solved by this new option?
  2. What specific use cases are not solved by it?
  3. Any UX or security concerns I haven't considered?
  4. Instead of allow_reuse, perhaps ignore_collisions is more precise? (and feels like a good warning)
    • I did go ahead and do this, because it's more reflective of the behavior of the parameter

@tgross
Copy link
Member

tgross commented Sep 13, 2024

I do wonder, though, if port.allow_reuse actually adds anything beyond merely excluding the port, if reserved_ports already prevents collisions from other jobs.

There are a couple of advantages:

  • Using reserved_ports and no network.port doesn't allow the job author to register the port by name as a service.
  • The reserved_ports configuration is owned by the cluster admin and not the job author, and cluster admins will usually want to get out of the way of job authors here (we expect reserved_ports is going to be for services the cluster admin is running outside of Nomad, like ssh)

@@ -7264,6 +7273,12 @@ func (tg *TaskGroup) validateNetworks() error {
err := fmt.Errorf("Port %q cannot be mapped to a port (%d) greater than %d", port.Label, port.To, math.MaxUint16)
mErr.Errors = append(mErr.Errors, err)
}

// TODO: is "" always "host" ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weirdly it looks like we don't canonicalize NetworkResource.Mode, but it's the default for the zero value.

Comment on lines 440 to 442
if !port.AllowReuse {
collide = true
reason := fmt.Sprintf("port %d already in use", port.Value)
reasons = append(reasons, reason)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need to call used.Set here for when we're ranking a node and adding that node's existing allocations to the NetworkIndex. If those allocations have port.allow_reuse we want them to collide with the new allocation if that allocation doesn't also have port.allow_reuse for that port.

(This will probably pop out once you've got test coverage in the scheduler code.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only get into this condition if used.Check() (line 438 just above here), i.e. if used.Set() (line 447) has already been called by some other prior entry here.

the full context:

used := idx.getUsedPortsFor(port.HostIP)
if used.Check(uint(port.Value)) {
// TODO: yes. 2.
if !port.AllowReuse {
collide = true
reason := fmt.Sprintf("port %d already in use", port.Value)
reasons = append(reasons, reason)
}
} else {
// TODO: i think #1 reads this later
used.Set(uint(port.Value))
}

(those "todo: yes. #." comments are the order in which I encountered collisions while stepping through breakpoints.)

In other words, Set() always happens if it hasn't already, regardless of AllowReuse, then subsequent Check()s just may or may not report collisions.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

nomad/structs/structs.go Show resolved Hide resolved
@@ -105,6 +105,12 @@ All other operating systems use the `host` networking mode.
- `host_network` `(string:nil)` - Designates the host network name to use when allocating
the port. When port mapping the host port will only forward traffic to the matched host
network address.
- `ignore_collision` `(bool: false)` - Allows the group to be placed on a node
where the port may already be reserved. Intended for programs that support
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to use the word "reserved" here? There's a specific client configuration option for "reserved ports" that may be confusing. Maybe "allocated" or even just "in use" (maybe the port isn't being used by a Nomad job at all)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually imagine that this may be used in conjunction with the reserved_ports client config, so a cluster operator may dedicate ports for applications like these, and run them with Nomad instead of systemd or such.

On the flip side, it also allows job authors to bypass operator intent... I suppose Sentinel may need to come into play here to retain current guarantees. 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought some more about bypassing operator intent around reserved_ports: a job author can already do that today, if they merely do not specify a port{}, so this does not make that situation any worse. (I have had to remind myself of this more than once.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point!

so more than one copy of a program can run
at a time on the same port with SO_REUSEPORT.

requires host network mode, and some drivers
(like docker) may also need
config {
  network_mode = "host"
}
but this is not validated prior to placement.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support SO_REUSEPORT for safe reuse of static ports
2 participants