nomad: deep dive into networking #30

@noahehall

C

  • previously we made a decision to bake envoy + consul into each docker image, hopefully this doesnt backfire on us with nomad integration
    • any issues we're facing at the network layer are purely a knowledge gap
    • im sure the architecture is sound unless evidence proves otherwise
  • nomad has first class consul (and vault) integration,
  • however, we are using nomad to start the consul service: lets see how this chicken-and-egg dependency plays out
    • best case scenario:
      • we can leave the consul + envoy baked into each image, supporting interoperability between envs
      • we dont need to setup consul for nomad tasks
        • we just need to point upstreams to the consul allocation
        • this can be achieved via a template on each task that queries nomad service X to retrieve the service IP
      • register nomad clients with the consul agent for the task they're running
        • this is overkill: we just need to know where the services are deployed, then consul + envoy will take over
      • or perhaps set group.service.provider === nomad
        • worked perfectly
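
A minimal sketch of the best case above, assuming Nomad 1.3+ (native service discovery and the `nomadService` template function). The image names, the `dc1` datacenter, the `serf_lan` port label and the `CONSUL_RETRY_JOIN` variable are hypothetical; the bootstrap file baked into the image would need to read whatever variable is actually used.

```hcl
job "core" {
  datacenters = ["dc1"] # assumption

  group "consul" {
    network {
      # dynamic host port mapped to the serf LAN port inside the container
      port "serf_lan" {
        to = 8301
      }
    }

    # registered with nomad's own catalog instead of consul
    service {
      name     = "core-consul"
      port     = "serf_lan"
      provider = "nomad"
    }

    task "consul" {
      driver = "docker"

      config {
        image = "example/core-consul:latest" # hypothetical image
        ports = ["serf_lan"]
      }
    }
  }

  group "proxy" {
    task "proxy" {
      driver = "docker"

      config {
        image = "example/core-proxy:latest" # hypothetical image
      }

      # look up where core-consul was allocated and expose it as an env var
      template {
        destination = "local/consul.env"
        env         = true
        data        = <<EOT
{{ range nomadService "core-consul" }}
CONSUL_RETRY_JOIN={{ .Address }}:{{ .Port }}
{{ end }}
EOT
      }
    }
  }
}
```

If core-consul ever runs more than one instance, the loop emits the same key several times and only one value is kept, so the template would need to build a list instead.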

  • best case worked out perfectly: leaving this here for when I forget in the future
    • workaround scenario 1:
      • we create a user-defined network and have all clients join it
      • then upstreams can discover core-consul via nomad SRV records
      • all services use consul intentions anyway to manage authnz, so this shouldnt be too much of a security concern
    • workaround scenario 2:
      • we do a soft integration with consul + nomad, just for service discovery between allocations
      • one thing to watch for is redundant envoy + consul processes running
        • each container has a bootstrap file for managing the consul agent + envoy sidecar thats baked into the image
        • if we then run another consul + envoy process for nomad, that redundancy seems wasteful
    • worst case scenario: we have to remove consul + envoy from the image
      • this will require us to add additional docker services (1 for consul, 1 for envoy) for each application service in the compose file for development
        • definitely not something we want to do, hence why we baked them into the image
        • we will have to duplicate that logic in nomad for each env,
          • not something we want to do, hence why we baked them into the image
    • less worst case, but still worst case scenario
      • use nomad for development:
        • then having consul + envoy baked into the image will be the problem, instead of this ticket
        • we can configure consul + envoy as a system job and it will automatically be provisioned on each client
          • this is idiomatic nomad
      • not something we want to do, nothing beats plain docker for development
        • hence why we baked consul and envoy into the image
      • we have validation, explicitly for running prod-like environment without imposing restrictions/non-dev concerns on developers
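
For reference, the system-job option mentioned above could look roughly like this. It is only a sketch of the alternative we want to avoid: the image tag, the `dc1` datacenter and the join address are assumptions, and an envoy task would have to be added alongside the consul one.

```hcl
job "consul-agent" {
  datacenters = ["dc1"] # assumption
  type        = "system" # one allocation on every eligible client

  group "agent" {
    task "consul" {
      driver = "docker"

      config {
        image = "hashicorp/consul:1.14" # assumed tag
        # 192.0.2.10 is a placeholder for the core-consul server address
        args = ["agent", "-retry-join=192.0.2.10"]
      }
    }
  }
}
```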

T

  • docker tasks use docker bridge and not nomad bridge, so we need to configure it
    • group.service: attrs to review
      • x
    • group.network: attrs to review, and should be used instead of task when attrs clash
      • x
    • task.config.X:
      • attrs to review
        • extra_hosts
        • ports
          • do a manual review of this, the docker driver sets NOMAD_PORT_<label> in each container
        • network_aliases: unlike plain docker, we can use the nomad runtime vars to give each container a distinct alias, but this requires a user-defined network
      • attrs to avoid
        • hostname
        • privileged
        • ipc_mode
        • ipv4_address
        • ipv6_address
      • must be configured at group.network
        • dns_search_domains
        • dns_options
        • dns_servers
        • network_mode
    • docker plugin conf
      • check the infra_image attr, from the docs it appears nomad defaults it to a pause image pinned at 3.1
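
A rough sketch of how the attrs above split between group.network and task.config, assuming the default group network mode so the docker task stays on the docker bridge. Every address, domain, port label and image name here is a placeholder.

```hcl
group "proxy" {
  network {
    # port label "http" becomes NOMAD_PORT_http inside the task
    port "http" {
      to = 8080
    }

    # dns settings live on the group, not on task.config
    dns {
      servers  = ["192.0.2.53"]       # placeholder
      searches = ["service.example"]  # placeholder
      options  = ["ndots:2"]
    }
  }

  task "proxy" {
    driver = "docker"

    config {
      image = "example/core-proxy:latest" # hypothetical image
      ports = ["http"]

      # host:IP entries appended to /etc/hosts in the container
      extra_hosts = ["core-consul:192.0.2.10"]
    }
  }
}
```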

A


  • issue 1: chatter across allocations
  • this was expected, as config is pretty much copypasted from the docker convert env file
  • core-consul (see below) hostname doesnt exist in validation
    • ^ it needs to point to the core-consul allocation ip
    • ^ or somehow discover on which client core-consul is allocated
  • sanity check:
    • set static port allocations for all core-consul (especially serf) ports
    • hard code core consul addr in core proxy retry_join attr
    • makes sense that it works with hardcoded values: since everythings running on my machine
    • still a useful sanity check
  • real fix: discovery....
--
2023-01-27T02:16:17.643Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=intention-match error="No known Consul servers" index=0
2023-01-27T02:16:17.643Z [ERROR] agent.proxycfg: Failed to handle update from watch: kind=connect-proxy proxy=core-proxy-1-sidecar-proxy service_id=core-proxy-1-sidecar-proxy id=intentions error="error filling agent cache: No known Consul servers"
--
2023-01-27T02:15:12.560Z [INFO]  agent.client.serf.lan: serf: Attempting re-join to previously known node: core-vault-247bb920bc1a: 172.21.0.2:8301
2023-01-27T02:15:12.918Z [INFO]  agent: (LAN) joining: lan_addresses=["core-consul"]
2023-01-27T02:15:12.941Z [WARN]  agent.router.manager: No servers available
2023-01-27T02:15:12.978Z [WARN]  agent.client.memberlist.lan: memberlist: Failed to resolve core-consul: lookup core-consul on 192.168.0.1:53: no such host
2023-01-27T02:15:12.978Z [WARN]  agent: (LAN) couldn't join: number_of_nodes=0
  error=
  | 1 error occurred:
  | 	* Failed to resolve core-consul: lookup core-consul on 192.168.0.1:53: no such host
  | 
  
2023-01-27T02:15:12.978Z [WARN]  agent: Join cluster failed, will retry: cluster=LAN retry_interval=10s
  error=
  | 1 error occurred:
  | 	* Failed to resolve core-consul: lookup core-consul on 192.168.0.1:53: no such host
  | 
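
The sanity check described above (static ports plus a hardcoded retry_join) can be sketched as a jobspec fragment like the one below. The port labels, the 192.0.2.10 address, the image name and the template destination are assumptions; the baked-in bootstrap file would need to load the rendered file.

```hcl
group "consul" {
  network {
    # pin the consul ports so other allocations can rely on them
    port "server" {
      static = 8300
    }
    port "serf_lan" {
      static = 8301
    }
    port "http" {
      static = 8500
    }
  }
}

group "proxy" {
  task "proxy" {
    driver = "docker"

    config {
      image = "example/core-proxy:latest" # hypothetical image
    }

    # hardcoded join address: fine for a single-machine check only
    template {
      destination = "local/consul.d/join.hcl"
      data        = <<EOT
retry_join = ["192.0.2.10:8301"]
EOT
    }
  }
}
```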

  • issue: token/acl

--
2023-01-27T04:28:12.087Z [INFO]  agent.client.serf.lan: serf: Attempting re-join to previously known node: core-proxy-da6a390b2832: 172.22.0.3:8301
127.0.0.1:53492 [27/Jan/2023:04:28:12.107] edge forward_https/serverhttps 1/-1/+0 +0 -- 1/1/0/0/1 0/0
2023-01-27T04:28:13.388Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=172.26.65.117:8300 error="rpc error making call: Permission denied: token with AccessorID 'bdad85af-9fc8-e41d-593f-c73cebef40fc' lacks permission 'node:write' on \"core-proxy-4652f5c62fdf\""
2023-01-27T04:28:13.388Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=bdad85af-9fc8-e41d-593f-c73cebef40fc
--
127.0.0.1:39828 [27/Jan/2023:04:28:22.108] edge forward_https/serverhttps 1/-1/+0 +0 -- 1/1/0/0/1 0/0
2023-01-27T04:28:22.580Z [ERROR] agent.client: RPC failed to server: method=Catalog.Register server=172.26.65.117:8300 error="rpc error making call: Permission denied: token with AccessorID 'bdad85af-9fc8-e41d-593f-c73cebef40fc' lacks permission 'node:write' on \"core-proxy-4652f5c62fdf\""
2023-01-27T04:28:22.580Z [WARN]  agent: Node info update blocked by ACLs: node=3bb036b6-c034-7abb-42df-01c8f7a5b1ea accessorID=bdad85af-9fc8-e41d-593f-c73cebef40fc
--
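
The errors above say the agent token lacks `node:write` on the node name. One possible fix, not yet verified here, is a Consul ACL policy along these lines attached to that token; the `core-proxy-` prefix is taken from the node names in the logs, and the agent may well need further rules (e.g. service writes) that are not shown.

```hcl
# allow the agent to register and update any node named core-proxy-*
node_prefix "core-proxy-" {
  policy = "write"
}
```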

  • issue: vault backend
  • this makes sense because vault has been commented out
[NOTICE]   (15) : haproxy version is 2.7.1-3e4af0e
[NOTICE]   (15) : path to executable is /usr/local/sbin/haproxy
[WARNING]  (15) : config : [/var/lib/haproxy/configs/002-001-vault.cfg:19] : 'server lb-vault/core-vault-c-dns1' : could not resolve address 'core-vault.service.search', disabling server.
[WARNING]  (15) : config : [/var/lib/haproxy/configs/002-001-vault.cfg:20] : 'server lb-vault/core-vault-d-dns1' : could not resolve address 'core-vault', disabling server.
[NOTICE]   (15) : New worker (71) forked
[NOTICE]   (15) : Loading success.
