nomad: deep dive into networking

# C

- we previously decided to bake envoy + consul into each docker image; hopefully this doesn't backfire on us with the nomad integration
  - working assumption: any issue we hit at the network layer is a pure knowledge gap
  - the architecture is sound unless evidence proves otherwise
- nomad has first-class consul (and vault) integration
- however, we are using nomad to start the consul service: let's see how this chicken-and-egg dependency plays out
  - best case scenario: 
    - we can leave the consul + envoy baked into each image, supporting interoperability between envs
    - we don't need to set up consul for nomad tasks
      - we just need to point upstreams to the consul allocation
      - this can be achieved via a template on each task that queries `nomad service X` to retrieve the service IP
    - register nomad clients with the consul agent for the task they're running
      - this is overkill: we just need to know where the services are deployed, then consul + envoy will take over
    - or perhaps set `group.service.provider = "nomad"`
      - worked perfectly 
 ---
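A rough sketch of that best case: core-consul registers with nomad's built-in service catalog, and an upstream task renders its address via the `nomadService` template function (see the template docs linked under A). Job, group, and task names, images, the `serf_lan` label, the `dc1` datacenter, and the `CONSUL_RETRY_JOIN` variable are all placeholders, not taken from the real job files. If more than one core-consul instance is registered, the variable is rendered once per instance, so this only suits the single-instance case.

```hcl
# Sketch only: names, images, and the env var are hypothetical placeholders.
job "core" {
  datacenters = ["dc1"] # placeholder

  group "core-consul" {
    network {
      port "serf_lan" {
        to = 8301 # consul's default LAN serf port inside the container
      }
    }

    # Register with nomad's own service catalog instead of consul.
    service {
      name     = "core-consul"
      port     = "serf_lan"
      provider = "nomad"
    }

    task "consul" {
      driver = "docker"

      config {
        image = "example/core-consul:latest" # placeholder
        ports = ["serf_lan"]
      }
    }
  }

  group "core-proxy" {
    task "proxy" {
      driver = "docker"

      config {
        image = "example/core-proxy:latest" # placeholder
      }

      # Render the address of the core-consul allocation into the task env,
      # so the consul agent baked into the image can be pointed at it.
      template {
        destination = "local/consul.env"
        env         = true
        data        = <<-EOT
          {{ range nomadService "core-consul" }}
          CONSUL_RETRY_JOIN={{ .Address }}:{{ .Port }}
          {{ end }}
        EOT
      }
    }
  }
}
```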
- the best case worked out perfectly; leaving the alternatives here for when I forget in the future
  - workaround scenario 1:
    - we create a user-defined network and have all clients join it
    - then upstreams can discover core-consul via nomad SRV records
    - all services use consul intentions to manage authn/authz anyway, so this shouldn't be too much of a security concern
  - workaround scenario 2:
    - we do a soft integration with consul + nomad, just for service discovery between allocations
    - one thing to watch for is redundant envoy + consul processes running
      - each container has a bootstrap file, baked into the image, for managing the consul agent + envoy sidecar
      - if we then run another consul + envoy process for nomad, that redundancy seems wasteful
  - worst case scenario: we have to remove consul + envoy from the image
    - this will require us to add additional docker services (1 for consul, 1 for envoy) for each application service in the compose file for development
      - definitely not something we want to do, which is why we baked them into the image
      - we would have to duplicate that logic in nomad for each env
        - also not something we want to do, for the same reason
  - less bad, but still worst-case scenario:
    - use nomad for development:
      - then having consul + envoy baked into the image becomes the problem, instead of this ticket
      - we can configure consul + envoy as a system job and it will automatically be provisioned on each client
        - this is idiomatic nomad
    - not something we want to do: nothing beats plain docker for development
      - which is, again, why we baked consul and envoy into the image
    - we have the validation env explicitly for running a prod-like environment without imposing restrictions/non-dev concerns on developers
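For reference, the system-job route mentioned above (not the one we chose) could look something like this. It is only a sketch: the job name, datacenter, image tag, and join address are placeholders, and an envoy task would need an equivalent definition.

```hcl
# Sketch only: job name, image tag, and join address are hypothetical.
job "consul-agent" {
  datacenters = ["dc1"]  # placeholder
  type        = "system" # nomad places one allocation on every eligible client

  group "agent" {
    task "consul" {
      driver = "docker"

      config {
        image        = "hashicorp/consul:1.14" # placeholder tag
        network_mode = "host"                  # agent shares the client's network
        args         = ["agent", "-retry-join=core-consul.example.internal"] # placeholder address
      }
    }
  }
}
```

With this in place the images would no longer need their own agent, which is exactly the coupling between dev and prod-like envs that the baked-in approach avoids.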


# T
- [ ] docker tasks use docker bridge and not nomad bridge, so we need to configure it
  - group.service: attrs to review
    - [ ] x 
  - group.network: attrs to review, and should be used instead of task when attrs clash
    - [ ] x
  - task.config.X: 
    - attrs to review
      - [x] extra_hosts
      - [ ] ports
        - [ ] do a manual review of this, the docker driver sets `NOMAD_PORT_<label>` in each container
      - [ ] network_aliases: unlike docker, we can use the nomad runtime vars to give containers distinct aliases, but this requires a user-defined network
    - attrs to avoid
      - hostname 
      - privileged
      - ipc_mode
      - ipv4_address
      - ipv6_address
    - must be configured at group.network
      - [ ] dns_search_domains 
      - [ ] dns_options
      - [ ] dns_servers
      - [ ] network_mode
  - [docker plugin conf](https://developer.hashicorp.com/nomad/docs/drivers/docker#plugin-options)  
    - check the `infra_image` attr: from the docs it appears nomad pins the default to a 3.1 tag
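A minimal sketch of where the attrs above end up, assuming we switch the group to nomad's bridge mode. The group, task, and port names, the image, and every address are placeholders; `extra_hosts` may need re-checking in bridge mode when a group has more than one task.

```hcl
# Sketch only: names, addresses, and images are hypothetical placeholders.
job "core" {
  datacenters = ["dc1"] # placeholder

  group "core-proxy" {
    network {
      mode = "bridge" # nomad-managed bridge instead of docker's default bridge

      port "https" {
        to = 443 # container port; the host port is assigned dynamically
      }

      # Group-level DNS settings, replacing the task-level dns_* attrs.
      dns {
        servers  = ["192.0.2.53"]       # placeholder resolver
        searches = ["example.internal"] # placeholder search domain
        options  = ["ndots:2"]
      }
    }

    task "proxy" {
      driver = "docker"

      config {
        image = "example/core-proxy:latest" # placeholder
        ports = ["https"]                   # port labels defined in group.network

        extra_hosts = ["core-consul:192.0.2.10"] # placeholder host:ip pair
      }
    }
  }
}
```

Each label listed in `ports` is exposed to the task as `NOMAD_PORT_<label>` (here `NOMAD_PORT_https`), which is the variable flagged for manual review above.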
 
# A
- see https://github.com/hashicorp/nomad/issues/12588
  - details exactly what we need to do with nomad service discovery
  - see https://developer.hashicorp.com/nomad/docs/job-specification/template#nomad-integration
  - see https://developer.hashicorp.com/nomad/tutorials/load-balancing/load-balancing-haproxy
  - see https://discuss.hashicorp.com/t/nomad-1-4-and-haproxy-server-template-without-consul-and-its-dns-feature/44499/2
---
- issue 1: chatter across allocations
- this was expected, as the config is pretty much copy-pasted from the docker convert env file
- the core-consul hostname (see log below) doesn't exist in validation
  - ^ it needs to point to the core-consul allocation IP
  - ^ or somehow discover on which client core-consul is allocated
- sanity check:
  - set static port allocations for all core-consul (especially serf) ports
  - hard-code the core-consul addr in the core-proxy `retry_join` attr
  - makes sense that it works with hard-coded values, since everything is running on my machine
  - still a useful sanity check
- real fix: discovery
```sh
--
2023-01-27T02:16:17.643Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=intention-match error="No known Consul servers" index=0
2023-01-27T02:16:17.643Z [ERROR] agent.proxycfg: Failed to handle update from watch: kind=connect-proxy proxy=core-proxy-1-sidecar-proxy service_id=core-proxy-1-sidecar-proxy id=intentions error="error filling agent cache: No known Consul servers"
--
2023-01-27T02:15:12.560Z [INFO]  agent.client.serf.lan: serf: Attempting re-join to previously known node: core-vault-247bb920bc1a: 172.21.0.2:8301
2023-01-27T02:15:12.918Z [INFO]  agent: (LAN) joining: lan_addresses=["core-consul"]
2023-01-27T02:15:12.941Z [WARN]  agent.router.manager: No servers available
2023-01-27T02:15:12.978Z [WARN]  agent.client.memberlist.lan: memberlist: Failed to resolve core-consul: lookup core-consul on 192.168.0.1:53: no such host
2023-01-27T02:15:12.978Z [WARN]  agent: (LAN) couldn't join: number_of_nodes=0
  error=
  | 1 error occurred:
  | 	* Failed to resolve core-consul: lookup core-consul on 192.168.0.1:53: no such host
  | 
  
2023-01-27T02:15:12.978Z [WARN]  agent: Join cluster failed, will retry: cluster=LAN retry_interval=10s
  error=
  | 1 error occurred:
  | 	* Failed to resolve core-consul: lookup core-consul on 192.168.0.1:53: no such host
  | 
```
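The sanity check can be sketched as a fragment of the core-consul group with static ports. The port labels are placeholders; the numbers are consul's defaults (8300 and 8301 also show up in the logs in this issue).

```hcl
# Sketch only: fragment of the core-consul group; port labels are placeholders.
group "core-consul" {
  network {
    port "server" {
      static = 8300 # server RPC
    }

    port "serf_lan" {
      static = 8301 # LAN gossip, the port the agents fail to join on above
    }

    port "http" {
      static = 8500 # HTTP API
    }
  }
}
```

With the ports fixed, the agent config baked into core-proxy can temporarily hard-code `retry_join = ["<client-ip>:8301"]`, where `<client-ip>` is whichever nomad client runs core-consul. That only holds while everything runs on one machine, hence the need for real discovery.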

- issue: token/acl

```sh
--
2023-01-27T04:28:12.087Z [INFO]  agent.client.serf.lan: serf: Attempting re-join to previously known node: core-proxy-da6a390b2832: 172.22.0.3:8301
127.0.0.1:53492 [27/Jan/2023:04:28:12.107] edge forward_https/serverhttps 1/-1/+0 +0 -- 1/1/0/0/1 0/0
2023-01-27T04:28:13.388Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=172.26.65.117:8300 error="rpc error making call: Permission denied: token with AccessorID 'bdad85af-9fc8-e41d-593f-c73cebef40fc' lacks permission 'node:write' on \"core-proxy-4652f5c62fdf\""
2023-01-27T04:28:13.388Z [WARN]  agent: Coordinate update blocked by ACLs: accessorID=bdad85af-9fc8-e41d-593f-c73cebef40fc
--
127.0.0.1:39828 [27/Jan/2023:04:28:22.108] edge forward_https/serverhttps 1/-1/+0 +0 -- 1/1/0/0/1 0/0
2023-01-27T04:28:22.580Z [ERROR] agent.client: RPC failed to server: method=Catalog.Register server=172.26.65.117:8300 error="rpc error making call: Permission denied: token with AccessorID 'bdad85af-9fc8-e41d-593f-c73cebef40fc' lacks permission 'node:write' on \"core-proxy-4652f5c62fdf\""
2023-01-27T04:28:22.580Z [WARN]  agent: Node info update blocked by ACLs: node=3bb036b6-c034-7abb-42df-01c8f7a5b1ea accessorID=bdad85af-9fc8-e41d-593f-c73cebef40fc
--
```
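The log says the agent token lacks `node:write` on its own node name, and that name carries a per-container suffix. One possible fix, as a sketch, is a prefix rule in the policy attached to that token. The prefix is inferred from the node names in the log; whether the token should be this broad is still an open question.

```hcl
# Sketch only: consul ACL policy rules (not a nomad job).
# Prefix inferred from the node names in the log (core-proxy-<container id>).
node_prefix "core-proxy-" {
  policy = "write"
}
```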
---
- issue: vault backend
- this makes sense because vault has been commented out, so haproxy cannot resolve either vault address

```sh

[NOTICE]   (15) : haproxy version is 2.7.1-3e4af0e
[NOTICE]   (15) : path to executable is /usr/local/sbin/haproxy
[WARNING]  (15) : config : [/var/lib/haproxy/configs/002-001-vault.cfg:19] : 'server lb-vault/core-vault-c-dns1' : could not resolve address 'core-vault.service.search', disabling server.
[WARNING]  (15) : config : [/var/lib/haproxy/configs/002-001-vault.cfg:20] : 'server lb-vault/core-vault-d-dns1' : could not resolve address 'core-vault', disabling server.
[NOTICE]   (15) : New worker (71) forked
[NOTICE]   (15) : Loading success.

```
