How to create a service in docker swarm

by Vikrant
December 28, 2017

In this article I am going to shed light on the service-creation part of docker swarm. A service consists of a set of tasks; a task is basically just a container.

Step 1 : As I don’t want to spin up any tasks on the manager nodes, I drain all three of them.

docker@manager1:~$ docker node update --availability drain manager1
manager1
docker@manager1:~$ docker node update --availability drain manager2
manager2
docker@manager1:~$ docker node update --availability drain manager3
manager3

Step 2 : Start the service and, once it is created, inspect it to get basic information about it.

docker@manager1:~$ docker service create --replicas 1 --name helloworld alpine ping docker.com
0dg3f4piq87pd3msg7jt3vpvg

docker@manager1:~$ docker service inspect --pretty helloworld

ID:             0dg3f4piq87pd3msg7jt3vpvg
Name:           helloworld
Service Mode:   Replicated
 Replicas:      1
Placement:
UpdateConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Update order:      stop-first
RollbackConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Rollback order:    stop-first
ContainerSpec:
 Image:         alpine:latest@sha256:ccba511b1d6b5f1d83825a94f9d5b05528db456d9cf14a1ea1db892c939cda64
 Args:          ping docker.com
Resources:
Endpoint Mode:  vip

Step 3 : Once the service is started, it spins up tasks on the worker nodes. In this case we started only one replica, hence a single container is running on the worker1 node.

docker@manager1:~$ docker service ps helloworld
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
i6np2a4k4m45        helloworld.1        alpine:latest       worker1             Running             Running 33 seconds ago    

Step 4 : It’s very easy to scale the service by changing the replica count. Checking the output shows that the helloworld service now has 3 tasks: two running on worker1 and one on worker2.

docker@manager1:~$ docker service scale helloworld=3
helloworld scaled to 3

docker@manager1:~$ docker service ps helloworld
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE             ERROR               PORTS
i6np2a4k4m45        helloworld.1        alpine:latest       worker1             Running             Running 58 seconds ago    
y5hfkwjwo6a0        helloworld.2        alpine:latest       worker2             Running             Preparing 3 seconds ago   
assn4ao8wgfv        helloworld.3        alpine:latest       worker1             Running             Running 3 seconds ago     

Step 5 : I am starting another service with a replica count of 3. Notably, I am specifying the --update-delay flag, which configures the delay between updates to a service task or set of tasks. I say set of tasks because you can also specify the --update-parallelism flag, which indicates the number of tasks to update simultaneously; by default tasks are updated one at a time.

docker@manager2:~$ docker service create --replicas 3 --name redis --update-delay 10s redis:3.0.6
v55rrm1uzzf160liccdwnklhe
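
As a hedged sketch of combining the two flags (the service name and image here are illustrative, not from the original session):

$ docker service create --replicas 6 --name web --update-delay 10s --update-parallelism 2 nginx:1.13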

Step 6 : Verify the nodes on which the container is started.

docker@manager2:~$ docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE               PORTS
0dg3f4piq87p        helloworld          replicated          3/3                 alpine:latest
v55rrm1uzzf1        redis               replicated          3/3                 redis:3.0.6

docker@manager2:~$ docker service ps redis
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE                ERROR               PORTS
f17y9ss7j660        redis.1             redis:3.0.6         worker1             Running             Running about a minute ago
wsm6ve007ojw        redis.2             redis:3.0.6         worker2             Running             Running about a minute ago
zamwx5kiisvt        redis.3             redis:3.0.6         worker2             Running             Running about a minute ago

Step 7 : Update the image version. You just need to issue one command to perform the upgrade; it is carried out automatically in a rolling manner.

docker@manager2:~$ docker service update --image redis:3.0.7 redis
redis

docker@manager2:~$ docker service ps redis
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE                 ERROR               PORTS
7ah6jvmw1zas        redis.1             redis:3.0.7         worker1             Running             Running about a minute ago
f17y9ss7j660         \_ redis.1         redis:3.0.6         worker1             Shutdown            Shutdown about a minute ago
x03a3zdcxrti        redis.2             redis:3.0.7         worker2             Running             Running about a minute ago
wsm6ve007ojw         \_ redis.2         redis:3.0.6         worker2             Shutdown            Shutdown about a minute ago
azrku5q6xnf6        redis.3             redis:3.0.7         worker2             Running             Running 53 seconds ago
zamwx5kiisvt         \_ redis.3         redis:3.0.6         worker2             Shutdown            Shutdown 53 seconds ago
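
As an aside (a sketch, not part of the original run): if the new image misbehaves, the same rolling mechanism can revert the service to the previously deployed image.

$ docker service update --rollback redis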

Useful docker swarm cluster administration commands

by Vikrant
December 27, 2017

In the previous articles I have shown the procedure to create a swarm cluster and run services on it. In this article I am going to provide some tips that docker swarm cluster admins can use in their daily work.

Q : After creating the cluster, how can we join additional nodes to it later?

A : The join token is printed while initializing the cluster, but it can also be retrieved later, in case you didn’t make a note of the terminal output while creating the cluster.

To get the token for joining a node as a worker or a manager:

$ docker swarm join-token worker
$ docker swarm join-token manager

Example outputs:

docker@manager1:~$ docker swarm join-token worker
To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-1icv5c0oj5p95syuuwiw0hh8ay4k7krrmp94u8urmis7gthylu-3y637fkgufrn3a84pnocgbug8 192.168.99.100:2377

docker@manager1:~$ docker swarm join-token manager
To add a manager to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-1icv5c0oj5p95syuuwiw0hh8ay4k7krrmp94u8urmis7gthylu-a5ycxnd5wf4bivzzkyifopf36 192.168.99.100:2377

We can use these commands to join worker or manager nodes. For example, to add a worker node to the cluster:

docker@worker2:~$ docker swarm join --token SWMTKN-1-1icv5c0oj5p95syuuwiw0hh8ay4k7krrmp94u8urmis7gthylu-3y637fkgufrn3a84pnocgbug8 192.168.99.100:2377
This node joined a swarm as a worker.

docker@manager1:~$ docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
xx648gzarpf33fruo8cqbg8dc *   manager1            Ready               Active              Leader
j1g6ylbezhhq6d6qwri85vkgq     worker1             Ready               Active
egy0qr3u15qr8p26p3wpt6ctr     worker2             Ready               Active

Similarly, to join the manager nodes into the cluster. In the following output we can see that one manager node is acting as the Leader and the other manager nodes show the Reachable state.

docker@manager2:~$ docker swarm join --token SWMTKN-1-1icv5c0oj5p95syuuwiw0hh8ay4k7krrmp94u8urmis7gthylu-a5ycxnd5wf4bivzzkyifopf36 192.168.99.100:2377
This node joined a swarm as a manager.

docker@manager3:~$ docker swarm join --token SWMTKN-1-1icv5c0oj5p95syuuwiw0hh8ay4k7krrmp94u8urmis7gthylu-a5ycxnd5wf4bivzzkyifopf36 192.168.99.100:2377
This node joined a swarm as a manager.

docker@manager3:~$ docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
xx648gzarpf33fruo8cqbg8dc     manager1            Ready               Active              Leader
9w5zd1lnxab8sin6tbozu0vm3     manager2            Ready               Active              Reachable
z0shyispea9f19j9u2oti89u5 *   manager3            Ready               Active              Reachable
j1g6ylbezhhq6d6qwri85vkgq     worker1             Ready               Active
egy0qr3u15qr8p26p3wpt6ctr     worker2             Ready               Active
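
A related aside (a sketch, not from the original session): if a join token is ever compromised, it can be invalidated and regenerated without affecting nodes that have already joined.

$ docker swarm join-token --rotate worker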

Q : How can we remove a node from the cluster?

A : First you need to know whether the node you plan to remove from the cluster is a worker or a manager. If it’s a manager node, first demote it from manager to worker; you can also remove a manager node directly, but the swarm will not be reconfigured to maintain quorum. In case of a worker node, a single command takes it out of the cluster.

Demote the manager node; it no longer shows the Reachable state because it has been turned into a worker node.

docker@manager2:~$ docker node demote manager2
Manager manager2 demoted in the swarm.

docker@manager1:~$ docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
xx648gzarpf33fruo8cqbg8dc *   manager1            Ready               Active              Leader
9w5zd1lnxab8sin6tbozu0vm3     manager2            Ready               Active
z0shyispea9f19j9u2oti89u5     manager3            Ready               Active              Reachable
j1g6ylbezhhq6d6qwri85vkgq     worker1             Ready               Active
egy0qr3u15qr8p26p3wpt6ctr     worker2             Ready               Active

I then issued the command on the manager2 node to leave the cluster; the node is in Down status now.

docker@manager2:~$ docker swarm leave
Node left the swarm.

docker@manager1:~$ docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
xx648gzarpf33fruo8cqbg8dc *   manager1            Ready               Active              Leader
9w5zd1lnxab8sin6tbozu0vm3     manager2            Down                Active
z0shyispea9f19j9u2oti89u5     manager3            Ready               Active              Reachable
j1g6ylbezhhq6d6qwri85vkgq     worker1             Ready               Active
egy0qr3u15qr8p26p3wpt6ctr     worker2             Ready               Active

In case of a worker node, just issue the docker swarm leave command on it to take it out of the cluster, as sketched below.
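
A minimal sketch, run on the worker itself (the hostname is illustrative):

docker@worker1:~$ docker swarm leave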

Q : By default, service containers are started on the manager nodes as well, which is not ideal because it can choke the manager node and lead to other issues. How can we stop scheduling containers on a manager node?

A : You need to drain the manager node to prevent containers from being scheduled on it. Issuing the following command will also move any running containers to other available nodes.

docker@manager1:~$ docker node update --availability drain manager1

Q : How to promote a worker node to act as a manager node?

A : As we are left with only two manager nodes after manager2 left the cluster, we can promote a worker node to act as a manager. As soon as you promote a worker node, it shows the Reachable state.

docker@manager1:~$ docker node promote worker1
Node worker1 promoted to a manager in the swarm.
docker@manager1:~$ docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
xx648gzarpf33fruo8cqbg8dc *   manager1            Ready               Active              Reachable
9w5zd1lnxab8sin6tbozu0vm3     manager2            Down                Active
z0shyispea9f19j9u2oti89u5     manager3            Ready               Active              Leader
j1g6ylbezhhq6d6qwri85vkgq     worker1             Ready               Active              Reachable
egy0qr3u15qr8p26p3wpt6ctr     worker2             Ready               Active

Q : How to move all the containers running on a worker node to other available nodes?

A : Just drain the worker node; all of its running containers will be moved to other available nodes.

docker@manager1:~$ docker node update --availability drain worker2
worker2
docker@manager1:~$ docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
xx648gzarpf33fruo8cqbg8dc *   manager1            Ready               Active              Reachable
9w5zd1lnxab8sin6tbozu0vm3     manager2            Down                Active
z0shyispea9f19j9u2oti89u5     manager3            Ready               Active              Leader
j1g6ylbezhhq6d6qwri85vkgq     worker1             Ready               Active              Reachable
egy0qr3u15qr8p26p3wpt6ctr     worker2             Ready               Drain

You can switch it back to Active once the maintenance on worker2 is complete.

docker@manager1:~$ docker node update --availability active worker2
worker2
docker@manager1:~$ docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
xx648gzarpf33fruo8cqbg8dc *   manager1            Ready               Active              Reachable
9w5zd1lnxab8sin6tbozu0vm3     manager2            Down                Active
z0shyispea9f19j9u2oti89u5     manager3            Ready               Active              Leader
j1g6ylbezhhq6d6qwri85vkgq     worker1             Ready               Active              Reachable
egy0qr3u15qr8p26p3wpt6ctr     worker2             Ready               Active

Q : How to disable the scheduling of new containers on a worker node while keeping the existing containers running?

A : Put the node in Pause mode so that no new tasks are scheduled on the worker2 node.

docker@manager1:~$ docker node update --availability pause worker2
worker2

docker@manager1:~$ docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
xx648gzarpf33fruo8cqbg8dc *   manager1            Ready               Active              Reachable
9w5zd1lnxab8sin6tbozu0vm3     manager2            Down                Active
z0shyispea9f19j9u2oti89u5     manager3            Ready               Active              Leader
j1g6ylbezhhq6d6qwri85vkgq     worker1             Ready               Active              Reachable
egy0qr3u15qr8p26p3wpt6ctr     worker2             Ready               Pause

Q : How to remove a node from the list shown in the docker node ls output?

A : Once a node leaves the cluster, it shows Down status in the docker node ls output.

For example, my manager2 node left the swarm cluster and I then issued the swarm join command on it again. Now I see two entries for the manager2 node: one in Down status and another in Ready/Active status.

docker@manager1:~$ docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
xx648gzarpf33fruo8cqbg8dc *   manager1            Ready               Active              Reachable
9w5zd1lnxab8sin6tbozu0vm3     manager2            Down                Active
i87iw5cs98vmbxxn6umu4zh72     manager2            Ready               Active              Reachable
z0shyispea9f19j9u2oti89u5     manager3            Ready               Active              Leader
j1g6ylbezhhq6d6qwri85vkgq     worker1             Ready               Active
egy0qr3u15qr8p26p3wpt6ctr     worker2             Ready               Active

To get rid of the Down entry for manager2, remove it by ID. Note: I used the node ID rather than the hostname, since two entries now share the hostname manager2.

docker@manager1:~$ docker node rm 9w5zd1lnxab8sin6tbozu0vm3
9w5zd1lnxab8sin6tbozu0vm3

docker@manager1:~$ docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
xx648gzarpf33fruo8cqbg8dc *   manager1            Ready               Active              Reachable
i87iw5cs98vmbxxn6umu4zh72     manager2            Ready               Active              Reachable
z0shyispea9f19j9u2oti89u5     manager3            Ready               Active              Leader
j1g6ylbezhhq6d6qwri85vkgq     worker1             Ready               Active
egy0qr3u15qr8p26p3wpt6ctr     worker2             Ready               Active

Q : How to add or remove label metadata on a node?

A : Node labels provide a flexible method of node organization. You can also use node labels in service constraints, as sketched below.
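
A minimal sketch (the label key/value, node, and service names here are illustrative, not from the original session):

$ docker node update --label-add zone=east worker1    # attach a label to the node
$ docker node update --label-rm zone worker1          # remove the label again
$ docker service create --name web --constraint 'node.labels.zone == east' nginx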


How to troubleshoot docker swarm networking issues

by Vikrant
December 25, 2017

Continuing the docker swarm series: in previous articles I discussed docker swarm cluster creation and some basics of docker swarm networking. In this article, we will dig deeper into docker swarm network troubleshooting. This article is based on the awesome work done by Sreenivas (https://sreeninet.wordpress.com/2017/11/02/docker-networking-tip-troubleshooting/).

To demonstrate the network troubleshooting, I am going to start a vote application with a replica count of 2, along with a client container.

Step 1 : Create an overlay network which will be used by the vote and client services.

docker@manager1:~$ docker network create -d overlay overlay1

Step 2 : The following commands start the client and vote services. Both services are attached to the overlay1 network created in the previous step.

docker@manager1:~$ docker service create --replicas 1 --name client --network overlay1 smakam/myubuntu:v4 sleep infinity
mnaihkfak6kqpdvwmy4j95esg

docker@manager1:~$ docker service create --mode replicated --replicas 2 --name vote --network overlay1 --publish mode=ingress,target=80,published=8080 instavote/vote
2z5g8tn1ohnpnd82n41koqtvh

Step 3 : Verify that the services started successfully. We can use the following commands to see on which nodes the services are running.

docker@manager1:~$ docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                   PORTS
mnaihkfak6kq        client              replicated          1/1                 smakam/myubuntu:v4
2z5g8tn1ohnp        vote                replicated          0/2                 instavote/vote:latest   *:8080->80/tcp

docker@manager1:~$ docker service ps client
ID                  NAME                IMAGE                NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
nnieiubwgj30        client.1            smakam/myubuntu:v4   manager1            Running             Running 14 minutes ago

docker@manager1:~$ docker service ps vote
ID                  NAME                IMAGE                   NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
njrh2ofmap06        vote.1              instavote/vote:latest   worker1             Running             Running 13 minutes ago
ij6fmmcx71po        vote.2              instavote/vote:latest   manager1            Running             Running 13 minutes ago

Step 4 : Let’s check the containers which are running on the manager1 node.

docker@manager1:~$ docker ps
CONTAINER ID        IMAGE                   COMMAND                  CREATED             STATUS              PORTS               NAMES
1c06334e3589        instavote/vote:latest   "gunicorn app:app ..."   5 hours ago         Up 5 hours          80/tcp              vote.2.ij6fmmcx71po5sjeessn9wl0j
d38975856cfd        smakam/myubuntu:v4      "sleep infinity"         5 hours ago         Up 5 hours                              client.1.nnieiubwgj30js73iwzxfleoe

Step 5 : Log in to each container to see the network interfaces present inside it. Two containers (one client and one vote) are running on the manager1 node and one container (vote) is running on the worker1 node.

The client container has two interfaces: eth0 from the overlay1 network and eth1 from the docker_gwbridge network. Keep a note of the IP address (10.0.0.2/32) assigned to the loopback interface of the client container; this is a VIP.

docker@manager1:~$ docker exec -it d38975856cfd sh
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.0.0.2/32 scope global lo
       valid_lft forever preferred_lft forever
23: eth0@if24: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:00:00:03 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.3/24 scope global eth0
       valid_lft forever preferred_lft forever
25: eth1@if26: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.3/16 scope global eth1
       valid_lft forever preferred_lft forever

The vote application container has three interfaces: eth0 from the ingress network, eth1 from docker_gwbridge, and eth2 from the overlay network. Again, make a note of the virtual IPs present on the loopback interface: two VIPs, 10.255.0.4/32 and 10.0.0.4/32, are present on lo, one from the ingress network and the other from the overlay network.

docker@manager1:~$ docker exec -it 1c06334e3589 sh
/app # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.255.0.4/32 scope global lo
       valid_lft forever preferred_lft forever
    inet 10.0.0.4/32 scope global lo
       valid_lft forever preferred_lft forever
27: eth0@if28: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP
    link/ether 02:42:0a:ff:00:06 brd ff:ff:ff:ff:ff:ff
    inet 10.255.0.6/16 scope global eth0
       valid_lft forever preferred_lft forever
29: eth1@if30: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:12:00:04 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.4/16 scope global eth1
       valid_lft forever preferred_lft forever
31: eth2@if32: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP
    link/ether 02:42:0a:00:00:06 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.6/24 scope global eth2
       valid_lft forever preferred_lft forever
	   
Similarly, on the worker1 node:
	   
docker@worker1:~$ docker exec -it 3ab467754bf5 sh
/app # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.255.0.4/32 scope global lo
       valid_lft forever preferred_lft forever
    inet 10.0.0.4/32 scope global lo
       valid_lft forever preferred_lft forever
18: eth0@if19: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP
    link/ether 02:42:0a:ff:00:05 brd ff:ff:ff:ff:ff:ff
    inet 10.255.0.5/16 scope global eth0
       valid_lft forever preferred_lft forever
20: eth1@if21: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
    link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.3/16 scope global eth1
       valid_lft forever preferred_lft forever
23: eth2@if24: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP
    link/ether 02:42:0a:00:00:05 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.5/24 scope global eth2
       valid_lft forever preferred_lft forever

The question may arise why two VIPs are assigned to the vote service. The service can be accessed in two ways: either from the client container or from the host machine. When it is accessed from the client, the overlay1 network is used to reach the service. When it is accessed via the host machine, the ingress routing-mesh network provides the access.

Step 6 : Inspecting the vote service shows the VIPs used to access the service. These are the same IPs that are assigned to the lo interface of the vote application containers.

docker@manager1:~$ docker service inspect vote
	   
            "VirtualIPs": [
                {
                    "NetworkID": "sh0h7as3prit3pd8nhqdbv6x3",
                    "Addr": "10.255.0.4/16"
                },
                {
                    "NetworkID": "uhtur2cqnefffagaih0q5hpbp",
                    "Addr": "10.0.0.4/24"
                }
            ]
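
The NetworkIDs above can be mapped back to network names; a quick sketch using the IDs from this output (expected names are based on the docker network ls output shown in an earlier article):

$ docker network inspect --format '{{ .Name }}' sh0h7as3prit3pd8nhqdbv6x3    # ingress
$ docker network inspect --format '{{ .Name }}' uhtur2cqnefffagaih0q5hpbp    # overlay1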

Step 7 : Log into the client container and access the vote application. As the curl output shows, traffic is being load balanced across the two containers. Note : we didn’t use an IP address to reach the containers, as swarm automatically provides DNS-based service discovery.

docker@manager1:~$ docker exec -it d38975856cfd sh	   

# curl vote | grep "container ID"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3162  100  3162    0     0   151k      0 --:--:-- --:--:-- --:--:--  154k
          Processed by container ID 1c06334e3589
#  curl vote | grep "container ID"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3162  100  3162    0     0   266k      0 --:--:-- --:--:-- --:--:--  280k
          Processed by container ID 3ab467754bf5
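
To see that DNS-based discovery at work, a hedged sketch from inside the client container (assuming a resolver utility such as nslookup is available in the image); the name vote should resolve to the service VIP 10.0.0.4:

# nslookup vote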

Step 8 : While running the curl command, tcpdump on the overlay interface shows the back-and-forth traffic. As network troubleshooting tools are not present inside the default vote application image, I attached a netshoot container to the vote application container running on the manager1 node. We can then watch the traffic in tcpdump while performing curl against the vote application.

docker@manager1:~$ docker run -it --net container:1c06334e3589 nicolaka/netshoot
/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.255.0.4/32 scope global lo
       valid_lft forever preferred_lft forever
    inet 10.0.0.4/32 scope global lo
       valid_lft forever preferred_lft forever
27: eth0@if28: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:ff:00:06 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.255.0.6/16 scope global eth0
       valid_lft forever preferred_lft forever
29: eth1@if30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:04 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 172.18.0.4/16 scope global eth1
       valid_lft forever preferred_lft forever
31: eth2@if32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:00:00:06 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet 10.0.0.6/24 scope global eth2
       valid_lft forever preferred_lft forever


/ # tcpdump -s0 -i eth2 -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2, link-type EN10MB (Ethernet), capture size 262144 bytes
14:52:47.525076 IP 10.0.0.3.48366 > 10.0.0.5.80: Flags [S], seq 3360028736, win 28200, options [mss 1410,sackOK,TS val 2853665 ecr 0,nop,wscale 7], length 0
14:52:52.112751 IP 10.0.0.3.48368 > 10.0.0.6.80: Flags [S], seq 1598409449, win 28200, options [mss 1410,sackOK,TS val 2854124 ecr 0,nop,wscale 7], length 0
14:52:52.112780 IP 10.0.0.6.80 > 10.0.0.3.48368: Flags [S.], seq 3328134166, ack 1598409450, win 27960, options [mss 1410,sackOK,TS val 2854124 ecr 2854124,nop,wscale 7], length 0
14:52:52.112806 IP 10.0.0.3.48368 > 10.0.0.6.80: Flags [.], ack 1, win 221, options [nop,nop,TS val 2854124 ecr 2854124], length 0
14:52:52.113945 IP 10.0.0.3.48368 > 10.0.0.6.80: Flags [P.], seq 1:69, ack 1, win 221, options [nop,nop,TS val 2854124 ecr 2854124], length 68: HTTP: GET / HTTP/1.1
14:52:52.114077 IP 10.0.0.6.80 > 10.0.0.3.48368: Flags [.], ack 69, win 219, options [nop,nop,TS val 2854124 ecr 2854124], length 0
14:52:52.126706 IP 10.0.0.6.80 > 10.0.0.3.48368: Flags [P.], seq 1:210, ack 69, win 219, options [nop,nop,TS val 2854125 ecr 2854124], length 209: HTTP: HTTP/1.1 200 OK
14:52:52.126766 IP 10.0.0.3.48368 > 10.0.0.6.80: Flags [.], ack 210, win 229, options [nop,nop,TS val 2854125 ecr 2854125], length 0
14:52:52.127107 IP 10.0.0.6.80 > 10.0.0.3.48368: Flags [P.], seq 210:3372, ack 69, win 219, options [nop,nop,TS val 2854125 ecr 2854125], length 3162: HTTP
14:52:52.127133 IP 10.0.0.3.48368 > 10.0.0.6.80: Flags [.], ack 3372, win 279, options [nop,nop,TS val 2854125 ecr 2854125], length 0
14:52:52.130372 IP 10.0.0.3.48368 > 10.0.0.6.80: Flags [F.], seq 69, ack 3372, win 279, options [nop,nop,TS val 2854126 ecr 2854125], length 0
14:52:52.134746 IP 10.0.0.6.80 > 10.0.0.3.48368: Flags [F.], seq 3372, ack 70, win 219, options [nop,nop,TS val 2854126 ecr 2854126], length 0
14:52:52.134799 IP 10.0.0.3.48368 > 10.0.0.6.80: Flags [.], ack 3373, win 279, options [nop,nop,TS val 2854126 ecr 2854126], length 0

Step 9 : Let’s check how the traffic is redirected from the client to the vote application containers. First, get the sandbox ID associated with the client container.

docker@manager1:~$ docker container inspect d38975856cfd | grep -i sandbox
            "SandboxID": "727bc5c9f5d0099c1c119e8faead35c9eee37159eeb8017bebd31c5e0f815944",
            "SandboxKey": "/var/run/docker/netns/727bc5c9f5d0",

Use this sandbox ID to look at the internals of network routing in docker swarm.

root@manager1:/var/run/docker/netns# ls
1-sh0h7as3pr  1-uhtur2cqne  727bc5c9f5d0  a557f907c3f2  default       ingress_sbox

Start a netshoot container in privileged mode with the docker netns directory mounted, and enter the client's network namespace.

docker@manager1:~$ docker run -it --rm -v /var/run/docker/netns:/var/run/docker/netns --privileged=true nicolaka/netshoot
/ # nsenter --net=/var/run/docker/netns/727bc5c9f5d0 sh

iptables shows the MARK that gets set on any traffic hitting IP address 10.0.0.4 (remember, this is the VIP assigned to the vote service from the overlay1 network); in this case the hex value 0x103 is set.

/ # iptables -t mangle -nvL
Chain PREROUTING (policy ACCEPT 60 packets, 23741 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain INPUT (policy ACCEPT 60 packets, 23741 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 105 packets, 6489 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 MARK       all  --  *      *       0.0.0.0/0            10.0.0.2             MARK set 0x101
   39  2484 MARK       all  --  *      *       0.0.0.0/0            10.0.0.4             MARK set 0x103

Chain POSTROUTING (policy ACCEPT 66 packets, 4005 bytes)
 pkts bytes target     prot opt in     out     source               destination

Converting the hex value 0x103 to decimal gives 259. In the ipvsadm output below, the entry for firewall mark (FWM) 259 has two backend IP addresses, one for each vote container.

/ # ipvsadm
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  257 rr
  -> 10.0.0.3:0                   Masq    1      0          0
FWM  259 rr
  -> 10.0.0.5:0                   Masq    1      0          0
  -> 10.0.0.6:0                   Masq    1      0          0
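
As a quick sanity check of the mark-to-FWM mapping (plain shell arithmetic, not part of the original capture):

/ # echo $((0x103))
259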

In this article, we have covered how the vote application is accessed using the client container; in the next article, we will cover how the same application is accessed from the host machine.

How to access the docker swarm service from the host machine

by Vikrant
December 25, 2017

In the previous article, we saw how to access the vote application using the client container. In this article we will access the same application using the host machine IP address.

When we publish a service port in ingress mode, the container port is exposed on the host machines: irrespective of which node the container is running on, the iptables rule for the host-to-container port mapping is added on all the nodes present in the cluster. “All” here refers to the general situation in which manager nodes also act as worker nodes. A quick way to see this routing mesh in action is sketched below.
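
A hedged sketch under the same setup (worker2 runs no vote task, yet the published port still answers there thanks to the routing mesh):

docker@worker2:~$ curl -s localhost:8080 | grep "container ID"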

Step 1 : Let’s check which iptables rule was added corresponding to the service creation.

Remember the command which we used for the service creation:

docker@manager1:~$ docker service create --mode replicated --replicas 2 --name vote --network overlay1 --publish mode=ingress,target=80,published=8080 instavote/vote

Corresponding to the above command, two containers were spun up, one on manager1 and the other on the worker1 node. The iptables rule is inserted on all the nodes present in the cluster.

docker@manager1:~$ sudo iptables -t nat -L | grep 8080
DNAT       tcp  --  anywhere             anywhere             tcp dpt:webcache to:172.18.0.2:8080

root@worker1:~# iptables -t nat -L | grep 8080
DNAT       tcp  --  anywhere             anywhere             tcp dpt:webcache to:172.18.0.2:8080

root@worker2:~# iptables -t nat -L | grep 8080
DNAT       tcp  --  anywhere             anywhere             tcp dpt:webcache to:172.18.0.2:8080

Step 2 : Let’s start a container and enter the ingress_sbox namespace.

docker@manager1:~$ docker run -it --rm -v /var/run/docker/netns:/var/run/docker/netns --privileged=true nicolaka/netshoot
/ # nsenter --net=/var/run/docker/netns/ingress_sbox sh
/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
9: eth0@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:ff:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.255.0.2/16 scope global eth0
       valid_lft forever preferred_lft forever
12: eth1@if13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 172.18.0.2/16 scope global eth1
       valid_lft forever preferred_lft forever

Checking the nat and mangle rules shows that any traffic destined for “10.255.0.4” is marked with “0x102”, which converted to decimal gives 258. IPVS then redirects the traffic to the container IP addresses from the ingress network.

/ # iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
DOCKER_OUTPUT  all  --  anywhere             127.0.0.11
DNAT       icmp --  anywhere             10.255.0.4           icmp echo-request to:127.0.0.1

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER_POSTROUTING  all  --  anywhere             127.0.0.11
SNAT       all  --  anywhere             10.255.0.0/16        ipvs to:10.255.0.2

Chain DOCKER_OUTPUT (1 references)
target     prot opt source               destination
DNAT       tcp  --  anywhere             127.0.0.11           tcp dpt:domain to:127.0.0.11:42792
DNAT       udp  --  anywhere             127.0.0.11           udp dpt:domain to:127.0.0.11:40831

Chain DOCKER_POSTROUTING (1 references)
target     prot opt source               destination
SNAT       tcp  --  127.0.0.11           anywhere             tcp spt:42792 to::53
SNAT       udp  --  127.0.0.11           anywhere             udp spt:40831 to::53
	   
	   
/ # iptables -t mangle -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
MARK       tcp  --  anywhere             anywhere             tcp dpt:http-alt MARK set 0x102

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
MARK       all  --  anywhere             10.255.0.4           MARK set 0x102

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination

/ # ipvsadm
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  258 rr
  -> 10.255.0.5:0                 Masq    1      0          0
  -> 10.255.0.6:0                 Masq    1      0          0	   

Step 3 : We can capture tcpdump on the ingress interface of the vote application container to prove the same while performing curl from the host machine.

docker@worker1:~$ curl localhost:8080

docker@manager1:~$ docker run -it --net container:1c06334e3589 nicolaka/netshoot
/ # tcpdump -s0 -i eth0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:50:25.616481 IP 10.255.0.3.56472 > 10.255.0.6.8080: Flags [S], seq 3646612060, win 43690, options [mss 65495,sackOK,TS val 3733770 ecr 0,nop,wscale 7], length 0
08:50:25.616528 ARP, Request who-has 10.255.0.3 tell 10.255.0.6, length 28
08:50:25.616543 ARP, Reply 10.255.0.3 is-at 02:42:0a:ff:00:03 (oui Unknown), length 28
08:50:25.616546 IP 10.255.0.6.8080 > 10.255.0.3.56472: Flags [S.], seq 3177214998, ack 3646612061, win 27960, options [mss 1410,sackOK,TS val 3745357 ecr 3733770,nop,wscale 7], length 0
08:50:25.617022 IP 10.255.0.3.56472 > 10.255.0.6.8080: Flags [.], ack 1, win 342, options [nop,nop,TS val 3733770 ecr 3745357], length 0
08:50:25.617338 IP 10.255.0.3.56472 > 10.255.0.6.8080: Flags [P.], seq 1:79, ack 1, win 342, options [nop,nop,TS val 3733770 ecr 3745357], length 78: HTTP: GET / HTTP/1.1
08:50:25.617354 IP 10.255.0.6.8080 > 10.255.0.3.56472: Flags [.], ack 79, win 219, options [nop,nop,TS val 3745357 ecr 3733770], length 0

How to create a docker swarm cluster

by Vikrant
December 23, 2017

In this article I am going to provide the steps to create a docker swarm setup using one manager and two worker nodes. This is my first article in the swarm series; in the rest of the articles I will mainly be focusing on the networking part of docker swarm.

A brief introduction to docker swarm: it’s an orchestrator, similar to Kubernetes, which is used to spin up containers on multiple nodes and manage the lifecycle of those containers.

Setup requirements

  • Windows/Linux/Mac machine
  • Oracle VirtualBox
  • docker-machine executable

We need to issue only a few commands to set up a docker swarm cluster.

Steps to create docker swarm cluster

Step 1 : Get the docker-machine binary (the uname calls fill in the release matching your platform).

# curl -L https://github.com/docker/machine/releases/download/v0.13.0/docker-machine-`uname -s`-`uname -m` >/usr/local/bin/docker-machine &&  chmod +x /usr/local/bin/docker-machine

Step 2 : First spin up the manager and worker nodes. As we are using one manager and two worker nodes in this case, we will issue the same command three times with different machine names.

As we are using VirtualBox to host the manager and worker nodes, the virtualbox driver is used.

$ docker-machine.exe create --driver virtualbox manager1
Running pre-create checks...
(manager1) Default Boot2Docker ISO is out-of-date, downloading the latest release...
(manager1) Latest release for github.com/boot2docker/boot2docker is v17.09.1-ce
(manager1) Downloading C:\Users\viaggarw\.docker\machine\cache\boot2docker.iso from https://github.com/boot2docker/boot2docker/releases/download/v17.09.1-ce/boot2docker.iso...
(manager1) 0%....10%....20%....30%....40%....50%....60%....70%....80%....90%....100%
Creating machine...
(manager1) Copying C:\Users\viaggarw\.docker\machine\cache\boot2docker.iso to C:\Users\viaggarw\.docker\machine\machines\manager1\boot2docker.iso...
(manager1) Creating VirtualBox VM...
(manager1) Creating SSH key...
(manager1) Starting the VM...
(manager1) Check network to re-create if needed...
(manager1) Windows might ask for the permission to configure a dhcp server. Sometimes, such confirmation window is minimized in the taskbar.
(manager1) Waiting for an IP...
Waiting for machine to be running, this may take a few minutes...
Detecting operating system of created instance...
Waiting for SSH to be available...
Detecting the provisioner...
Provisioning with boot2docker...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Checking connection to Docker...
Docker is up and running!
To see how to connect your Docker Client to the Docker Engine running on this virtual machine, run: C:\Users\viaggarw\bin\docker-machine.exe env manager1

As indicated in the above output, issue the following command to get information on how to connect your Docker client to the manager node.

$ docker-machine.exe env manager1
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://192.168.99.100:2376"
export DOCKER_CERT_PATH="C:\Users\viaggarw\.docker\machine\machines\manager1"
export DOCKER_MACHINE_NAME="manager1"
export COMPOSE_CONVERT_WINDOWS_PATHS="true"
# Run this command to configure your shell:
# eval $("C:\Users\viaggarw\bin\docker-machine.exe" env manager1)

Similarly we need to spin up the worker nodes. Outputs are truncated for brevity.

# docker-machine create --driver virtualbox worker1
# docker-machine create --driver virtualbox worker2

Verify that all nodes have come up successfully.

$ docker-machine.exe ls
NAME       ACTIVE   DRIVER       STATE     URL                         SWARM   DOCKER        ERRORS
manager1   -        virtualbox   Running   tcp://192.168.99.100:2376           v17.09.1-ce
worker1    -        virtualbox   Running   tcp://192.168.99.101:2376           v17.09.1-ce
worker2    -        virtualbox   Running   tcp://192.168.99.102:2376           v17.09.1-ce

Step 3 : Log in to the manager1 node to initialize the swarm cluster.

# docker-machine ssh manager1

Before initializing the cluster, first note the interfaces which are currently present on the machine.

docker@manager1:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0a:12:9b:d5:e3:17 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:91:de:a2 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe91:dea2/64 scope link
       valid_lft forever preferred_lft forever
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:28:f2:f8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.99.100/24 brd 192.168.99.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe28:f2f8/64 scope link
       valid_lft forever preferred_lft forever
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:61:2b:60:62 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever

Initialize the swarm cluster:

docker@manager1:~$ docker swarm init --advertise-addr 192.168.99.100
Swarm initialized: current node (xx648gzarpf33fruo8cqbg8dc) is now a manager.

To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-1icv5c0oj5p95syuuwiw0hh8ay4k7krrmp94u8urmis7gthylu-3y637fkgufrn3a84pnocgbug8 192.168.99.100:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.

After issuing the above command, you may notice that two new interfaces, docker_gwbridge and veth903e7a7@if12, have appeared.

docker@manager1:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0a:12:9b:d5:e3:17 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:91:de:a2 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe91:dea2/64 scope link
       valid_lft forever preferred_lft forever
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:28:f2:f8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.99.100/24 brd 192.168.99.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe28:f2f8/64 scope link
       valid_lft forever preferred_lft forever
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:61:2b:60:62 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
11: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:f3:1c:75:05 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 scope global docker_gwbridge
       valid_lft forever preferred_lft forever
    inet6 fe80::42:f3ff:fe1c:7505/64 scope link
       valid_lft forever preferred_lft forever
13: veth903e7a7@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP group default
    link/ether e2:eb:2f:41:b8:34 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::e0eb:2fff:fe41:b834/64 scope link
       valid_lft forever preferred_lft forever
$ exit	   

Step 4 : Our manager node is configured; it’s time to add the worker nodes to the swarm cluster. As indicated in the output of docker swarm init --advertise-addr 192.168.99.100, we need to issue the following command on each worker node to join it to the swarm cluster. Again, verify the interfaces present on the worker node both before and after joining the swarm cluster.

Interfaces present on the worker1 node before joining the swarm cluster:

$ docker-machine.exe ssh worker1
                        ##         .
                  ## ## ##        ==
               ## ## ## ## ##    ===
           /"""""""""""""""""\___/ ===
      ~~~ {~~ ~~~~ ~~~ ~~~~ ~~~ ~ /  ===- ~~~
           \______ o           __/
             \    \         __/
              \____\_______/
 _                 _   ____     _            _
| |__   ___   ___ | |_|___ \ __| | ___   ___| | _____ _ __
| '_ \ / _ \ / _ \| __| __) / _` |/ _ \ / __| |/ / _ \ '__|
| |_) | (_) | (_) | |_ / __/ (_| | (_) | (__|   <  __/ |
|_.__/ \___/ \___/ \__|_____\__,_|\___/ \___|_|\_\___|_|
Boot2Docker version 17.09.1-ce, build HEAD : e7de9ae - Fri Dec  8 19:41:36 UTC 2017
Docker version 17.09.1-ce, build 19e2cf6
docker@worker1:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 76:08:a0:a4:04:69 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:eb:6b:cf brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:feeb:6bcf/64 scope link
       valid_lft forever preferred_lft forever
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:c9:b5:11 brd ff:ff:ff:ff:ff:ff
    inet 192.168.99.101/24 brd 192.168.99.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fec9:b511/64 scope link
       valid_lft forever preferred_lft forever
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:2b:ac:36:32 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
docker@worker1:~$

Issue the command to join the worker node into the swarm cluster; again, two new interfaces appear on the worker node as well. The same command needs to be issued on the worker2 node to make it part of the swarm cluster.

docker@worker1:~$ docker swarm join --token SWMTKN-1-1icv5c0oj5p95syuuwiw0hh8ay4k7krrmp94u8urmis7gthylu-3y637fkgufrn3a84pnocgbug8 192.168.99.100:2377
This node joined a swarm as a worker.
docker@worker1:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 76:08:a0:a4:04:69 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:eb:6b:cf brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:feeb:6bcf/64 scope link
       valid_lft forever preferred_lft forever
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:c9:b5:11 brd ff:ff:ff:ff:ff:ff
    inet 192.168.99.101/24 brd 192.168.99.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fec9:b511/64 scope link
       valid_lft forever preferred_lft forever
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:2b:ac:36:32 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
11: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:b7:94:7c:66 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 scope global docker_gwbridge
       valid_lft forever preferred_lft forever
    inet6 fe80::42:b7ff:fe94:7c66/64 scope link tentative
       valid_lft forever preferred_lft forever
13: veth5025aab@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP group default
    link/ether 92:c6:0e:6e:75:b0 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::90c6:eff:fe6e:75b0/64 scope link
       valid_lft forever preferred_lft forever

How networking works in a multi-node docker swarm cluster

by Vikrant
December 23, 2017

In this article I am going to shed some light on multi-node docker networking in the context of docker swarm. In the previous article I provided the steps to create a docker swarm cluster; in this one we will focus on the networking part.

As you may have noticed, after initializing the swarm manager node two new interfaces, docker_gwbridge and veth903e7a7@if12, appeared in the ip a output. These two interfaces correspond to two new docker networks, docker_gwbridge and ingress, which appear in the following output.

docker@manager1:~$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
7b26301a39ae        bridge              bridge              local
689d9c67a71e        docker_gwbridge     bridge              local
61b3deedf21d        host                host                local
sh0h7as3prit        ingress             overlay             swarm
c6e3be0b1751        none                null                local

Let’s understand the usage of each of the networks present in the above output.

1) Name : bridge, Driver : bridge - When we spin up a container on a single node, this is the default bridge used to provide the IP address to the container. The docker0 interface corresponds to this network.

docker@manager1:~$ ip a show dev docker0
6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:61:2b:60:62 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever

If I simply spin up a container using the docker run command, it takes an IP address from the docker0 subnet range, as sketched below.
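
For example, a minimal sketch (the image name here is illustrative):

docker@manager1:~$ docker run -it --rm alpine sh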

/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
20: eth0@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 scope global eth0
       valid_lft forever preferred_lft forever

2) Name : docker_gwbridge, Driver : bridge - This network isn’t present in the default configuration; it appears after initializing the swarm cluster. It is a bridge network that connects overlay networks (including the ingress network) to an individual Docker daemon’s physical network.

docker@manager1:~$ ip a show dev docker_gwbridge
11: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:f3:1c:75:05 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 scope global docker_gwbridge
       valid_lft forever preferred_lft forever
    inet6 fe80::42:f3ff:fe1c:7505/64 scope link
       valid_lft forever preferred_lft forever

Create a service using the docker service command and check which nodes are running containers for this service. Note : I have given a replica count of 2 when running this service.

docker@manager1:~$ docker service create --replicas 2 --publish mode=ingress,target=80,published=8080 nginx
n61t0fc7hyxh5aypahwhy4159

docker@manager1:~$ docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE               PORTS
n61t0fc7hyxh        frosty_jepsen       replicated          2/2                 nginx:latest        *:8080->80/tcp

docker@manager1:~$ docker service ps frosty_jepsen
ID                  NAME                IMAGE               NODE                DESIRED STATE       CURRENT STATE            ERROR               PORTS
7mw7m5xvh660        frosty_jepsen.1     nginx:latest        worker1             Running             Running 9 seconds ago
rp6wmkcq5liv        frosty_jepsen.2     nginx:latest        manager1            Running             Running 26 seconds ago

Let’s check the IP addresses present on the container running on the manager1 node. It has taken two IP addresses: one from docker_gwbridge and another from the ingress range, which we have not discussed yet.

/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet 10.255.0.4/32 scope global lo
       valid_lft forever preferred_lft forever
16: eth0@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:ff:00:06 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.255.0.6/16 scope global eth0
       valid_lft forever preferred_lft forever
18: eth1@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet 172.18.0.3/16 scope global eth1
       valid_lft forever preferred_lft forever

3) Name : host, Driver : host - This is used to provide the host machine's network to the container. It’s not of much use normally, but as an example, let’s say you don’t want to install network troubleshooting packages on your host machine and want to spin up a container which provides those packages; in that scenario we can attach the container with the network tools to the host machine network.

The tcpdump package is not present on the manager1 node.

docker@manager1:~$ tcpdump

Spin up a container mapped to the host network. All the interfaces which were present on manager1 are now visible inside the netshoot container.

docker@manager1:~$ docker run -it --net host nicolaka/netshoot
/ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0a:12:9b:d5:e3:17 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:91:de:a2 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe91:dea2/64 scope link
       valid_lft forever preferred_lft forever
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:28:f2:f8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.99.100/24 brd 192.168.99.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe28:f2f8/64 scope link
       valid_lft forever preferred_lft forever
6: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:61:2b:60:62 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:61ff:fe2b:6062/64 scope link
       valid_lft forever preferred_lft forever
11: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:f3:1c:75:05 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 scope global docker_gwbridge
       valid_lft forever preferred_lft forever
    inet6 fe80::42:f3ff:fe1c:7505/64 scope link
       valid_lft forever preferred_lft forever
13: veth903e7a7@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP group default
    link/ether e2:eb:2f:41:b8:34 brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::e0eb:2fff:fe41:b834/64 scope link
       valid_lft forever preferred_lft forever
19: veth8d2dc83@if18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP group default
    link/ether 62:d8:e3:98:00:a2 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::60d8:e3ff:fe98:a2/64 scope link
       valid_lft forever preferred_lft forever
21: veth9d7aff1@if20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
    link/ether 1e:71:40:d0:ed:55 brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::1c71:40ff:fed0:ed55/64 scope link
       valid_lft forever preferred_lft forever

We can now use the network-related tools present inside the container.

/ # tcpdump
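
For instance (a hedged example; the interface and port are assumptions based on the nginx service we published earlier on 8080), one could capture traffic hitting the published port:

/ # tcpdump -i eth1 -nn port 8080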

Once troubleshooting is completed we can destroy the container; our host machine remains pristine.

4) Name: ingress Driver: overlay - This is used for multi-node communication. We have already seen in step 2 the IP address provided by the overlay network to the container, along with the IP address from the docker_gwbridge network. Basically, docker_gwbridge provides external connectivity to containers, and the overlay network provides connectivity between the containers.
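
The ingress network’s address range can be confirmed with docker network inspect. A hedged sketch; the exact subnet may differ per setup, but it should match the 10.255.0.x addresses we saw inside the container:

docker@manager1:~$ docker network inspect --format '{{json .IPAM.Config}}' ingress
[{"Subnet":"10.255.0.0/16","Gateway":"10.255.0.1"}]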

5) Name: none Driver: null - When we don’t want to attach any IP address or network to a container, we use the none driver to spin up the container.
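
A quick way to verify this behaviour (a minimal sketch; assumes the alpine image is pullable, and the exact busybox ip output may vary slightly): spin up a container with --net none and list its interfaces; only the loopback device should be present:

docker@manager1:~$ docker run --rm --net none alpine ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever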

How to use ovn-trace for troubleshooting openflows

by Vikrant
September 21, 2017

In my last post, I showed the usage of ovn-trace to trace the logical flows. Newer versions of ovs-ovn also include the “--ovs” option, which shows the openflows along with the logical flows. But as I am using ovs-ovn version 2.7, which doesn’t include the “--ovs” option, I am showing how to trace the openflows in this article.
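
For reference only (I can’t run this on my 2.7 setup), on ovs-ovn 2.8+ the combined trace would look something like the following, with the placeholders filled in the same way as in the examples below:

[root@controller ~(keystone_admin)]# ovn-trace --ovs <datapath-uuid> '<microflow>'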

Setup Info

Three instances are running in my setup. testinstance1 and testinstance2 are on the same subnet but running on different compute nodes.

[root@controller ~(keystone_admin)]# nova list --fields name,status,host,networks
+--------------------------------------+---------------+--------+----------+--------------------------------------+
| ID                                   | Name          | Status | Host     | Networks                             |
+--------------------------------------+---------------+--------+----------+--------------------------------------+
| 69736780-e0cc-46d4-a1f7-f0fac7e1cf54 | testinstance1 | ACTIVE | compute1 | internal1=10.10.10.4, 192.168.122.54 |
| 278b5a14-8ae6-4e91-870e-35f6230ed48a | testinstance2 | ACTIVE | compute2 | internal1=10.10.10.10                |
| 8683b0c2-6685-4aff-9549-c69311b57238 | testinstance3 | ACTIVE | compute2 | internal2=10.10.11.4                 |
+--------------------------------------+---------------+--------+----------+--------------------------------------+

Checking the interface MAC addresses for both instances.

[root@controller ~(keystone_admin)]# nova interface-list testinstance1
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
| Port State | Port ID                              | Net ID                               | IP addresses | MAC Addr          |
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
| ACTIVE     | 0bc5e22d-bd80-4cac-a9b3-51c0d0b284d1 | 89113f8b-bc01-46b1-84fb-edd5d606879c | 10.10.10.4   | fa:16:3e:55:52:80 |
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+

[root@controller ~(keystone_admin)]# nova interface-list testinstance2
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
| Port State | Port ID                              | Net ID                               | IP addresses | MAC Addr          |
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
| ACTIVE     | 84645ee6-8efa-435e-b93a-73cc173364ba | 89113f8b-bc01-46b1-84fb-edd5d606879c | 10.10.10.10  | fa:16:3e:ef:50:3e |
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+

Tracing the flow from compute1 to compute2.

compute1 (Node hosting source instance)

Step 1 : Check the port to which testinstance1 is connected on compute1.

[root@compute1 ~]# ovs-ofctl dump-ports-desc br-int
OFPST_PORT_DESC reply (xid=0x2):
 6(ovn-418009-0): addr:1e:ed:12:10:65:13
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 7(ovn-07e20a-0): addr:52:fd:4b:fb:01:7b
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 8(tap0bc5e22d-bd): addr:fe:16:3e:55:52:80             <<<< port8
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 9(patch-br-int-to): addr:06:58:63:01:95:4b
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-int): addr:16:1c:f5:2e:46:40
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max

Step 2 : Start the trace on the source compute node using an in_port value of 8, specifying the source and destination MAC addresses.

[root@compute1 ~]# ovs-appctl ofproto/trace br-int 'in_port=8,dl_src=fa:16:3e:55:52:80,dl_dst=fa:16:3e:ef:50:3e'
Flow: in_port=8,vlan_tci=0x0000,dl_src=fa:16:3e:55:52:80,dl_dst=fa:16:3e:ef:50:3e,dl_type=0x0000

bridge("br-int")
----------------
 0. in_port=8, priority 100
    set_field:0x1->reg13
    set_field:0x7->reg11
    set_field:0x4->reg12
    set_field:0x5->metadata
    set_field:0x2->reg14
    resubmit(,16)
16. reg14=0x2,metadata=0x5,dl_src=fa:16:3e:55:52:80, priority 50, cookie 0x446fb031
    resubmit(,17)
17. metadata=0x5, priority 0, cookie 0x6f65cadf
    resubmit(,18)
18. metadata=0x5, priority 0, cookie 0xbc120a28
    resubmit(,19)
19. metadata=0x5, priority 0, cookie 0xe2858a64
    resubmit(,20)
20. metadata=0x5, priority 0, cookie 0xa498a2d8
    resubmit(,21)
21. metadata=0x5, priority 0, cookie 0xaed27663
    resubmit(,22)
22. metadata=0x5, priority 0, cookie 0x3d43b8c1
    resubmit(,23)
23. metadata=0x5, priority 0, cookie 0x8d414703
    resubmit(,24)
24. metadata=0x5, priority 0, cookie 0x141e41a7
    resubmit(,25)
25. metadata=0x5, priority 0, cookie 0x3dc1b849
    resubmit(,26)
26. metadata=0x5, priority 0, cookie 0x8e786a4e
    resubmit(,27)
27. metadata=0x5, priority 0, cookie 0xd702291a
    resubmit(,28)
28. metadata=0x5, priority 0, cookie 0x2eb48ab4
    resubmit(,29)
29. metadata=0x5,dl_dst=fa:16:3e:ef:50:3e, priority 50, cookie 0xf62ef55f
    set_field:0x3->reg15
    resubmit(,32)
32. reg15=0x3,metadata=0x5, priority 100
    load:0x5->NXM_NX_TUN_ID[0..23]
    set_field:0x3->tun_metadata0
    move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30]
     -> NXM_NX_TUN_METADATA0[16..30] is now 0x2
    output:6
     -> output to kernel tunnel

Final flow: reg11=0x7,reg12=0x4,reg13=0x1,reg14=0x2,reg15=0x3,tun_id=0x5,metadata=0x5,in_port=8,vlan_tci=0x0000,dl_src=fa:16:3e:55:52:80,dl_dst=fa:16:3e:ef:50:3e,dl_type=0x0000
Megaflow: recirc_id=0,ct_state=-new-est-rel-rpl-inv-trk,ct_label=0/0x1,tun_id=0/0xffffff,tun_metadata0=NP,in_port=8,vlan_tci=0x0000/0x1000,dl_src=fa:16:3e:55:52:80,dl_dst=fa:16:3e:ef:50:3e,dl_type=0x0000
Datapath actions: set(tunnel(tun_id=0x5,dst=192.168.122.207,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x20003}),flags(df|csum|key))),2

It’s evident from the Datapath actions that the packet is tunneled through the geneve tunnel to the destination compute node with IP address 192.168.122.207.

set(tunnel(tun_id=0x5,dst=192.168.122.207,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x20003}),flags(df|csum|key))),2

What is tp_dst=6081? This is the geneve tunnel port; it corresponds to the genev_sys_6081 interface on the destination compute node.

[root@compute2 ~]# ip a show dev genev_sys_6081
6: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65470 qdisc noqueue master ovs-system state UNKNOWN qlen 1000
    link/ether ba:05:5b:b3:a3:92 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::b805:5bff:feb3:a392/64 scope link
       valid_lft forever preferred_lft forever

compute2 (Node hosting destination instance)

Step 1 : In an OVN based setup, br-tun is not present as a separate bridge; br-int takes care of encapsulation and decapsulation. The IP address of compute1 is 192.168.122.15, and from the following output we can see that port ovn-080677-0 is used as the tunnel endpoint on compute2.

[root@compute2 ~]# ovs-vsctl show
0fed4a0e-f49f-488c-8f33-b7a90b9cabd9
    Bridge br-ex
        fail_mode: standalone
        Port "patch-provnet-e411bbe8-e169-4268-b2bf-d5959d9d7260-to-br-int"
            Interface "patch-provnet-e411bbe8-e169-4268-b2bf-d5959d9d7260-to-br-int"
                type: patch
                options: {peer="patch-br-int-to-provnet-e411bbe8-e169-4268-b2bf-d5959d9d7260"}
        Port "ens3"
            Interface "ens3"
        Port br-ex
            Interface br-ex
                type: internal
    Bridge br-int
        fail_mode: secure
        Port "tap84645ee6-8e"
            Interface "tap84645ee6-8e"
        Port br-int
            Interface br-int
                type: internal
        Port "ovn-07e20a-0"
            Interface "ovn-07e20a-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.122.39"}
        Port "tapdf575f1c-92"
            Interface "tapdf575f1c-92"
        Port "patch-br-int-to-provnet-e411bbe8-e169-4268-b2bf-d5959d9d7260"
            Interface "patch-br-int-to-provnet-e411bbe8-e169-4268-b2bf-d5959d9d7260"
                type: patch
                options: {peer="patch-provnet-e411bbe8-e169-4268-b2bf-d5959d9d7260-to-br-int"}
        Port "ovn-080677-0"         
            Interface "ovn-080677-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.122.15"}
    ovs_version: "2.7.2"

Step 2 : Let’s find the port number of this decapsulation endpoint in the br-int bridge on compute2. It is attached as port 2 of the br-int bridge. Also, our destination instance is connected on port 7.

[root@compute2 ~]# ovs-ofctl dump-ports-desc br-ex
OFPST_PORT_DESC reply (xid=0x2):
 1(ens3): addr:52:54:00:74:a5:b1
     config:     0
     state:      0
     current:    100MB-FD AUTO_NEG
     advertised: 10MB-HD 10MB-FD 100MB-HD 100MB-FD COPPER AUTO_NEG AUTO_PAUSE
     supported:  10MB-HD 10MB-FD 100MB-HD 100MB-FD COPPER AUTO_NEG
     speed: 100 Mbps now, 100 Mbps max
 3(patch-provnet-e): addr:f2:5e:6b:eb:a8:8b
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-ex): addr:52:54:00:74:a5:b1
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max

[root@compute2 ~]# ovs-ofctl dump-ports-desc br-int
OFPST_PORT_DESC reply (xid=0x2):
 2(ovn-080677-0): addr:7a:a3:da:49:a0:59
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 6(ovn-07e20a-0): addr:42:33:99:83:fa:16
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 7(tap84645ee6-8e): addr:fe:16:3e:ef:50:3e
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 8(patch-br-int-to): addr:5e:92:33:e2:41:36
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 9(tapdf575f1c-92): addr:fe:16:3e:12:69:18
     config:     0
     state:      0
     current:    10MB-FD COPPER
     speed: 10 Mbps now, 0 Mbps max
 LOCAL(br-int): addr:f2:26:c3:c2:17:45
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max	 	   

Step 3 : Checking the openflow rules on br-int: any traffic entering on port 2 is redirected to table 33.

[root@compute2 ~]# ovs-ofctl dump-flows br-int | grep in_port
 cookie=0x0, duration=520954.078s, table=0, n_packets=99505, n_bytes=9751490, idle_age=0, hard_age=65534, priority=100,in_port=2 actions=move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],move:NXM_NX_TUN_METADATA0[16..30]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_METADATA0[0..15]->NXM_NX_REG15[0..15],resubmit(,33)
 cookie=0x0, duration=326970.620s, table=0, n_packets=3320, n_bytes=325360, idle_age=65534, hard_age=65534, priority=100,in_port=6 actions=move:NXM_NX_TUN_ID[0..23]->OXM_OF_METADATA[0..23],move:NXM_NX_TUN_METADATA0[16..30]->NXM_NX_REG14[0..14],move:NXM_NX_TUN_METADATA0[0..15]->NXM_NX_REG15[0..15],resubmit(,33)
 cookie=0x0, duration=49546.114s, table=0, n_packets=3209, n_bytes=309122, idle_age=0, priority=100,in_port=7 actions=load:0x1->NXM_NX_REG13[],load:0x7->NXM_NX_REG11[],load:0x4->NXM_NX_REG12[],load:0x5->OXM_OF_METADATA[],load:0x3->NXM_NX_REG14[],resubmit(,16)
 cookie=0x0, duration=48811.570s, table=0, n_packets=14, n_bytes=2060, idle_age=5598, priority=100,in_port=9 actions=load:0xb->NXM_NX_REG13[],load:0xa->NXM_NX_REG11[],load:0x9->NXM_NX_REG12[],load:0x1->OXM_OF_METADATA[],load:0x2->NXM_NX_REG14[],resubmit(,16)
 cookie=0x0, duration=49546.104s, table=0, n_packets=476, n_bytes=54048, idle_age=49, priority=100,in_port=8,vlan_tci=0x0000/0x1000 actions=load:0x8->NXM_NX_REG13[],load:0x2->NXM_NX_REG11[],load:0x5->NXM_NX_REG12[],load:0x6->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],resubmit(,16)
 cookie=0x0, duration=49546.104s, table=0, n_packets=0, n_bytes=0, idle_age=49546, priority=100,in_port=8,dl_vlan=0 actions=strip_vlan,load:0x8->NXM_NX_REG13[],load:0x2->NXM_NX_REG11[],load:0x5->NXM_NX_REG12[],load:0x6->OXM_OF_METADATA[],load:0x1->NXM_NX_REG14[],resubmit(,16)

Finally, from table 65 the packet reaches port 7, to which our destination instance is connected.

[root@compute2 ~]# ovs-ofctl dump-flows br-int | grep output:7
 cookie=0x0, duration=47193.887s, table=65, n_packets=772, n_bytes=74816, idle_age=0, priority=100,reg15=0x3,metadata=0x5 actions=output:7

Similarly, the return traffic from the compute2 instance to the compute1 instance can be verified using the following command:

[root@compute2 ~]# ovs-appctl ofproto/trace br-int 'in_port=7,dl_src=fa:16:3e:ef:50:3e,dl_dst=fa:16:3e:55:52:80' | grep Datapath
Datapath actions: set(tunnel(tun_id=0x5,dst=192.168.122.15,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002}),flags(df|csum|key))),4
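
To complete the picture, the final delivery on compute1 can be verified the same way we did on compute2: look for the table=65 rule that outputs to port 8, the tap port of testinstance1 from Step 1 (a hedged sketch; the exact flow fields will differ):

[root@compute1 ~]# ovs-ofctl dump-flows br-int | grep output:8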

How to use ovn-trace to troubleshoot logical flows

by Vikrant
September 20, 2017

In case of OVN, we are talking about logical flows, and it’s difficult to go through the flows manually to understand what is happening. The OVN developers understood this pain, and they created the ovn-trace utility to simulate the traffic flow for virtual packets.

ovn-trace can be used to trace both L2 and L3 behavior of traffic.

In case of L2, it requires the inport, source MAC, and destination MAC to trace the flow. In case of L3, it requires the inport, source MAC, source IP address, destination MAC, and destination IP address to trace the flow.
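
Schematically, the two invocations look like this (a sketch with placeholders; the logical switch UUID, port UUIDs, and addresses come from ovn-nbctl show, and the ip.ttl clause is set so the packet is not treated as expired):

# L2 trace
ovn-trace <logical-switch-uuid> 'inport=="<port-uuid>" && eth.src == <src-mac> && eth.dst == <dst-mac>'

# L3 trace
ovn-trace <logical-switch-uuid> 'inport=="<port-uuid>" && eth.src == <src-mac> && ip4.src == <src-ip> && eth.dst == <dst-mac> && ip4.dst == <dst-ip> && ip.ttl == 32'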

I am using ovs-ovn version 2.7, which has some issues with L3 flow tracing. I brought this issue up on the ovs-discuss list and came to know that these are known issues which are fixed in the latest version of ovs-ovn, i.e. 2.8.

Example : L2 logical flow trace.

To show this example, I spawned two instances using the same network, but these instances are running on different compute nodes.

[root@controller ~(keystone_admin)]# nova list --fields name,status,host,networks
+--------------------------------------+---------------+--------+----------+--------------------------------------+
| ID                                   | Name          | Status | Host     | Networks                             |
+--------------------------------------+---------------+--------+----------+--------------------------------------+
| 69736780-e0cc-46d4-a1f7-f0fac7e1cf54 | testinstance1 | ACTIVE | compute1 | internal1=10.10.10.4, 192.168.122.54 |
| 278b5a14-8ae6-4e91-870e-35f6230ed48a | testinstance2 | ACTIVE | compute2 | internal1=10.10.10.10                |
+--------------------------------------+---------------+--------+----------+--------------------------------------+

[root@controller ~(keystone_admin)]# ovn-nbctl show
    switch 0d413d9c-7f23-4ace-9a8a-29817b3b33b5 (neutron-89113f8b-bc01-46b1-84fb-edd5d606879c)
        port 397c019e-9bc3-49d3-ac4c-4aeeb1b3ba3e
            addresses: ["router"]
        port 84645ee6-8efa-435e-b93a-73cc173364ba
            addresses: ["fa:16:3e:ef:50:3e 10.10.10.10"]
        port 0bc5e22d-bd80-4cac-a9b3-51c0d0b284d1
            addresses: ["fa:16:3e:55:52:80 10.10.10.4"]
    switch 1ec08997-0899-40d1-9b74-0a25ef476c00 (neutron-e411bbe8-e169-4268-b2bf-d5959d9d7260)
        port provnet-e411bbe8-e169-4268-b2bf-d5959d9d7260
            addresses: ["unknown"]
        port b95e9ae7-5c91-4037-8d2c-660d4af00974
            addresses: ["router"]
    router 7418a4e7-abff-4af7-85f5-6eea2ede9bea (neutron-67dc2e78-e109-4dac-acce-b71b2c944dc1)
        port lrp-b95e9ae7-5c91-4037-8d2c-660d4af00974
            mac: "fa:16:3e:52:20:7c"
            networks: ["192.168.122.50/24"]
        port lrp-397c019e-9bc3-49d3-ac4c-4aeeb1b3ba3e
            mac: "fa:16:3e:87:28:40"
            networks: ["10.10.10.1/24"]

Tracing the traffic from “10.10.10.10” to “10.10.10.4”. I am using the MAC addresses of the ports along with the inport information of the source port.

Here the input port belongs to the instance “10.10.10.10”. The output port “0bc5e22d-bd80-4cac-a9b3-51c0d0b284d1” is shown automatically in the egress section. If you look closely, the destination port is determined in the ingress rules themselves.

[root@controller ~(keystone_admin)]# ovn-trace 0d413d9c-7f23-4ace-9a8a-29817b3b33b5 'inport=="84645ee6-8efa-435e-b93a-73cc173364ba" && eth.src == fa:16:3e:ef:50:3e && eth.dst == fa:16:3e:55:52:80'
# reg14=0x3,vlan_tci=0x0000,dl_src=fa:16:3e:ef:50:3e,dl_dst=fa:16:3e:55:52:80,dl_type=0x0000

ingress(dp="neutron-89113f8b-bc01-46b1-84fb-edd5d606879c", inport="84645ee6-8efa-435e-b93a-73cc173364ba")
---------------------------------------------------------------------------------------------------------
 0. ls_in_port_sec_l2 (ovn-northd.c:2979): inport == "84645ee6-8efa-435e-b93a-73cc173364ba" && eth.src == {fa:16:3e:ef:50:3e}, priority 50, uuid 1ebfeeab
    next;
13. ls_in_l2_lkup (ovn-northd.c:3274): eth.dst == fa:16:3e:55:52:80, priority 50, uuid 8b07b8bf
    outport = "0bc5e22d-bd80-4cac-a9b3-51c0d0b284d1";
    output;

egress(dp="neutron-89113f8b-bc01-46b1-84fb-edd5d606879c", inport="84645ee6-8efa-435e-b93a-73cc173364ba", outport="0bc5e22d-bd80-4cac-a9b3-51c0d0b284d1")
--------------------------------------------------------------------------------------------------------------------------------------------------------
 8. ls_out_port_sec_l2 (ovn-northd.c:3399): outport == "0bc5e22d-bd80-4cac-a9b3-51c0d0b284d1" && eth.dst == {fa:16:3e:55:52:80}, priority 50, uuid 09012b8e
    output;
    /* output to "0bc5e22d-bd80-4cac-a9b3-51c0d0b284d1", type "" */

Example : L3 logical flow trace

I created another network, attached that network to the router as a port, and spawned a new instance using that network.

[root@controller ~(keystone_admin)]# nova list --fields name,status,host,networks
+--------------------------------------+---------------+--------+----------+--------------------------------------+
| ID                                   | Name          | Status | Host     | Networks                             |
+--------------------------------------+---------------+--------+----------+--------------------------------------+
| 69736780-e0cc-46d4-a1f7-f0fac7e1cf54 | testinstance1 | ACTIVE | compute1 | internal1=10.10.10.4, 192.168.122.54 |
| 278b5a14-8ae6-4e91-870e-35f6230ed48a | testinstance2 | ACTIVE | compute2 | internal1=10.10.10.10                |
| 8683b0c2-6685-4aff-9549-c69311b57238 | testinstance3 | ACTIVE | compute2 | internal2=10.10.11.4                 |
+--------------------------------------+---------------+--------+----------+--------------------------------------+

[root@controller ~(keystone_admin)]# ovn-nbctl show
    switch 0d413d9c-7f23-4ace-9a8a-29817b3b33b5 (neutron-89113f8b-bc01-46b1-84fb-edd5d606879c)
        port 397c019e-9bc3-49d3-ac4c-4aeeb1b3ba3e
            addresses: ["router"]
        port 84645ee6-8efa-435e-b93a-73cc173364ba
            addresses: ["fa:16:3e:ef:50:3e 10.10.10.10"]
        port 0bc5e22d-bd80-4cac-a9b3-51c0d0b284d1
            addresses: ["fa:16:3e:55:52:80 10.10.10.4"]
    switch f12c50d5-1dad-4e68-9c04-89d4732946a2 (neutron-7a2cf4c3-1476-4a86-8757-8102ec511362)
        port 8db628d6-cf39-4166-bae6-715e71e5a6f5
            addresses: ["router"]
        port df575f1c-9282-4a94-a490-3e570ca02429
            addresses: ["fa:16:3e:12:69:18 10.10.11.4"]
    switch 1ec08997-0899-40d1-9b74-0a25ef476c00 (neutron-e411bbe8-e169-4268-b2bf-d5959d9d7260)
        port provnet-e411bbe8-e169-4268-b2bf-d5959d9d7260
            addresses: ["unknown"]
        port b95e9ae7-5c91-4037-8d2c-660d4af00974
            addresses: ["router"]
    router 7418a4e7-abff-4af7-85f5-6eea2ede9bea (neutron-67dc2e78-e109-4dac-acce-b71b2c944dc1)
        port lrp-b95e9ae7-5c91-4037-8d2c-660d4af00974
            mac: "fa:16:3e:52:20:7c"
            networks: ["192.168.122.50/24"]
        port lrp-8db628d6-cf39-4166-bae6-715e71e5a6f5
            mac: "fa:16:3e:27:66:8f"
            networks: ["10.10.11.1/24"]
        port lrp-397c019e-9bc3-49d3-ac4c-4aeeb1b3ba3e
            mac: "fa:16:3e:87:28:40"
            networks: ["10.10.10.1/24"]

Tracing the traffic from “10.10.11.4” to “10.10.10.4”; both instances are running on different compute nodes. As I indicated earlier, the version which I am using has a bug due to which it’s not able to trace the logical flow successfully.

[root@controller ~(keystone_admin)]# ovn-trace f12c50d5-1dad-4e68-9c04-89d4732946a2 'inport=="df575f1c-9282-4a94-a490-3e570ca02429" && eth.src == fa:16:3e:12:69:18 && ip4.src == 10.10.11.4 && eth.dst == fa:16:3e:55:52:80 && ip4.dst == 10.10.10.4 && ip.ttl == 32'
# ip,reg14=0x2,vlan_tci=0x0000,dl_src=fa:16:3e:12:69:18,dl_dst=fa:16:3e:55:52:80,nw_src=10.10.11.4,nw_dst=10.10.10.4,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=32

ingress(dp="neutron-7a2cf4c3-1476-4a86-8757-8102ec511362", inport="df575f1c-9282-4a94-a490-3e570ca02429")
---------------------------------------------------------------------------------------------------------
 0. ls_in_port_sec_l2 (ovn-northd.c:2979): inport == "df575f1c-9282-4a94-a490-3e570ca02429" && eth.src == {fa:16:3e:12:69:18}, priority 50, uuid 0de9048a
    next;
 1. ls_in_port_sec_ip (ovn-northd.c:2113): inport == "df575f1c-9282-4a94-a490-3e570ca02429" && eth.src == fa:16:3e:12:69:18 && ip4.src == {10.10.11.4}, priority 90, uuid 8d1c26a6
    next;
 3. ls_in_pre_acl (ovn-northd.c:2397): ip, priority 100, uuid 1b768b1b
    reg0[0] = 1;
    next;
 5. ls_in_pre_stateful (ovn-northd.c:2515): reg0[0] == 1, priority 100, uuid 39f7b20b
    ct_next;
    *** ct_* actions not implemented

This bug also applies when both instances are running on the same compute node, like testinstance2 and testinstance3.

[root@controller ~(keystone_admin)]# ovn-trace f12c50d5-1dad-4e68-9c04-89d4732946a2 'inport=="df575f1c-9282-4a94-a490-3e570ca02429" && eth.src == fa:16:3e:12:69:18 && ip4.src == 10.10.11.4 && eth.dst == fa:16:3e:ef:50:3e && ip4.dst == 10.10.10.10 && ip.ttl == 32'
# ip,reg14=0x2,vlan_tci=0x0000,dl_src=fa:16:3e:12:69:18,dl_dst=fa:16:3e:ef:50:3e,nw_src=10.10.11.4,nw_dst=10.10.10.10,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=32

ingress(dp="neutron-7a2cf4c3-1476-4a86-8757-8102ec511362", inport="df575f1c-9282-4a94-a490-3e570ca02429")
---------------------------------------------------------------------------------------------------------
 0. ls_in_port_sec_l2 (ovn-northd.c:2979): inport == "df575f1c-9282-4a94-a490-3e570ca02429" && eth.src == {fa:16:3e:12:69:18}, priority 50, uuid 0de9048a
    next;
 1. ls_in_port_sec_ip (ovn-northd.c:2113): inport == "df575f1c-9282-4a94-a490-3e570ca02429" && eth.src == fa:16:3e:12:69:18 && ip4.src == {10.10.11.4}, priority 90, uuid 8d1c26a6
    next;
 3. ls_in_pre_acl (ovn-northd.c:2397): ip, priority 100, uuid 1b768b1b
    reg0[0] = 1;
    next;
 5. ls_in_pre_stateful (ovn-northd.c:2515): reg0[0] == 1, priority 100, uuid 39f7b20b
    ct_next;
    *** ct_* actions not implemented

How to use OVN as a mechanism driver in openstack packstack?

by Vikrant
September 20, 2017

In this article, I am going to show a packstack deployment using OVN as the mechanism driver. Pike is the first release which introduced support for OVN in the packstack deployment tool. It requires just a few changes in the answer.txt file to make the deployment successful with OVN as the mechanism driver.

The question arises: what is OVN? OVN stands for Open Virtual Network. It’s a replacement for the OVS (Openvswitch) mechanism driver; basically, OVN is an extended form of OVS which takes care of L2 and L3 packet flows using openflows only, which was not possible with OVS without network namespaces. Openflows supporting L3 traffic are present on all compute nodes, which means that instances with a floating IP have direct connectivity to the external network from the compute node itself instead of traversing through the controller nodes, similar to DVR. But in the case of DVR, namespaces were still coming into the picture, which increases the number of hops in the network path. Also, in DVR, if the instance does not have a floating IP address then traffic has to go through the controller nodes because NAT happens there; in OVN a new distributed gateway concept is introduced which makes this operation also happen on one of the compute nodes instead of the controller node. Note: the distributed GW is not running on all compute nodes; only a few of them can have it. The scheduler decides on which compute node the distributed gateway should be placed.

Let’s see what changes we need to make in the Openstack packstack answer.txt file to do the deployment using OVN as the mechanism driver.

Step 1 : I have done the deployment using one controller and two compute nodes. In the following example, I am showing only the OVN related changes in the answer.txt file for the sake of brevity.

[root@controller ~]# grep -i ovn /root/answer.txt | grep -v "#"
CONFIG_NEUTRON_ML2_MECHANISM_DRIVERS=ovn
CONFIG_NEUTRON_L2_AGENT=ovn
CONFIG_NEUTRON_OVN_BRIDGE_MAPPINGS=extnet:br-ex
CONFIG_NEUTRON_OVN_BRIDGE_IFACES=br-ex:ens3
CONFIG_NEUTRON_OVN_BRIDGES_COMPUTE=br-ex
CONFIG_NEUTRON_OVN_EXTERNAL_PHYSNET=extnet
CONFIG_NEUTRON_OVN_TUNNEL_IF=
CONFIG_NEUTRON_OVN_TUNNEL_SUBNETS=

One more change is required to support the new tunnel type, geneve.

[root@controller ~(keystone_admin)]# grep 'geneve' /root/answer.txt | grep -v "#"
CONFIG_NEUTRON_ML2_TYPE_DRIVERS=vxlan,flat,geneve
CONFIG_NEUTRON_ML2_TENANT_NETWORK_TYPES=vxlan,geneve

Step 2 : Let’s check which version of OVN is installed after the successful packstack deployment. It’s ovs-ovn 2.7.

[root@controller ~(keystone_admin)]# rpm -qa | grep -i ovn
openstack-nova-novncproxy-16.0.0-1.el7.noarch
openvswitch-ovn-common-2.7.2-3.1fc27.el7.x86_64
openvswitch-ovn-host-2.7.2-3.1fc27.el7.x86_64
novnc-0.5.1-2.el7.noarch
puppet-ovn-11.3.0-1.el7.noarch
openvswitch-ovn-central-2.7.2-3.1fc27.el7.x86_64
python-networking-ovn-3.0.0-1.el7.noarch

Verify that the mechanism driver is set to ovn.

[root@controller ~(keystone_admin)]# grep ovn /etc/neutron/plugins/ml2/ml2_conf.ini
mechanism_drivers=ovn
[ovn]
ovn_nb_connection=tcp:192.168.122.39:6641
ovn_sb_connection=tcp:192.168.122.39:6642
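
Since the northbound and southbound databases are listening on TCP, they can also be queried remotely from any node; a hedged example (assuming the port above is reachable):

[root@compute1 ~]# ovn-nbctl --db=tcp:192.168.122.39:6641 show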

Everything is commented out in the networking-ovn.ini file by default.

[root@controller ~(keystone_admin)]# ll /etc/neutron/plugins/networking-ovn/networking-ovn.ini
-rw-r-----. 1 root neutron 3826 Aug 30 07:50 /etc/neutron/plugins/networking-ovn/networking-ovn.ini
[root@controller ~(keystone_admin)]# egrep -v "^(#|$)" /etc/neutron/plugins/networking-ovn/networking-ovn.ini
[DEFAULT]
[ovn]
[root@controller ~(keystone_admin)]#

Step 3 : Create a test internal network, external network, and router using the usual neutron commands. Add the internal network as a port on the router and set the external network as the gateway for the router.

[root@controller ~(keystone_admin)]# openstack network list
+--------------------------------------+------------------+--------------------------------------+
| ID                                   | Name             | Subnets                              |
+--------------------------------------+------------------+--------------------------------------+
| 89113f8b-bc01-46b1-84fb-edd5d606879c | internal1        | 2936931e-0a8d-43e8-bab2-b2be05feddfe |
| e411bbe8-e169-4268-b2bf-d5959d9d7260 | external_network | 9cc2b26f-3c3e-41f9-be25-e99ac514e2b9 |
+--------------------------------------+------------------+--------------------------------------+
[root@controller ~(keystone_admin)]# openstack router list
+--------------------------------------+---------+--------+-------+-------------+-------+----------------------------------+
| ID                                   | Name    | Status | State | Distributed | HA    | Project                          |
+--------------------------------------+---------+--------+-------+-------------+-------+----------------------------------+
| 67dc2e78-e109-4dac-acce-b71b2c944dc1 | router1 | ACTIVE | UP    | False       | False | a1ba67a2a9b84e4ab5746edcec40dba4 |
+--------------------------------------+---------+--------+-------+-------------+-------+----------------------------------+

Verify the logical OVN topology using the following command:

[root@controller ~(keystone_admin)]# ovn-nbctl show
    switch 0d413d9c-7f23-4ace-9a8a-29817b3b33b5 (neutron-89113f8b-bc01-46b1-84fb-edd5d606879c)
        port 6fe3cab5-5f84-44c8-90f2-64c21b489c62
            addresses: ["fa:16:3e:fa:d6:d3 10.10.10.9"]
        port 397c019e-9bc3-49d3-ac4c-4aeeb1b3ba3e
            addresses: ["router"]
    switch 1ec08997-0899-40d1-9b74-0a25ef476c00 (neutron-e411bbe8-e169-4268-b2bf-d5959d9d7260)
        port provnet-e411bbe8-e169-4268-b2bf-d5959d9d7260
            addresses: ["unknown"]
        port b95e9ae7-5c91-4037-8d2c-660d4af00974
            addresses: ["router"]
    router 7418a4e7-abff-4af7-85f5-6eea2ede9bea (neutron-67dc2e78-e109-4dac-acce-b71b2c944dc1)
        port lrp-b95e9ae7-5c91-4037-8d2c-660d4af00974
            mac: "fa:16:3e:52:20:7c"
            networks: ["192.168.122.50/24"]
        port lrp-397c019e-9bc3-49d3-ac4c-4aeeb1b3ba3e
            mac: "fa:16:3e:87:28:40"
            networks: ["10.10.10.1/24"]

Two logical switches, corresponding to internal1 and external_network, are connected through the router. The router port is created with the first IP address from the subnet range, and this IP will be used for NAT for the instances which don’t have a floating IP associated with them.
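
If you want to see the SNAT entry backing this, the rows of the NAT table in the northbound DB can be dumped with the generic database command (a hedged example; output elided as it depends on the setup):

[root@controller ~(keystone_admin)]# ovn-nbctl list NAT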

Let’s check out the differences present when using ovn as the mechanism driver instead of ovs (openvswitch).

  • First of all, private networks are created with the new tunnel type (geneve) instead of vxlan.
[root@controller ~(keystone_admin)]# openstack network show internal1
+---------------------------+--------------------------------------+
| Field                     | Value                                |
+---------------------------+--------------------------------------+
| admin_state_up            | UP                                   |
| availability_zone_hints   |                                      |
| availability_zones        |                                      |
| created_at                | 2017-09-03T17:04:31Z                 |
| description               |                                      |
| dns_domain                | None                                 |
| id                        | 89113f8b-bc01-46b1-84fb-edd5d606879c |
| ipv4_address_scope        | None                                 |
| ipv6_address_scope        | None                                 |
| is_default                | None                                 |
| is_vlan_transparent       | None                                 |
| mtu                       | 1442                                 |
| name                      | internal1                            |
| port_security_enabled     | True                                 |
| project_id                | a1ba67a2a9b84e4ab5746edcec40dba4     |
| provider:network_type     | geneve                               |
| provider:physical_network | None                                 |
| provider:segmentation_id  | 63                                   |
| qos_policy_id             | None                                 |
| revision_number           | 3                                    |
| router:external           | Internal                             |
| segments                  | None                                 |
| shared                    | False                                |
| status                    | ACTIVE                               |
| subnets                   | 2936931e-0a8d-43e8-bab2-b2be05feddfe |
| tags                      |                                      |
| updated_at                | 2017-09-03T17:04:39Z                 |
+---------------------------+--------------------------------------+
  • No network namespaces are present corresponding to the created networks. That means you cannot connect from a network namespace to instances which don’t have a floating IP address. But this feature was reintroduced in later versions.
[root@controller ~(keystone_admin)]# ip netns list
[root@controller ~(keystone_admin)]#
  • What about neutron agent-list? That’s also gone.
[root@controller ~(keystone_admin)]# neutron agent-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.

[root@controller ~(keystone_admin)]#

In the next post, we will discuss the architecture of OVN, which will help us understand why these things are missing from OVN.

Understand OVN architecture

by Vikrant
September 20, 2017

In the previous post, we saw the procedure to use OVN as the mechanism driver in a packstack installation. In this article, my main focus is to talk about the OVN architecture.

I would suggest some great reads on this topic, given in the References section.

OVN is all about logical flows, which are present on the controller nodes in the southbound DB. These logical flows get converted into Openvswitch openflows and stored on every compute node. This translation happens via the ovn-controller process, which runs on all compute nodes. If you are doing a packstack based installation following my previous post, then ovn-controller is also running on the controller node along with the compute nodes, because in packstack there is no way to avoid running ovn-controller on the controller node, or at least I didn’t find any. This means that the controller node is also a candidate on which your distributed GW can run.
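
Both layers can be inspected directly. A hedged pair of commands (output elided for brevity; depending on the version, dump-flows is a synonym of lflow-list) to compare the logical flows in the southbound DB with the openflows programmed into br-int on a compute node:

[root@controller ~(keystone_admin)]# ovn-sbctl lflow-list | head
[root@compute1 ~]# ovs-ofctl dump-flows br-int | head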

Following are the OVN related services running on openstack nodes:

Controller Node: As I mentioned earlier, in the ideal case ovn-controller shouldn’t be running on the controller node.

[root@controller ~(keystone_admin)]# ps -ef | grep -w ovn | grep -v grep
root       949     1  0 Sep17 ?        00:00:00 ovn-controller: monitoring pid 950 (healthy)
root       950   949  0 Sep17 ?        00:34:39 ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --no-chdir --log-file=/var/log/openvswitch/ovn-controller.log --pidfile=/var/run/openvswitch/ovn-controller.pid --detach --monitor
root       999     1  0 Sep17 ?        00:00:00 ovn-northd: monitoring pid 1000 (healthy)
root      1000   999  0 Sep17 ?        00:00:00 ovn-northd -vconsole:emer -vsyslog:err -vfile:info --ovnnb-db=unix:/run/openvswitch/ovnnb_db.sock --ovnsb-db=unix:/run/openvswitch/ovnsb_db.sock --no-chdir --log-file=/var/log/openvswitch/ovn-northd.log --pidfile=/run/openvswitch/ovn-northd.pid --detach --monitor

Compute Node: Only ovn-controller is running on the compute nodes.

[root@compute1 ~]# ps -ef | grep -i ovn | grep -v grep
root       895     1  0 Sep13 ?        00:00:00 ovn-controller: monitoring pid 896 (healthy)
root       896   895  0 Sep13 ?        01:10:00 ovn-controller unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --no-chdir --log-file=/var/log/openvswitch/ovn-controller.log --pidfile=/var/run/openvswitch/ovn-controller.pid --detach --monitor

Purpose of each service:

ovn-northd = It translates the logical topology into logical flows. It’s important to note that logical flows are different from openflows. All translated information is stored in the Southbound DB.

ovn-controller = Runs on all compute nodes; it translates the logical flows into OpenFlow rules, which are then programmed into the local OVS instance.

When OVS is used as the mechanism driver, openflows were only responsible for handling the L2 traffic; now, in the case of OVN, openflow rules are also used to make decisions about L3 traffic.

The whole objective of OVN was to remove the network namespaces which were coming in the traffic flow path.

I will dig more into the flows in coming posts.

References:

[1] http://networkop.co.uk/blog/2016/11/27/ovn-part1/
[2] http://galsagie.github.io/2015/04/20/ovn-1/
[3] https://blog.russellbryant.net/2015/04/08/ovn-and-openstack-integration-development-update/