Data Center Host to ToR Architecture

This chapter discusses the various architectures and strategies available from the top of rack (ToR) switches all the way down to the server hosts.

Layer 2 - Traditional Spanning Tree - Single Attached

Summary
Bonds (EtherChannel) are not configured from the host to multiple switches (a bond can still exist, but only to one switch at a time), so leaf01 and leaf02 see two different MAC addresses.

Benefits
  • Established technology: interoperability with other vendors, easy configuration, and extensive documentation from multiple vendors and the industry
  • Ability to use spanning tree commands: PortAdminEdge and BPDU guard
  • Layer 2 reachability to all VMs

Considerations
  • The load balancing mechanism on the host can cause problems. If there is only host pinning to each NIC, there are no problems, but if you have a bond, you need to look at an MLAG solution.
  • No active-active host links. Some operating systems allow HA (NIC failover), but this still does not utilize all the bandwidth. VMs use one NIC, not two.

Active-Active Mode: None (not possible with traditional spanning tree)
Active-Passive Mode: VRR
L2 to L3 Demarcation:
  • ToR layer (recommended)
  • Spine layer
  • Core/edge/exit

You can configure VRR on a pair of switches at any level in the network. However, the higher up the network, the larger the layer 2 domain becomes. The benefit is layer 2 reachability. The drawback is that the layer 2 domain is more difficult to troubleshoot, does not scale as well, and the pair of switches running VRR needs to carry the entire MAC address table of everything below it in the network. Cumulus Professional Services recommends minimizing the layer 2 domain as much as possible. For more information, see this presentation.
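
As a rough sketch of what VRR adds on a switch pair, the fragment below extends an SVI such as bridge.10 with an address-virtual line. The virtual MAC address, the 10.1.10.1 gateway address, and the peer address mentioned in the comments are illustrative assumptions, not values taken from this chapter.

```
# /etc/network/interfaces on one switch of the VRR pair (illustrative values)
auto bridge.10
iface bridge.10
  # Unique address of this switch; the peer would use its own, for example 10.1.10.3/24
  address 10.1.10.2/24
  # Shared virtual MAC and gateway IP; the peer switch carries the same line
  address-virtual 00:00:5e:00:01:0a 10.1.10.1/24
```

Hosts would then point their default gateway at the shared address rather than at either switch's unique address.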

Example Configuration

auto bridge
iface bridge
  bridge-vlan-aware yes
  bridge-ports swp1 peerlink
  bridge-vids 1-2000
  bridge-stp on

auto bridge.10
iface bridge.10
  address 10.1.10.2/24

auto peerlink
iface peerlink
    bond-slaves glob swp49-50

auto swp1
iface swp1
  mstpctl-portadminedge yes
  mstpctl-bpduguard yes

auto eth1
iface eth1 inet manual

auto eth1.10
iface eth1.10 inet manual

auto eth2
iface eth2 inet manual

auto eth2.20
iface eth2.20 inet manual

auto br-10
iface br-10 inet manual
  bridge-ports eth1.10 vnet0

auto br-20
iface br-20 inet manual
  bridge-ports eth2.20 vnet1

Layer 2 - MLAG

Summary
MLAG (multi-chassis link aggregation) uses both uplinks at the same time. VRR enables both spines to act as gateways simultaneously for HA (high availability) and active-active mode (both are used at the same time).

Benefits
  • 100% of links utilized

Considerations
  • More complicated (more moving parts)
  • More configuration
  • No interoperability between vendors
  • ISL (inter-switch link) required

Active-Active Mode: VRR
Active-Passive Mode: None
L2 to L3 Demarcation:
  • ToR layer (recommended)
  • Spine layer
  • Core/edge/exit

    Example Configuration

    auto bridge
    iface bridge
      bridge-vlan-aware yes
      bridge-ports host-01 peerlink
      bridge-vids 1-2000
      bridge-stp on
    
    auto bridge.10
    iface bridge.10
      address 172.16.1.2/24
      address-virtual 44:38:39:00:00:10 172.16.1.1/24
    
    auto peerlink
    iface peerlink
        bond-slaves glob swp49-50
    
    auto peerlink.4094
    iface peerlink.4094
        address 169.254.1.1/30
        clagd-enable yes
        clagd-peer-ip 169.254.1.2
        clagd-sys-mac 44:38:39:FF:40:94
    
    auto host-01
    iface host-01
      bond-slaves swp1
      clag-id 1
      # bond defaults removed for brevity
    
    auto bond0
    iface bond0 inet manual
      bond-slaves eth0 eth1
      # bond defaults removed for brevity
    
    auto bond0.10
    iface bond0.10 inet manual
    
    auto vm-br10
    iface vm-br10 inet manual
      bridge-ports bond0.10 vnet0
    

    Layer 3 - Single-attached Hosts

    Summary
    The server (physical host) has only one link to one ToR switch.

    Benefits
    • Relatively simple network configuration
    • No STP
    • No MLAG
    • No layer 2 loops
    • No crosslink between leafs
    • Greater route scaling and flexibility

    Considerations
    • No redundancy for ToR; upgrades can cause downtime.
    • There is often no software to support application layer redundancy.

    FHR (First Hop Redundancy)
    • No redundancy for ToR; uses a single ToR as the gateway.

    More Information
    • For additional bandwidth, links between host and leaf can be bonded.

    Example Configuration

    leaf01 /etc/network/interfaces file

    auto swp1
    iface swp1
      address 172.16.1.1/30
    

    leaf01 /etc/frr/frr.conf file

    router ospf
      router-id 10.0.0.11
    interface swp1
      ip ospf area 0
    

    leaf02 /etc/network/interfaces file

    auto swp1
    iface swp1
      address 172.16.2.1/30
    

    leaf02 /etc/frr/frr.conf file

    router ospf
      router-id 10.0.0.12
    interface swp1
      ip ospf area 0
    
    /etc/network/interfaces file on the host attached to leaf01

    auto eth1
    iface eth1 inet static
      address 172.16.1.2/30
      up ip route add 0.0.0.0/0 nexthop via 172.16.1.1
    
    /etc/network/interfaces file on the host attached to leaf02

    auto eth1
    iface eth1 inet static
      address 172.16.2.2/30
      up ip route add 0.0.0.0/0 nexthop via 172.16.2.1
    

    Layer 3 - Redistribute Neighbor

    Summary
    The redistribute neighbor daemon learns ARP entries dynamically and places them in a dedicated routing table; FRRouting then redistributes the entries from that table into the fabric.

    Benefits
    • Configuration in FRRouting is simple (route map plus redistribute table)
    • Supported by Cumulus Networks

    Considerations
    • Silent hosts do not receive traffic (depending on ARP).
    • IPv4 only.
    • If two VMs are on the same layer 2 domain, they can learn about each other directly instead of using the gateway, which causes problems (for example, with VM migration or with getting the network routed). Put hosts on a /32 so that they have no other layer 2 adjacency.
    • VM moves do not trigger a route withdrawal from the original leaf (four hour timeout).
    • Clearing ARP impacts routing.
    • No layer 2 adjacency between servers without VXLAN.

    FHR (First Hop Redundancy)
    • An equal cost route is installed on the server, host, or hypervisor to both ToRs to load balance evenly.
    • For host/VM/container mobility, use the same default route on all hosts (such as x.x.x.1) but do not distribute or advertise the .1 on the ToR into the fabric. This allows the VM to use the same gateway no matter which pair of leafs it is cabled to.

    More Information
    • Cumulus Networks blog post introducing redistribute neighbor
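
The "route map plus redistribute table" configuration mentioned above might look roughly like the following frr.conf fragment on a leaf. This is a sketch under assumptions: table 10 is assumed to be the table that the redistribute neighbor daemon populates, the route map name REDIST_NEIGHBOR and the host-facing interface swp1 are illustrative, and OSPF is used only to match the other examples in this chapter.

```
! /etc/frr/frr.conf fragment on a leaf (illustrative values)
! Import the kernel table that holds the learned host entries (assumed: table 10)
ip import-table 10
!
! Only accept entries learned on the host-facing port (assumed: swp1)
route-map REDIST_NEIGHBOR permit 10
  match interface swp1
!
router ospf
  router-id 10.0.0.11
  ! Advertise the imported host routes into the fabric
  redistribute table 10 route-map REDIST_NEIGHBOR
```

The route map keeps entries learned on other interfaces (for example, the uplinks) from being redistributed.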

    Layer 3 - Routing on the Host

    Summary
    Routing on the host means there is a routing application (such as FRRouting) running either on the bare metal host (no VMs or containers) or on the hypervisor (for example, Ubuntu with KVM). This is highly recommended by the Cumulus Networks Professional Services team.

    Benefits
    • No requirement for MLAG
    • No spanning tree or layer 2 domain
    • No loops
    • You can use three or more ToRs instead of the usual two
    • Host and VM mobility
    • You can use traffic engineering to migrate traffic from one ToR to another when upgrading both hardware and software

    Considerations
    • The hypervisor or host OS might not support a routing application like FRRouting, in which case a virtual router is required on the hypervisor.
    • No layer 2 adjacency between servers without VXLAN.

    FHR (First Hop Redundancy)
    • The first hop is still the ToR, just like redistribute neighbor
    • A default route can be advertised by all leaf/ToRs for dynamic ECMP paths
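
As an illustration of routing on the host, the frr.conf fragment below sketches a host that peers with two ToRs over its uplinks and advertises its own /32. Everything specific here is an assumption made for the example: the AS number 65101, the router ID and loopback prefix 10.0.0.101/32, the uplink names eth1 and eth2, and the choice of BGP with interface-based (unnumbered) peering.

```
! /etc/frr/frr.conf fragment on the host (illustrative values)
router bgp 65101
  bgp router-id 10.0.0.101
  ! One eBGP session per uplink, one to each ToR
  neighbor eth1 interface remote-as external
  neighbor eth2 interface remote-as external
  !
  address-family ipv4 unicast
    ! Advertise the host address configured on the loopback
    network 10.0.0.101/32
  exit-address-family
```

On each ToR, the matching neighbor statement for the host-facing port, together with default-originate in the IPv4 address family, would provide the advertised default route described above. Some upstream FRRouting releases also expect an explicit policy on eBGP sessions before routes are exchanged.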

    Layer 3 - Routing on the VM

    Summary
    Instead of routing on the hypervisor, each virtual machine uses its own routing stack.

    Benefits
    In addition to the benefits of routing on the host:
    • The hypervisor/base OS does not need to be able to do routing
    • VMs can be authenticated into the routing fabric

    Considerations
    • All VMs must be capable of routing.
    • You need to take scale into account; instead of one routing process, there are as many as there are VMs.
    • No layer 2 adjacency between servers without VXLAN.

    FHR (First Hop Redundancy)
    • The first hop is still the ToR, just like redistribute neighbor
    • You can use multiple ToRs (two or more)

      Layer 3 - Virtual Router

      Summary
      A virtual router (vRouter) runs as a VM on the hypervisor or host and sends routes to the ToR using BGP or OSPF.

      Benefits
      In addition to the benefits of routing on the host:
      • Multi-tenancy can work, where multiple customers share the same racks
      • The base OS does not need to be routing capable

      Considerations
      • ECMP might not work correctly (load balancing to multiple ToRs); older versions of the Linux kernel cannot do ECMP per flow (they load balance per packet).
      • No layer 2 adjacency between servers without VXLAN.

      FHR (First Hop Redundancy)
      • The gateway is the vRouter, which has two routes out (two ToRs)
      • You can use multiple vRouters

      Layer 3 - Anycast with Manual Redistribution

      Summary
      In contrast to routing on the host (preferred), this method allows you to route to the host. The ToRs are the gateway, as with redistribute neighbor, except that because no daemon is running, you must manually configure the networks under the routing process. Traffic can be black holed unless you run a script to remove the routes when the host no longer responds.

      Benefits
      • Most benefits of routing on the host
      • No requirement for host to run routing
      • No requirement for redistribute neighbor

      Considerations
      • Removing a subnet from one ToR and re-adding it to another (network statements from your router process) is a manual process.
      • The network team and server team have to be in sync, or the server team controls the ToR, or automation is used whenever VM migration occurs.
      • When using VMs or containers, it is very easy to black hole traffic, as the leafs continue to advertise prefixes even when the VM is down.
      • No layer 2 adjacency between servers without VXLAN.

      FHR (First Hop Redundancy)
      • The gateways are the ToRs, exactly like redistribute neighbor with an equal cost route installed.

      Example Configuration

      /etc/network/interfaces file

      auto swp1
      iface swp1
        address 172.16.1.1/30
      

      /etc/frr/frr.conf file

      router ospf
        router-id 10.0.0.11
      interface swp1
        ip ospf area 0
      

      /etc/network/interfaces file

      auto swp2
      iface swp2
        address 172.16.1.1/30
      

      /etc/frr/frr.conf file

      router ospf
        router-id 10.0.0.12
      interface swp2
        ip ospf area 0
      
      auto lo
      iface lo inet loopback
      
      auto lo:1
      iface lo:1 inet static
        address 172.16.1.2/32
        up ip route add 0.0.0.0/0 nexthop via 172.16.1.1 dev eth1 onlink nexthop via 172.16.1.1 dev eth2 onlink
      
      auto eth1
      iface eth1 inet static
        address 172.16.1.2/32
      
      auto eth2
      iface eth2 inet static
        address 172.16.1.2/32
      

      Layer 3 - EVPN with Symmetric VXLAN Routing

      Symmetric VXLAN routing is configured directly on the ToR, using EVPN for both VLAN and VXLAN bridging as well as VXLAN and external routing.

      Each server is configured on a VLAN, with a total of two VLANs for the setup. MLAG is also set up between the servers and the leafs. Each leaf is configured with an anycast gateway, and each server's default gateway points to the anycast gateway address on the corresponding leaf switches. Two tenant VNIs (corresponding to two VLANs/VXLANs) are bridged to the corresponding VLANs.

      Benefits
      • Layer 2 domain is reduced to the pair of ToRs
      • Aggregation layer is all layer 3 (VLANs do not have to exist on spine switches)
      • Greater route scaling and flexibility
      • High availability

      Considerations
      • Needs MLAG (with the same considerations as the MLAG section above).

      Active-Active Mode: VRR
      Active-Passive Mode: None
      Demarcation: ToR layer

      Example /etc/network/interfaces File Configuration

      leaf01

      # Loopback interface
      auto lo
      iface lo inet loopback
        address 10.0.0.11/32
        clagd-vxlan-anycast-ip 10.0.0.112
        alias loopback interface
      
      # Management interface
       auto eth0
       iface eth0 inet dhcp
          vrf mgmt
      
      auto mgmt
      iface mgmt
          address 127.0.0.1/8
          address ::1/128
          vrf-table auto
      
      # Port to Server01
      auto swp1
      iface swp1
        alias to Server01
        # This is required for Vagrant only
        post-up ip link set swp1 promisc on
      
      # Port to Server02
      auto swp2
      iface swp2
        alias to Server02
        # This is required for Vagrant only
        post-up ip link set swp2 promisc on
      
      # Port to Leaf02
      auto swp49
      iface swp49
        alias to Leaf02
        # This is required for Vagrant only
        post-up ip link set swp49 promisc on
      
      # Port to Leaf02
      auto swp50
      iface swp50
        alias to Leaf02
        # This is required for Vagrant only
        post-up ip link set swp50 promisc on
      
      # Port to Spine01
      auto swp51
      iface swp51
        mtu 9216
        alias to Spine01
      
      # Port to Spine02
      auto swp52
      iface swp52
        mtu 9216
        alias to Spine02
      
      # MLAG Peerlink bond
      auto peerlink
      iface peerlink
        mtu 9000
        bond-slaves swp49 swp50
      
      # MLAG Peerlink L2 interface.
      # This creates VLAN 4094 that only lives on the peerlink bond
      # No other interface will be aware of VLAN 4094
      auto peerlink.4094
      iface peerlink.4094
        address 169.254.1.1/30
        clagd-peer-ip 169.254.1.2
        clagd-backup-ip 10.0.0.12
        clagd-sys-mac 44:39:39:ff:40:94
        clagd-priority 100
      
      # Bond to Server01
      auto bond01
      iface bond01
        mtu 9000
        bond-slaves swp1
        bridge-access 13
        clag-id 1
      
      # Bond to Server02
      auto bond02
      iface bond02
        mtu 9000
        bond-slaves swp2
        bridge-access 24
        clag-id 2
      
      # Define the bridge for STP
      auto bridge
      iface bridge
        bridge-vlan-aware yes
        # bridge-ports includes all ports related to VxLAN and CLAG.
        # does not include the Peerlink.4094 subinterface
        bridge-ports bond01 bond02 peerlink vni13 vni24 vxlan4001
        bridge-vids 13 24
        bridge-pvid 1
      
      # VXLAN Tunnel for Server1-Server3 (Vlan 13)
      auto vni13
      iface vni13
        mtu 9000
        vxlan-id 13
        vxlan-local-tunnelip 10.0.0.11
        bridge-access 13
        mstpctl-bpduguard yes
        mstpctl-portbpdufilter yes
      
      #VXLAN Tunnel for Server2-Server4 (Vlan 24)
      auto vni24
      iface vni24
        mtu 9000
        vxlan-id 24
        vxlan-local-tunnelip 10.0.0.11
        bridge-access 24
        mstpctl-bpduguard yes
        mstpctl-portbpdufilter yes
      
      auto vxlan4001
      iface vxlan4001
          vxlan-id 104001
          vxlan-local-tunnelip 10.0.0.11
          bridge-access 4001
      
      auto vrf1
      iface vrf1
         vrf-table auto
      
      #Tenant SVIs - anycast GW
      auto vlan13
      iface vlan13
          address 10.1.3.11/24
          address-virtual 44:39:39:ff:00:13 10.1.3.1/24
          vlan-id 13
          vlan-raw-device bridge
          vrf vrf1
      
      auto vlan24
      iface vlan24
          address 10.2.4.11/24
          address-virtual 44:39:39:ff:00:24 10.2.4.1/24
          vlan-id 24
          vlan-raw-device bridge
          vrf vrf1
      
      #L3 VLAN interface per tenant (for L3 VNI)
      auto vlan4001
      iface vlan4001
          hwaddress 44:39:39:FF:40:94
          vlan-id 4001
          vlan-raw-device bridge
          vrf vrf1
      
      leaf02

      # Loopback interface
      auto lo
      iface lo inet loopback
        address 10.0.0.12/32
        clagd-vxlan-anycast-ip 10.0.0.112
        alias loopback interface
      
      # Management interface
      auto eth0
      iface eth0 inet dhcp
          vrf mgmt
      
      auto mgmt
      iface mgmt
          address 127.0.0.1/8
          address ::1/128
          vrf-table auto
      
      # Port to Server01
      auto swp1
      iface swp1
        alias to Server01
        # This is required for Vagrant only
        post-up ip link set swp1 promisc on
      
      # Port to Server02
      auto swp2
      iface swp2
        alias to Server02
        # This is required for Vagrant only
        post-up ip link set swp2 promisc on
      
      # Port to Leaf01
      auto swp49
      iface swp49
        alias to Leaf01
        # This is required for Vagrant only
        post-up ip link set swp49 promisc on
      
      # Port to Leaf01
      auto swp50
      iface swp50
        alias to Leaf01
        # This is required for Vagrant only
        post-up ip link set swp50 promisc on
      
      # Port to Spine01
      auto swp51
      iface swp51
        mtu 9216
        alias to Spine01
      
      # Port to Spine02
      auto swp52
      iface swp52
        mtu 9216
        alias to Spine02
      
      # MLAG Peerlink bond
      auto peerlink
      iface peerlink
        mtu 9000
        bond-slaves swp49 swp50
      
      # MLAG Peerlink L2 interface.
      # This creates VLAN 4094 that only lives on the peerlink bond
      # No other interface will be aware of VLAN 4094
      auto peerlink.4094
      iface peerlink.4094
        address 169.254.1.2/30
        clagd-peer-ip 169.254.1.1
        clagd-backup-ip 10.0.0.11
        clagd-sys-mac 44:39:39:ff:40:94
        clagd-priority 200
      
      # Bond to Server01
      auto bond01
      iface bond01
        mtu 9000
        bond-slaves swp1
        bridge-access 13
        clag-id 1
      
      # Bond to Server02
      auto bond02
      iface bond02
        mtu 9000
        bond-slaves swp2
        bridge-access 24
        clag-id 2
      
      # Define the bridge for STP
      auto bridge
      iface bridge
        bridge-vlan-aware yes
        # bridge-ports includes all ports related to VxLAN and CLAG.
        # does not include the Peerlink.4094 subinterface
        bridge-ports bond01 bond02 peerlink vni13 vni24 vxlan4001
        bridge-vids 13 24
        bridge-pvid 1
      
      auto vxlan4001
      iface vxlan4001
           vxlan-id 104001
           vxlan-local-tunnelip 10.0.0.12
           bridge-access 4001
      
      # VXLAN Tunnel for Server1-Server3 (Vlan 13)
      auto vni13
      iface vni13
        mtu 9000
        vxlan-id 13
        vxlan-local-tunnelip 10.0.0.12
        bridge-access 13
        mstpctl-bpduguard yes
        mstpctl-portbpdufilter yes
      
      #VXLAN Tunnel for Server2-Server4 (Vlan 24)
      auto vni24
      iface vni24
        mtu 9000
        vxlan-id 24
        vxlan-local-tunnelip 10.0.0.12
        bridge-access 24
        mstpctl-bpduguard yes
        mstpctl-portbpdufilter yes
      
      auto vrf1
      iface vrf1
         vrf-table auto
      
      auto vlan13
      iface vlan13
          address 10.1.3.12/24
          address-virtual 44:39:39:ff:00:13 10.1.3.1/24
          vlan-id 13
          vlan-raw-device bridge
          vrf vrf1
      
      auto vlan24
      iface vlan24
          address 10.2.4.12/24
          address-virtual 44:39:39:ff:00:24 10.2.4.1/24
          vlan-id 24
          vlan-raw-device bridge
          vrf vrf1
      
      #L3 VLAN interface per tenant (for L3 VNI)
      auto vlan4001
      iface vlan4001
          hwaddress 44:39:39:FF:40:94
          vlan-id 4001
          vlan-raw-device bridge
          vrf vrf1
      
      server01

      auto lo
      iface lo inet loopback
      
      auto eth0
      iface eth0 inet dhcp
      
      auto eth1
      iface eth1 inet manual
        bond-master uplink
        # Required for Vagrant
        post-up ip link set promisc on dev eth1
      
      auto eth2
      iface eth2 inet manual
        bond-master uplink
        # Required for Vagrant
        post-up ip link set promisc on dev eth2
      
      auto uplink
      iface uplink inet static
        mtu 9000
        bond-slaves none
        bond-mode 802.3ad
        bond-miimon 100
        bond-lacp-rate 1
        bond-min-links 1
        bond-xmit-hash-policy layer3+4
        address 10.1.3.101
        netmask 255.255.255.0
        post-up ip route add default via 10.1.3.1
      
      server02

      auto lo
      iface lo inet loopback
      
      auto eth0
      iface eth0 inet dhcp
      
      auto eth1
      iface eth1 inet manual
        bond-master uplink
        # Required for Vagrant
        post-up ip link set promisc on dev eth1
      
      auto eth2
      iface eth2 inet manual
        bond-master uplink
        # Required for Vagrant
        post-up ip link set promisc on dev eth2
      
      auto uplink
      iface uplink inet static
        mtu 9000
        bond-slaves none
        bond-mode 802.3ad
        bond-miimon 100
        bond-lacp-rate 1
        bond-min-links 1
        bond-xmit-hash-policy layer3+4
        address 10.2.4.102
        netmask 255.255.255.0
        post-up ip route add default via 10.2.4.1