Maglev Load Balancing Infrastructure

High-performance, highly available load balancing infrastructure using BGP ECMP, IPVS Maglev hashing, and HAProxy.

Architecture Overview

Network Separation:

  • Control Plane (BGP): Runs on mgmt network (10.1.7.x / 2a0a:7180:81:7::x)
  • Data Plane (Traffic): Flows through ingress0 network (10.1.10.x / 2a0a:7180:81:a::x)

This separation allows BGP sessions to remain stable on the management network while data traffic flows through dedicated high-performance interfaces.

%% Enhanced Maglev / Gateway topology diagram (with .x.y addressing)
graph TD
    %% --- Internet and Router ---
    Internet(["🌐 Internet Traffic"])
    MikroTik(["🛜 Mikrotik Router<br/>int-gw03m<br/>.7.2"])

    Internet --> MikroTik

    %% --- Gateways ---
    subgraph Gateways["💠 Gateway Layer"]
        MGW1(["mgw01<br/>Control: .7.4<br/>Data: .10.4<br/><b>FRR Route Reflector</b>"])
        MGW2(["mgw02<br/>Control: .7.5<br/>Data: .10.5<br/><b>FRR Route Reflector</b>"])
    end

    MikroTik -->|iBGP Active<br/>Control: .7.4| MGW1
    MikroTik -.->|iBGP Standby<br/>Control: .7.5| MGW2

    %% --- Maglevs ---
    subgraph Maglevs["⚙️ Maglev Layer"]
        MAG1(["maglev01<br/>Control: .7.17<br/>Data: .10.17<br/><b>ExaBGP + IPVS</b>"])
        MAG2(["maglev02<br/>Control: .7.18<br/>Data: .10.18<br/><b>ExaBGP + IPVS</b>"])
    end

    MGW1 -->|ECMP 50%<br/>5-tuple hash| MAG1
    MGW1 -->|ECMP 50%<br/>5-tuple hash| MAG2
    MGW2 -->|ECMP 50%| MAG1
    MGW2 -->|ECMP 50%| MAG2

    MGW1 -.->|BFD &lt;1s failover| MAG1
    MGW1 -.->|BFD &lt;1s failover| MAG2
    MGW2 -.->|BFD &lt;1s failover| MAG1
    MGW2 -.->|BFD &lt;1s failover| MAG2

    %% --- IPVS Routing Layer ---
    IPVS{{"⚡ IPVS Routing<br/>Fwmark + mh-hash"}}

    MAG1 --> IPVS
    MAG2 --> IPVS

    %% --- Load Balancers ---
    subgraph MLBs["🧭 L7 Backends"]
        subgraph TraditionalHA["Traditional HAProxy"]
            MLB1(["mlb01<br/>Control: .7.33<br/>Data: .10.33"])
            MLB2(["mlb02<br/>Control: .7.34<br/>Data: .10.34"])
            MLB3(["mlb03<br/>Control: .7.35<br/>Data: .10.35"])
        end
        subgraph K8sHA["Kubernetes + HAProxy Ingress"]
            K8S1(["k0s01<br/>Control: .7.61<br/>Data: .10.61"])
            K8S2(["k0s02<br/>Control: .7.62<br/>Data: .10.62"])
            K8S3(["k0s03<br/>Control: .7.63<br/>Data: .10.63"])
        end
    end

    IPVS -.-> MLB1
    IPVS -.-> MLB2
    IPVS -.-> MLB3
    IPVS -.-> K8S1
    IPVS -.-> K8S2
    IPVS -.-> K8S3

    %% --- Application Servers ---
    subgraph Apps["🧩 Application Layer"]
        APP1(["App Server 1<br/>:80/:443"])
        APP2(["App Server 2<br/>:80/:443"])
        APP3(["App Server N<br/>:80/:443"])
    end

    MLB1 --> APP1
    MLB1 --> APP2
    MLB1 --> APP3
    MLB2 --> APP1
    MLB2 --> APP2
    MLB2 --> APP3
    MLB3 --> APP1
    MLB3 --> APP2
    MLB3 --> APP3

    MLBs -.DSR (Direct Server Return).-> Internet

    %% --- Styling ---
    classDef router fill:#fef7e0,stroke:#d6b600,stroke-width:1px,color:#333;
    classDef gateway fill:#e1f5ff,stroke:#0288d1,stroke-width:1px,color:#000;
    classDef maglev fill:#fff4e1,stroke:#f57f17,stroke-width:1px,color:#000;
    classDef mlb fill:#e8f5e9,stroke:#388e3c,stroke-width:1px,color:#000;
    classDef k8s fill:#e3f2fd,stroke:#1976d2,stroke-width:1px,color:#000;
    classDef app fill:#f3e5f5,stroke:#7b1fa2,stroke-width:1px,color:#000;

    class MikroTik router;
    class MGW1,MGW2 gateway;
    class MAG1,MAG2 maglev;
    class MLB1,MLB2,MLB3 mlb;
    class K8S1,K8S2,K8S3 k8s;
    class APP1,APP2,APP3 app;

VIP (Virtual IP)

  • IPv4: 10.1.7.16/32
  • IPv6: 2a0a:7180:81:7::10/128

Announced by both maglev nodes via BGP, distributed via ECMP at gateway level.

Simplified VIP Management

VIPs are now configured on a dedicated dummy interface (vip) on both maglev and HAProxy nodes.

Benefits:

  • Add new VIPs in seconds - just edit netplan, no service restarts needed
  • FWMark-based IPVS - traffic routed by packet mark, not destination IP
  • Zero-config BGP - ExaBGP auto-discovers VIPs from dummy interface
  • Cleaner separation - VIPs isolated from loopback and other interfaces

Adding a new VIP:

# 1. Add to /etc/netplan/53-vip.yaml on all nodes
# 2. Run: sudo netplan apply
# 3. Done! Traffic flows immediately (for ports 80/443)

No keepalived restart, no ExaBGP restart, no haproxy restart required!
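
A minimal sketch of what /etc/netplan/53-vip.yaml can look like, assuming a netplan release new enough to support dummy-devices; the Ansible-managed file is authoritative and the addresses shown are the ones used in this document:

# /etc/netplan/53-vip.yaml (illustrative sketch, not the deployed file)
network:
  version: 2
  dummy-devices:
    vip:
      addresses:
        - 10.1.7.16/32
        - "2a0a:7180:81:7::10/128"

After sudo netplan apply, ip -o addr show dev vip should list the new address on every node.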

BGP Topology

int-gw03m (10.1.7.2)
    ↓ iBGP (active/standby)
    ├─→ mgw01 (10.1.7.4) - Route Reflector [ACTIVE]
    └─→ mgw02 (10.1.7.5) - Route Reflector [STANDBY]
            ↓ iBGP (RR Client)
            ├─→ maglev01 (10.1.7.17)
            └─→ maglev02 (10.1.7.18)

Current setup:

  • Mikrotik peers with both gateways (mgw01 and mgw02) via BGP
  • Active/Standby routing: Traffic goes to mgw01, fails over to mgw02 if needed
  • Why not ECMP? Current Mikrotik hardware/config limitation (not datacenter-grade)
  • Both gateways act as route reflectors for maglev01/02
  • Blocks inbound routes from upstream (route-map NO-IN)
  • Accepts VIP announcements from maglev nodes
  • Each gateway distributes traffic via ECMP with maximum-paths 2

Potential enhancement (datacenter setup):

  • Mikrotik ECMP: Configure BGP multipath on Mikrotik
  • Distribute traffic 50/50 across both gateways
  • Requires datacenter-grade router with proper ECMP support

Network Plane Separation

The infrastructure separates control plane (BGP) from data plane (traffic forwarding) for improved stability and performance:

Control Plane Network (mgmt)

  • Purpose: BGP sessions, BFD, management access
  • IPv4: 10.1.7.x/24
  • IPv6: 2a0a:7180:81:7::x/64
  • Components: FRR BGP, ExaBGP, BFD, SSH management

Benefits:

  • BGP sessions remain stable regardless of data plane load
  • Management access independent of user traffic
  • Easier troubleshooting and monitoring
  • Security isolation

Data Plane Network (ingress0)

  • Purpose: Actual user traffic forwarding
  • IPv4: 10.1.10.x/24
  • IPv6: 2a0a:7180:81:a::x/64
  • Components: IPVS forwarding, HAProxy traffic

Benefits:

  • Dedicated bandwidth for user traffic
  • No BGP overhead on data interfaces
  • Can use higher-performance NICs/VLANs for data
  • Better QoS and traffic engineering

Next-Hop Rewriting

Gateways use route-maps to rewrite BGP next-hops to the data plane network:

route-map NH-INGRESS permit 10
 set ip next-hop 10.1.10.4              # mgw01 ingress0 IPv4
 set ipv6 next-hop global 2a0a:7180:81:a::4  # mgw01 ingress0 IPv6

Flow:

  1. Maglev nodes announce VIPs via BGP on control plane (10.1.7.17/18)
  2. Gateways receive BGP updates on control plane
  3. Gateways rewrite next-hop to ingress0 addresses (10.1.10.17/18) before sending to upstream
  4. Data traffic flows through ingress0 network
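
To spot-check the rewrite on a gateway, standard FRR show commands can be used; the peer address below is the upstream Mikrotik from this document (10.1.7.2):

# Route as learned from a maglev RR client (control-plane next-hop expected)
sudo vtysh -c "show bgp ipv4 unicast 10.1.7.16/32"

# Route as advertised to the upstream, after the NH-INGRESS route-map is applied
# (next-hop should now be an ingress0 address, e.g. 10.1.10.17)
sudo vtysh -c "show bgp ipv4 unicast neighbors 10.1.7.2 advertised-routes"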

Dynamic VIP Discovery and Announcement

The ExaBGP announcement script (l4/exabgp/announce-active-active.sh) automatically manages VIP announcements:

Key features:

  • Reads VIPs from dummy interface: Discovers all VIPs directly from vip interface
  • Aggregated health per family: Announces ALL IPv4 VIPs if any IPv4 backend is healthy
  • Simpler logic: No per-VIP granularity - family-wide health checks
  • Auto-detected next-hop: Uses ingress0 interface IP addresses automatically
  • Zero-configuration: Add VIPs to netplan without touching the script

How it works:

# Discovers VIPs from vip interface
get_vips_v4() {
    ip -4 -o addr show dev vip | awk '{split($4,a,"/"); print a[1]}'
}

get_vips_v6() {
    ip -6 -o addr show dev vip | awk '{split($4,a,"/"); print a[1]}'
}

# Auto-detects next-hop from ingress0 interface
NEXTHOP_V4=$(ip -4 addr show dev ingress0 | ...)
NEXTHOP_V6=$(ip -6 addr show dev ingress0 | ...)

# Aggregate health check per family
all_ipv4_services_healthy()  # Check ALL IPv4 IPVS services
all_ipv6_services_healthy()  # Check ALL IPv6 IPVS services

# Announce/withdraw based on aggregate health
if all_ipv4_services_healthy; then
    for vip in $(get_vips_v4); do
        announce route ${vip}/32 next-hop ${NEXTHOP_V4}
    done
fi

Benefits:

  • Add new VIPs by editing netplan only - zero config changes
  • Works with fwmark-based IPVS (ports 80/443)
  • Automatically adapts to network configuration
  • Simpler deployment and maintenance

Health logic:

  • Old approach: Per-VIP checks (if VIP X has backends → announce VIP X)
  • New approach: Aggregated per family (if ANY IPv4 service has backends → announce ALL IPv4 VIPs)
  • Works well with fwmark-based IPVS where multiple VIPs share backend pools

Components

Layer 3/4: Gateways (mgw01/mgw02)

Location: gw/

Purpose: BGP route reflector and ECMP distribution

Technology:

  • FRR (Free Range Routing)
  • BGP route reflection
  • BFD (Bidirectional Forwarding Detection)
  • ECMP with L3+4 hashing

Key Features:

  • Routes traffic from upstream (Mikrotik) to maglev cluster
  • Dual gateways for redundancy (currently active/standby at Mikrotik level)
  • Each gateway distributes traffic 50/50 via ECMP to maglev nodes (5-tuple hash)
  • Subsecond failure detection with BFD (< 1s)
  • Route-reflector with outbound policy support for next-hop rewriting

BGP Route Reflector Configuration:

The gateways act as BGP route reflectors with a critical feature enabled:

router bgp 64516
 bgp route-reflector allow-outbound-policy

Why this is required:

By default, FRR route reflectors do NOT apply outbound policies (like route-maps) to reflected routes. This is per BGP standard behavior to preserve routing information integrity. However, we need to rewrite next-hops from control plane to data plane addresses.

Without allow-outbound-policy:

Maglev announces: 10.1.7.16/32 via 10.1.7.17 (control plane)
Gateway reflects: 10.1.7.16/32 via 10.1.7.17 (unchanged!)
Upstream receives: 10.1.7.16/32 via 10.1.7.17 (WRONG - control plane IP)

With allow-outbound-policy:

Maglev announces: 10.1.7.16/32 via 10.1.7.17 (control plane)
Gateway applies NH-INGRESS route-map
Gateway reflects: 10.1.7.16/32 via 10.1.10.17 (data plane)
Upstream receives: 10.1.7.16/32 via 10.1.10.17 (CORRECT - data plane IP)

The NH-INGRESS route-map rewrites next-hops to ingress0 addresses:

route-map NH-INGRESS permit 10
 set ip next-hop 10.1.10.4              # Data plane IPv4
 set ipv6 next-hop global 2a0a:7180:81:a::4  # Data plane IPv6

This ensures traffic flows through the data plane network (ingress0) while BGP control stays on the management network.
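
Putting the two pieces together, the relevant FRR fragment looks roughly like the following sketch; the neighbor address and exact statement placement are illustrative, and the configuration under gw/ is authoritative:

router bgp 64516
 bgp route-reflector allow-outbound-policy
 address-family ipv4 unicast
  neighbor 10.1.7.2 route-map NH-INGRESS out
 exit-address-family
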

Hosts:

Hostname          Control IPv4   Control IPv6        Data IPv4    Data IPv6           Role
mgmt02-vm-mgw01   10.1.7.4       2a0a:7180:81:7::4   10.1.10.4    2a0a:7180:81:a::4   BGP RR + ECMP (Primary)
mgmt02-vm-mgw02   10.1.7.5       2a0a:7180:81:7::5   10.1.10.5    2a0a:7180:81:a::5   BGP RR + ECMP (Secondary)

Layer 4: Maglev Nodes (maglev01/02)

Location: l4/

Purpose: Active-active IPVS load balancing with Maglev consistent hashing

Technology:

  • ExaBGP (BGP route announcements)
  • keepalived (IPVS management with fwmark-based services, no VRRP)
  • nftables (packet marking based on interface + port)
  • IPVS with Maglev (mh) scheduler
  • BFD for fast failure detection
  • Direct Routing (DR) mode

Key Features:

  • Both nodes active simultaneously (no failover)
  • FWMark-based IPVS - traffic routed by packet mark, enabling flexible VIP:port→backend-group mappings
  • Simplified VIP management - add VIPs without keepalived config changes
  • Multiple backend groups - different VIP:port combinations route to different L7 groups (mlb, k0s, etc.)
  • Announces VIPs to gateway via ExaBGP (auto-discovered from vip interface)
  • Health-based route announcement (aggregated per IP family)
  • Maglev hashing with mh-port for per-connection distribution
  • DR mode for optimal performance (no NAT overhead)
  • nftables marks incoming packets based on interface+VIP+port (see CONFIGURATION.md and the sketch below)
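
A minimal sketch of the fwmark plumbing referenced above, using mark 100 for 10.1.7.16:80 as in the traffic-flow example; table/chain names and the mark value are illustrative, the real rules are generated from the fwmark mappings in CONFIGURATION.md, and the IPVS services are managed by keepalived:

# nftables: mark packets arriving on ingress0 for a given VIP:port (illustrative values)
sudo nft add table ip mangle
sudo nft 'add chain ip mangle prerouting { type filter hook prerouting priority -150; }'
sudo nft 'add rule ip mangle prerouting iifname "ingress0" ip daddr 10.1.7.16 tcp dport 80 meta mark set 100'

# IPVS: fwmark-based virtual service with Maglev hashing and DR real servers (mlb group)
sudo ipvsadm -A -f 100 -s mh -b mh-port
sudo ipvsadm -a -f 100 -r 10.1.10.33 -g -w 1
sudo ipvsadm -a -f 100 -r 10.1.10.34 -g -w 1
sudo ipvsadm -a -f 100 -r 10.1.10.35 -g -w 1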

Hosts:

Hostname             Control IPv4   Control IPv6         Data IPv4    Data IPv6            Role
mgmt02-vm-maglev01   10.1.7.17      2a0a:7180:81:7::11   10.1.10.17   2a0a:7180:81:a::11   IPVS + ExaBGP
mgmt02-vm-maglev02   10.1.7.18      2a0a:7180:81:7::12   10.1.10.18   2a0a:7180:81:a::12   IPVS + ExaBGP

Layer 7: L7 Backends

The cluster supports multiple types of L7 backends for different workloads:

dnsdist + PowerDNS (DNS Load Balancer)

Location: l7/dnsdist/

Purpose: L7 DNS load balancing with PowerDNS authoritative servers

Technology:

  • dnsdist (DNS load balancer)
  • PowerDNS Authoritative Server (secondary mode)
  • LightningStream (zone replication via S3)
  • PROXY protocol for client IP preservation
  • Shadow primary (dns00) with PowerAdmin for zone management

Key Features:

  • L7 DNS load balancing with rate limiting
  • PROXY protocol to preserve client source IPs
  • Pool-based backend grouping
  • Active-active deployment (both dnsdist nodes always active)
  • PowerDNS secondary cluster with LightningStream replication
  • Shadow primary (dns00) for zone management via PowerAdmin

Hosts:

Hostname              Control IPv4   Control IPv6         Data IPv4    Data IPv6            Role
mgmt02-vm-dns00       10.1.7.40      2a0a:7180:81:7::28   N/A          N/A                  Shadow Primary (hidden)
mgmt02-vm-dnsdist01   10.1.7.41      2a0a:7180:81:7::29   10.1.10.41   2a0a:7180:81:a::29   dnsdist LB
mgmt02-vm-dnsdist02   10.1.7.42      2a0a:7180:81:7::2a   10.1.10.42   2a0a:7180:81:a::2a   dnsdist LB
mgmt02-vm-dns01       10.1.7.43      2a0a:7180:81:7::2b   10.1.10.43   2a0a:7180:81:a::2b   PowerDNS
mgmt02-vm-dns02       10.1.7.44      2a0a:7180:81:7::2c   10.1.10.44   2a0a:7180:81:a::2c   PowerDNS
mgmt02-vm-dns03       10.1.7.45      2a0a:7180:81:7::2d   10.1.10.45   2a0a:7180:81:a::2d   PowerDNS

Documentation: See l7/dnsdist/README.md for detailed DNS architecture including shadow primary, PowerAdmin, PROXY protocol, and LightningStream replication.

Traditional HAProxy (mlb01/02/03)

Location: l7/haproxy/

Purpose: HTTP/HTTPS load balancing and SSL termination

Technology:

  • HAProxy (TCP mode for L4 forwarding)
  • VIP on dummy interface (DR mode)
  • External health check API on port 9300
  • node_exporter on port 9100 for Prometheus monitoring

Key Features:

  • Receives traffic from IPVS via Direct Routing
  • VIP bound on dummy interface (vip) with ARP suppression
  • Frontends bind directly to VIPs (not the ingress0 interface) - see the sketch below
  • Health check endpoint at /health?frontend= on port 9300
  • Identical haproxy.cfg across all hosts
  • Dual-stack (IPv4 + IPv6)
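
A sketch of what VIP-bound TCP frontends can look like in haproxy.cfg; the frontend/backend names are placeholders, the backend address reuses the mock server from the performance test, and the deployed l7/haproxy/ configuration is authoritative:

frontend fe_http
    mode tcp
    bind 10.1.7.16:80
    bind [2a0a:7180:81:7::10]:80
    default_backend be_app

backend be_app
    mode tcp
    server app1 10.1.7.12:8888 check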

Hosts:

Hostname          Control IPv4   Control IPv6         Data IPv4    Data IPv6            Role
mgmt02-vm-mlb01   10.1.7.33      2a0a:7180:81:7::21   10.1.10.33   2a0a:7180:81:a::21   HAProxy
mgmt02-vm-mlb02   10.1.7.34      2a0a:7180:81:7::22   10.1.10.34   2a0a:7180:81:a::22   HAProxy
mgmt02-vm-mlb03   10.1.7.35      2a0a:7180:81:7::23   10.1.10.35   2a0a:7180:81:a::23   HAProxy

Kubernetes + HAProxy Ingress (k0s01/02/03)

Location: l7/k8s-haproxy/

Purpose: Kubernetes-native HTTP/HTTPS load balancing with HAProxy Ingress Controller

Technology:

  • k0s Kubernetes distribution
  • HAProxy Ingress Controller (DaemonSet with hostNetwork mode)
  • Calico CNI (BIRD mode)
  • VIP on loopback interface (for Calico default IP autodetection - interface=ingress0)
  • IPVS kube-proxy
  • Dual-NIC architecture - control and data plane separation

Key Features:

  • HAProxy runs as Kubernetes Ingress Controller
  • VIP bound on loopback (lo) - allows Calico default IP autodetection to use interface=ingress0
  • hostNetwork mode - HAProxy listens directly on host ports 80/443
  • Dual-NIC setup - control plane (mgmt) for SSH/k8s API, data plane (ingress0) for traffic
  • Ingress resources define routing rules
  • Dual-stack Kubernetes networking (IPv4 + IPv6)
  • Application pods deployed via Kubernetes (e.g., httpbin test app)
  • Firewall rules (nftables) restrict ports to appropriate interfaces

Hosts:

Hostname          Control IPv4   Control IPv6         Data IPv4    Data IPv6            Role
mgmt02-vm-k0s01   10.1.7.61      2a0a:7180:81:7::3d   10.1.10.61   2a0a:7180:81:a::3d   k0s worker
mgmt02-vm-k0s02   10.1.7.62      2a0a:7180:81:7::3e   10.1.10.62   2a0a:7180:81:a::3e   k0s worker
mgmt02-vm-k0s03   10.1.7.63      2a0a:7180:81:7::3f   10.1.10.63   2a0a:7180:81:a::3f   k0s worker

Note: k0s controller (k0s00 - 10.1.7.60) is separate and not used for L7 traffic forwarding.

Traffic Routing:

Different VIP:port combinations can be routed to different L7 backend groups using fwmark mappings:

  • VIP 10.1.7.15:80 → Traditional HAProxy (mlb group)
  • VIP 10.1.7.16:80/443 → Kubernetes HAProxy Ingress (k0s group)
  • VIP [2a0a:7180:81:7::10]:80/443 → Kubernetes HAProxy Ingress (k0s group)

See CONFIGURATION.md for details on fwmark-based routing configuration.

Traffic Flow

Inbound (Client → Application)

1. Client sends request to VIP (e.g., 10.1.7.16:80)
   ↓
2. Mikrotik routes to mgw01 (10.1.7.4)
   ↓
3. mgw01 ECMP selects maglev01 OR maglev02 based on 5-tuple hash
   (src_ip, src_port, dst_ip, dst_port, protocol)
   ↓
4. maglev node:
   - nftables marks packet based on interface+VIP+port (e.g., fwmark 100 for 10.1.7.16:80)
   - IPVS routes by fwmark to appropriate backend group (mlb OR k0s) using Maglev hash + mh-port
   ↓
5. L7 backend receives packet (Direct Routing - no NAT):
   - Traditional HAProxy (mlb01/02/03) - forwards to application backends
   - Kubernetes HAProxy Ingress (k0s01/02/03) - routes via Ingress to pods
   ↓
6. Application responds (e.g., httpbin pod, mock server)

Note: Fwmark-based routing allows different VIP:port combinations to route to different backend groups. Example: 10.1.7.15:80 → mlb group, 10.1.7.16:80/443 → k0s group.

Outbound (Application → Client)

1. Application responds to HAProxy
   ↓
2. HAProxy responds from VIP (10.1.7.16) directly to client
   (Direct Server Return - bypasses maglev and gateway!)
   ↓
3. Client receives response

⚡ DSR: Return traffic doesn't go through maglev or gateway!
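
One way to observe the asymmetry, using standard tcpdump filters with the interface and VIP from this document:

# On a maglev node: only client->VIP packets should appear (return traffic bypasses IPVS)
sudo tcpdump -ni ingress0 'host 10.1.7.16 and tcp port 80'

# On an mlb/k0s backend: both directions of the same flow are visible
sudo tcpdump -ni ingress0 'host 10.1.7.16 and tcp port 80'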

Why DSR?

  • Inbound: 12K req/s × 1KB = 12 MB/s
  • Outbound: 12K req/s × 10KB = 120 MB/s
  • DSR saves 120 MB/s on maglev/gateway!

Key Technologies Explained

BGP ECMP (Equal-Cost Multi-Path)

What: Multiple equal-cost paths to the same destination

Where it's used:

  1. Gateways → Maglev nodes (Active): Each gateway sees VIP announced by both maglev nodes

    10.1.7.16/32 via 10.1.7.17 (maglev01)
    10.1.7.16/32 via 10.1.7.18 (maglev02)
    
  2. Mikrotik → Gateways (Potential): Requires datacenter-grade router

    • Current: Active/Standby (Mikrotik limitation)
    • Potential: ECMP across both gateways with capable upstream router

Distribution: 5-tuple hash (src IP + src port + dst IP + dst port + protocol)

Config: net.ipv4.fib_multipath_hash_policy=1 on gateway

Maglev Consistent Hashing

What: Google's Maglev hashing algorithm for consistent load distribution

Why: Minimal disruption when backends change (consistent hashing property)

How: IPVS mh scheduler with mh-port flag

Key: mh-port includes source port in hash for better distribution

  • Without: All connections from same client → same backend
  • With: Different source ports → distributed across backends

Config: ipvsadm -A -t 10.1.7.16:80 -s mh -b mh-port

BFD (Bidirectional Forwarding Detection)

What: Fast failure detection protocol

Why: BGP takes 90-180s to detect failure, BFD takes < 1s

How: Lightweight packets every 300ms, 3 missed = failure

Performance:

  • Detection time: < 1 second
  • Bandwidth: ~3 Kbps (negligible)
  • CPU: < 0.3%

Result: During node failure, only 85 errors out of 1.4M requests (0.0058%)
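
A sketch of the FRR BFD settings these numbers correspond to (300 ms intervals, detect multiplier 3), shown for one maglev control-plane peer; the actual configuration lives under gw/ and l4/bfd/:

bfd
 peer 10.1.7.17
  receive-interval 300
  transmit-interval 300
  detect-multiplier 3
  no shutdown
!
router bgp 64516
 neighbor 10.1.7.17 bfd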

Direct Routing (DR) Mode

What: Real servers respond directly to client, bypassing load balancer

Why:

  • Eliminates return path bottleneck
  • Load balancer only handles inbound (typically 10% of traffic)
  • Real servers handle outbound (typically 90% of traffic)

Requirements:

  • VIP configured locally with a host prefix (/32 or /128) on the dummy vip interface or loopback lo
  • ARP suppression (arp_ignore=1, arp_announce=2)
  • Real server must respond from VIP
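
On a traditional HAProxy node these requirements boil down to roughly the following commands (Ansible applies the persistent equivalents; addresses are the ones from this document):

# ARP suppression so backends do not answer ARP for the VIP
sudo sysctl -w net.ipv4.conf.all.arp_ignore=1
sudo sysctl -w net.ipv4.conf.all.arp_announce=2

# Hold the VIP locally on the dummy interface so the server can reply from it
sudo ip link add vip type dummy 2>/dev/null || true
sudo ip addr add 10.1.7.16/32 dev vip
sudo ip addr add 2a0a:7180:81:7::10/128 dev vip
sudo ip link set vip up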

Stateless Operation

What: The cluster operates without connection tracking at both gateway and maglev layers

Why:

  • Performance: No state table overhead, pure packet forwarding
  • Scalability: No memory limits from connection tracking tables
  • Reliability: No state synchronization issues across ECMP paths
  • Active-Active: Both maglev nodes operate independently without shared state
  • Simplicity: Easier to reason about and debug

Implementation:

  • IPVS conntrack disabled via sysctls: net.ipv4.vs.conntrack=0 and net.ipv6.vs.conntrack=0
  • No state maintained across packets from the same connection
  • Connection tracking modules (nf_conntrack) may still be loaded as kernel dependencies, but tracking is disabled

Trade-off: Features requiring connection state (NAT helpers, connection counting) are unavailable, but not needed for this architecture with Direct Server Return.

Verification:

# Check IPVS conntrack disabled (maglev nodes)
sysctl net.ipv4.vs.conntrack net.ipv6.vs.conntrack
# Should return: net.ipv4.vs.conntrack = 0
#                net.ipv6.vs.conntrack = 0

# Check no connections are being tracked
sudo conntrack -L 2>/dev/null | wc -l
# Should return: 0

Performance Characteristics

Measured Performance

Test: wrk with 50 threads, 100 connections, 120 seconds

Application backend: Mock HTTP server (10.1.7.12:8888) returning minimal responses. Production application backend with scaled app servers is planned for future testing.

Results:

  • Requests/sec: 12,146
  • Total requests: 1,458,849
  • Avg latency: 9.11ms
  • Transfer rate: 1.83 MB/s

Failover test (maglev02 shutdown during test):

  • Failed requests: 85 out of 1,458,849
  • Error rate: 0.0058%
  • Success rate: 99.9942%
  • Failover time: < 1 second (BFD detected)

Note: Performance numbers reflect infrastructure overhead only. Actual application performance will depend on backend application servers.

Capacity

Current configuration:

Layer   Component                   Count   Utilization
L3/4    Gateway                     2       Active/Standby
L4      Maglev                      2       Active-Active ECMP
L7      Traditional HAProxy (mlb)   3       Active-Active
L7      Kubernetes HAProxy (k0s)    3       Active-Active

Redundancy:

  • Dual gateways (mgw01 + mgw02) - BGP failover (currently not load balanced)
  • Dual maglev nodes - Active-active ECMP
  • Multiple L7 backends - Auto-failover per backend group
  • Flexible routing - Different VIP:port combinations route to different backend groups

Performance: Tested at 12K req/s with 99.9942% availability during node failure

Scale out:

  • Add more L7 backends to existing groups → keepalived auto-adds to IPVS pool
  • Add new L7 backend groups → configure via fwmark mappings
  • Add more maglev nodes → update gateway maximum-paths
  • Add third gateway → Additional redundancy (requires ECMP-capable upstream router)
  • Datacenter enhancement: Use router with BGP ECMP to fully utilize both gateways

Quick Start

Prerequisites

All nodes need:

  • Ubuntu 22.04+ or Debian 12+
  • Network connectivity
  • Passwordless sudo access
  • Python 3.x

Deployment Method

Recommended: Ansible Automation

The entire cluster can be deployed using Ansible automation:

cd ansible/

# Install dependencies
just sync

# Verify connectivity
just ping

# Deploy entire cluster (dry run first)
just dry-run

# Deploy for real
just deploy

📖 See ansible/README.md for detailed deployment instructions

Manual Setup (Alternative)

For manual component-by-component setup or troubleshooting:

1. Gateway Setup

cd gw/
# Follow README.md for:
# - FRR installation
# - BGP configuration
# - ECMP sysctl settings
# - BFD setup (optional but recommended)

📖 See gw/README.md for detailed instructions

2. Maglev Node Setup

cd l4/

# keepalived (IPVS management)
cd keepalived/
# Follow README.md

# ExaBGP (BGP announcements)
cd ../exabgp/
# Follow README.md

# BFD (optional, for fast failover)
cd ../bfd/
# Follow README.md

📖 See l4/keepalived/README.md and l4/exabgp/README.md

3. L7 Backend Setup

Choose one or both L7 backend types:

Option A: Traditional HAProxy
cd l7/haproxy/
# Follow README.md for:
# - HAProxy installation
# - VIPs on dummy vip interface
# - ARP suppression
# - Network configuration

📖 See l7/haproxy/README.md for detailed instructions

Option B: Kubernetes + HAProxy Ingress
cd l7/k8s-haproxy/
# Follow README.md for:
# - k0s cluster deployment
# - HAProxy Ingress Controller installation
# - VIPs on loopback (for Calico default IP autodetection)
# - Test application deployment

📖 See l7/k8s-haproxy/README.md for detailed instructions

Note: You can run both backend types simultaneously! Use fwmark mappings to route different VIP:port combinations to different backend groups. See CONFIGURATION.md.

Operational Procedures

Planned Maintenance

Taking Down a Maglev Node (Zero Downtime)

Example: Maintenance on maglev02

# Step 1: Stop BGP announcements (gateway stops sending traffic)
ssh maglev02 "sudo systemctl stop exabgp"

# Step 2: Wait for BFD detection and BGP convergence
sleep 2

# Step 3: Verify traffic shifted to maglev01
ssh mgw01 "sudo vtysh -c 'show ip route 10.1.7.16'"
# Should only show: nexthop via 10.1.7.17

# Step 4: Verify no active IPVS connections on maglev02
ssh maglev02 "sudo ipvsadm -Ln --stats"
# ActiveConn should be 0 for all backends

# Step 5: Stop keepalived (safe now, no traffic)
ssh maglev02 "sudo systemctl stop keepalived"

# Step 6: Perform maintenance
ssh maglev02 "sudo apt update && sudo apt upgrade -y"

# Step 7: Restart services
ssh maglev02 "sudo systemctl start keepalived"
sleep 2
ssh maglev02 "sudo systemctl start exabgp"

# Step 8: Verify back in service
ssh mgw01 "sudo vtysh -c 'show ip route 10.1.7.16'"
# Should show both: via 10.1.7.17 and via 10.1.7.18

# Step 9: Verify ECMP is working
ssh mgw01 "ip route show 10.1.7.16"
# Should show: nexthop via 10.1.7.17 dev eth0 weight 1
#              nexthop via 10.1.7.18 dev eth0 weight 1

Expected impact: Zero errors (graceful drain)

Taking Down a HAProxy Backend (Automatic)

Using hactl (recommended):

For managing HAProxy backends across all nodes in an LB group, use the hactl tool:

# Navigate to hactl directory
cd l7/haproxy/hactl

# List all backends and server status
./hactl list

# Put a server in maintenance mode (graceful)
./hactl maint lb_name/server_name

# Bring server back up
./hactl enable lb_name/server_name

See l7/haproxy/hactl/README.md for full documentation including:

  • Multi-node management (controls all HAProxy nodes simultaneously)
  • Automatic inconsistency detection
  • Health check visibility
  • Clickable stats URLs

Manual Example: Maintenance on mlb02

# Step 1: Stop HAProxy (keepalived will auto-detect)
ssh mlb02 "sudo systemctl stop haproxy"

# Step 2: Verify removed from IPVS pool (both maglev nodes)
ssh maglev01 "sudo ipvsadm -Ln | grep 10.1.10.34"
ssh maglev02 "sudo ipvsadm -Ln | grep 10.1.10.34"
# Should show nothing (backend removed)

# Step 3: Perform maintenance
ssh mlb02 "sudo apt update && sudo apt upgrade -y"

# Step 4: Start HAProxy
ssh mlb02 "sudo systemctl start haproxy"

# Step 5: Verify re-added to IPVS pool
ssh maglev01 "sudo ipvsadm -Ln | grep 10.1.10.34"
# Should show: -> 10.1.10.34:80 Route Weight 1

Expected impact: Minimal (automatic failover to other backends)

Health check frequency: 6 seconds (keepalived)
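
For reference, the keepalived shape behind this automatic behavior looks roughly like the sketch below; the fwmark, real-server address, and check details are illustrative, and l4/keepalived/ is authoritative:

virtual_server fwmark 100 {
    delay_loop 6            # matches the 6-second health check interval
    lb_algo mh              # Maglev hashing
    lb_kind DR              # Direct Routing to the backends
    real_server 10.1.10.34 80 {
        weight 1
        HTTP_GET {
            url {
                path /haproxy-health
                status_code 200
            }
            connect_timeout 3
        }
    }
}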

Taking Down a Gateway (Zero Downtime)

Current setup: Dual gateways (mgw01 and mgw02) - Active/Standby failover

Example: Maintenance on mgw02 (standby)

# Step 1: Verify mgw02 is standby
ssh mikrotik "/routing/bgp/session print where remote-address=10.1.7.5"
# Should show: Established

ssh mikrotik "/ip/route print where dst-address=10.1.7.16/32"
# Should show traffic going via 10.1.7.4 (mgw01 is active)

# Step 2: Stop BGP on mgw02 (no traffic impact - it's standby)
ssh mgw02 "sudo systemctl stop frr"

# Step 3: Perform maintenance on mgw02
ssh mgw02 "sudo apt update && sudo apt upgrade -y"

# Step 4: Restart FRR on mgw02
ssh mgw02 "sudo systemctl start frr"

# Step 5: Verify mgw02 back in service
ssh mikrotik "/routing/bgp/session print where remote-address=10.1.7.5"
# Should show: Established

Expected impact: Zero (mgw02 is standby, no active traffic)

Maintenance on mgw01 (active):

  • Traffic will automatically failover to mgw02 via BGP
  • Failover time: 90-180s (BGP hold time) or < 1s with BFD
  • During failover, traffic continues with minimal impact

Unplanned Failure (Automatic)

Maglev Node Failure

Detection: BFD (< 1s) or BGP (90-180s if no BFD)

Action: Automatic

1. BFD detects failure (300-900ms)
2. Gateway tears down BGP session
3. Gateway removes from ECMP pool
4. All traffic goes to remaining maglev node

Impact: 0.0058% errors (85 out of 1.4M requests in test)

HAProxy Backend Failure

Detection: keepalived health check (6 seconds)

Action: Automatic

1. keepalived HTTP_GET to /haproxy-health fails
2. keepalived removes from IPVS pool on both maglev nodes
3. Traffic distributed to remaining backends

Impact: Minimal (automatic failover within 6-18 seconds)

Gateway Failure

Current setup: Dual gateways provide automatic failover (active/standby)

Detection: BGP (90-180s hold time), faster with BFD if configured on Mikrotik

Action: Automatic

1. mgw01 (active) failure detected by BGP
2. Mikrotik removes failed gateway from routing table
3. Mikrotik routes all traffic via mgw02
4. Traffic continues with no manual intervention

Impact:

  • With BFD on Mikrotik: < 1 second failover
  • Without BFD (BGP only): 90-180 second failover
  • Traffic continues via remaining gateway once failover complete

Note: Currently mgw02 is standby, so its failure has no traffic impact.

Monitoring

Health Checks

Gateway (mgw01/mgw02):

# BGP status
sudo vtysh -c "show bgp summary"
# Should show: 3 neighbors Established (upstream + 2 maglev nodes)

# BFD status
sudo vtysh -c "show bfd peers"
# Should show: 2 peers up (maglev01 and maglev02)

# ECMP routes
ip route show 10.1.7.16
# Should show: 2 nexthops (to maglev nodes)

Maglev nodes:

# ExaBGP status
sudo systemctl status exabgp

# IPVS status
sudo ipvsadm -Ln
# Should show the fwmark virtual services with their backends (e.g., 10.1.10.33/34/45 for the mlb group: 10.1.10.33/34/35)

# keepalived status
sudo systemctl status keepalived

HAProxy backends:

# HAProxy status
sudo systemctl status haproxy

# Health endpoint
curl http://10.1.7.33/haproxy-health
# Should return: 200 OK

Automated Diagnostics with Horizon

Horizon is a comprehensive diagnostic tool for automated cluster health validation using testinfra.

Location: horizon/

Purpose: Automated testing and validation of all cluster components

Key Features:

  • Automatic prerequisite validation - SSH and sudo checks before running tests
  • SSH-based remote testing of all nodes
  • Pretty console output with checkmarks and warnings
  • Per-component test organization
  • Comprehensive coverage of critical cluster components
  • MTU validation
  • Interface error statistics
  • Connection tracking verification (stateless operation)
  • Cluster validation mode (no SSH needed)

Quick Start:

cd horizon
uv sync
horizon                        # Run all diagnostics (auto-checks prerequisites)
horizon --host maglev          # Check only maglev nodes
horizon --cluster-only         # Fast cluster validation (no SSH)
horizon --force                # Skip prerequisite checks

Example Output:

╭────────────────────────────╮
│ Maglev Cluster Diagnostics │
╰────────────────────────────╯

Maglev Nodes
  maglev01
    ✓ ExaBGP Running
    ✓ Keepalived Running
    ✓ Ingress0 Interface Up
    ✓ Ingress0 MTU
    ✓ Ingress0 No Errors
    ✓ VIP Interface Exists
    ✓ IPVS Configuration
    ✓ IPVS Backends Healthy
    ✓ Connection Tracking Disabled

      Summary
╭──────────┬───────╮
│ Status   │ Count │
├──────────┼───────┤
│ ✓ Passed │    15 │
╰──────────┴───────╯

✓ All checks passed!

Documentation: See horizon/README.md for detailed usage and test coverage.

Monitoring Endpoints

HAProxy stats UI (per host):

Metrics to monitor:

Metric                    Command                        Alert Threshold
BGP neighbors (gateway)   vtysh -c "show bgp summary"    < 3 neighbors
BFD peers (gateway)       vtysh -c "show bfd peers"      Any peer down
ECMP paths (gateway)      ip route show 10.1.7.16        < 2 nexthops
IPVS backends (maglev)    ipvsadm -Ln                    < 3 backends
HAProxy health            curl /haproxy-health           != 200

Monitoring Script Example

Create /usr/local/bin/check-lb-health.sh:

#!/bin/bash
# Simple health check for load balancing infrastructure

ERRORS=0

# Check on gateway (mgw01/mgw02)
if hostname | grep -q mgw; then
    # Check BGP neighbors (should be 3: upstream + 2 maglev nodes)
    NEIGHBORS=$(sudo vtysh -c "show bgp summary" | grep -c Established)
    if [ "$NEIGHBORS" -lt 3 ]; then
        echo "ERROR: Only $NEIGHBORS BGP neighbors established (expected 3)"
        ERRORS=$((ERRORS + 1))
    fi

    # Check BFD peers
    BFD_UP=$(sudo vtysh -c "show bfd peers brief" | grep -c " up ")
    if [ "$BFD_UP" -lt 2 ]; then
        echo "ERROR: Only $BFD_UP BFD peers up (expected 2)"
        ERRORS=$((ERRORS + 1))
    fi

    # Check ECMP
    NEXTHOPS=$(ip route show 10.1.7.16 | grep -c "nexthop via")
    if [ "$NEXTHOPS" -lt 2 ]; then
        echo "ERROR: Only $NEXTHOPS ECMP nexthops (expected 2)"
        ERRORS=$((ERRORS + 1))
    fi
fi

# Check on maglev nodes
if hostname | grep -q maglev; then
    # Check ExaBGP
    if ! systemctl is-active --quiet exabgp; then
        echo "ERROR: ExaBGP not running"
        ERRORS=$((ERRORS + 1))
    fi

    # Check keepalived
    if ! systemctl is-active --quiet keepalived; then
        echo "ERROR: keepalived not running"
        ERRORS=$((ERRORS + 1))
    fi

    # Check IPVS backends
    BACKENDS=$(sudo ipvsadm -Ln | grep -c "Route")
    if [ "$BACKENDS" -lt 3 ]; then
        echo "WARNING: Only $BACKENDS IPVS backends (expected 3)"
    fi
fi

# Check on HAProxy backends
if hostname | grep -q mlb; then
    # Check HAProxy
    if ! systemctl is-active --quiet haproxy; then
        echo "ERROR: HAProxy not running"
        ERRORS=$((ERRORS + 1))
    fi

    # Check health endpoint
    MYIP=$(hostname -I | awk '{print $1}')
    if ! curl -sf http://${MYIP}/haproxy-health > /dev/null; then
        echo "ERROR: HAProxy health check failed"
        ERRORS=$((ERRORS + 1))
    fi
fi

if [ $ERRORS -eq 0 ]; then
    echo "OK: All health checks passed"
    exit 0
else
    echo "CRITICAL: $ERRORS errors found"
    exit 2
fi

Install:

sudo chmod +x /usr/local/bin/check-lb-health.sh

# Run from cron
echo "*/5 * * * * /usr/local/bin/check-lb-health.sh" | sudo tee -a /etc/crontab

Troubleshooting

Traffic Not Reaching VIP

# On gateway
sudo vtysh -c "show ip route 10.1.7.16"
# Should show routes to maglev nodes

# Check BGP
sudo vtysh -c "show bgp ipv4 unicast 10.1.7.16/32"
# Should show paths from both maglev nodes

# On maglev node
sudo journalctl -u exabgp | tail
# Should show announce lines such as: "announce route 10.1.7.16/32 next-hop 10.1.10.17"

# Check IPVS
sudo ipvsadm -Ln
# Should show virtual server and backends

Traffic Not Distributed Evenly

# On gateway - check ECMP hash policy
sysctl net.ipv4.fib_multipath_hash_policy
# Should be: 1 (L3+4 hashing)

# On maglev - check mh-port
sudo ipvsadm -Ln --sort
# Should show: Scheduler: mh (NOT rr, lc, etc)

# Verify mh-port is enabled (kernel logs)
sudo journalctl -k | grep mh-port

High Error Rate

# Check if BFD is working
sudo vtysh -c "show bfd peers"
# All peers should be "up"

# Check BGP session flapping
sudo journalctl -u frr | grep -i established | tail -20

# Check for network issues
mtr -r -c 100 10.1.7.17
# Should have < 1% packet loss

Advanced Topics

Scaling

Add more HAProxy backends:

  1. Deploy HAProxy on new host
  2. Configure VIPs on appropriate interface (dummy vip or loopback)
  3. keepalived auto-detects via health checks
  4. Automatically added to IPVS pool

Add more maglev nodes:

  1. Deploy maglev03 (10.1.7.19)
  2. Configure ExaBGP, keepalived, BFD
  3. Update gateway: maximum-paths 3
  4. Automatically joins ECMP pool

Add third gateway (requires ECMP-capable router):

  1. Deploy mgw03 with same config
  2. Add to upstream router BGP peers
  3. Enable BGP multipath/ECMP on upstream router
  4. Increases redundancy and capacity

IPv6

All components support dual-stack (IPv4 + IPv6):

  • VIP IPv6: 2a0a:7180:81:7::10/128
  • Gateway: 2a0a:7180:81:7::4 (mgw01), ::5 (mgw02)
  • Maglev: 2a0a:7180:81:7::11, ::12
  • HAProxy: 2a0a:7180:81:7::21, ::22, ::23

Same configuration applies, just enable ipv6 unicast in BGP.
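
As a sketch, enabling the IPv6 address-family for the maglev peers on a gateway looks like this (addresses and AS are the ones from this document; the exact neighbor statements in gw/ are authoritative):

router bgp 64516
 address-family ipv6 unicast
  neighbor 2a0a:7180:81:7::11 activate
  neighbor 2a0a:7180:81:7::12 activate
  maximum-paths 2
 exit-address-family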

Security Considerations

Network segmentation:

  • Management network: SSH access only
  • Ingress network: Load balancer traffic only
  • Backend network: Application traffic only

Firewall rules:

  • Gateway: Allow BGP (179), BFD (3784)
  • Maglev: Allow BGP (179), BFD (3784), IPVS traffic
  • HAProxy: Allow HTTP/HTTPS (80/443), health checks

BGP security:

  • route-map NO-IN blocks upstream routes (prevents route injection)
  • Route reflector prevents horizontal peering (reduces attack surface)
  • TTL security (optional): neighbor 10.1.7.17 ttl-security hops 1

Limitations (current non-datacenter setup):

  • Mikrotik router: Active/Standby gateway usage (no ECMP)
  • Single upstream router (no redundancy at internet edge)
  • Can be upgraded with datacenter-grade equipment for full ECMP

Performance Tuning

Kernel Parameters

Gateway nodes:

# IP forwarding
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1

# ECMP multipath hashing (L3+4)
net.ipv4.fib_multipath_hash_policy = 1
net.ipv6.fib_multipath_hash_policy = 1

Maglev nodes:

# IP forwarding
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1

# Stateless operation (NO connection tracking)
net.ipv4.vs.conntrack = 0
net.ipv6.vs.conntrack = 0

HAProxy nodes (traditional and k8s):

# Disable rp_filter for Direct Server Return
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0

# Allow binding to non-local addresses (VIPs)
net.ipv4.ip_nonlocal_bind = 1
net.ipv6.ip_nonlocal_bind = 1

Note: These are automatically deployed by Ansible (ansible/playbooks/deploy_sysctl.yml)

HAProxy Tuning

global
    maxconn 100000        # Increase for high traffic
    nbthread 4            # Match CPU cores

IPVS Tuning

Note: This cluster runs IPVS in stateless mode (net.ipv4.vs.conntrack=0) for optimal performance and scalability. Connection tracking-related tunables are not applicable.

For stateless IPVS operation:

  • No connection table overhead
  • No timeout tuning needed
  • No state synchronization required
  • Works perfectly with Direct Server Return (DSR)

References

Documentation

Deployment:

  • Ansible Automation: ansible/README.md - Automated cluster deployment (recommended)
  • Configuration Management: CONFIGURATION.md - Inventory structure, fwmark mappings, node types
  • Horizon Diagnostics: horizon/README.md - Automated cluster health validation

Manual Setup (Component-Specific):

  • Gateway: gw/README.md - FRR, BGP, ECMP
  • BFD: gw/bfd.md - Fast failure detection
  • Maglev IPVS: l4/keepalived/README.md - IPVS configuration
  • ExaBGP: l4/exabgp/README.md - BGP announcements
  • dnsdist + PowerDNS: l7/dnsdist/README.md - DNS load balancing with dnsdist, PowerDNS, LightningStream, and shadow primary
  • HAProxy (Traditional): l7/haproxy/README.md - L7 load balancing with standalone HAProxy
  • HAProxy (Kubernetes): l7/k8s-haproxy/README.md - L7 load balancing with k0s and HAProxy Ingress

External Resources

RFCs

License

This configuration is provided as-is for educational and operational purposes.

DNS Records

; gateways
mgmt02-vm-mgw01                 IN    A         10.1.7.4
mgmt02-vm-mgw02                 IN    A         10.1.7.5
mgmt02-vm-mgw01                 IN    AAAA      2a0a:7180:81:7::4
mgmt02-vm-mgw02                 IN    AAAA      2a0a:7180:81:7::5
ingress0.mgmt02-vm-mgw01        IN    A         10.1.10.4
ingress0.mgmt02-vm-mgw02        IN    A         10.1.10.5
ingress0.mgmt02-vm-mgw01        IN    AAAA      2a0a:7180:81:a::4
ingress0.mgmt02-vm-mgw02        IN    AAAA      2a0a:7180:81:a::5

; maglev
mgmt02-vm-maglev01              IN    A         10.1.7.17
mgmt02-vm-maglev02              IN    A         10.1.7.18
mgmt02-vm-maglev01              IN    AAAA      2a0a:7180:81:7::11
mgmt02-vm-maglev02              IN    AAAA      2a0a:7180:81:7::12
ingress0.mgmt02-vm-maglev01     IN    A         10.1.10.17
ingress0.mgmt02-vm-maglev02     IN    A         10.1.10.18
ingress0.mgmt02-vm-maglev01     IN    AAAA      2a0a:7180:81:a::11
ingress0.mgmt02-vm-maglev02     IN    AAAA      2a0a:7180:81:a::12

;
; mlb
;

; mlb mgmt
mgmt02-vm-mlb01                 IN    A         10.1.7.33
mgmt02-vm-mlb02                 IN    A         10.1.7.34
mgmt02-vm-mlb03                 IN    A         10.1.7.35
mgmt02-vm-mlb01                 IN    AAAA      2a0a:7180:81:7::21
mgmt02-vm-mlb02                 IN    AAAA      2a0a:7180:81:7::22
mgmt02-vm-mlb03                 IN    AAAA      2a0a:7180:81:7::23
; mlb ingress
ingress0.mgmt02-vm-mlb01        IN    A         10.1.10.33
ingress0.mgmt02-vm-mlb02        IN    A         10.1.10.34
ingress0.mgmt02-vm-mlb03        IN    A         10.1.10.35
ingress0.mgmt02-vm-mlb01        IN    AAAA      2a0a:7180:81:a::21
ingress0.mgmt02-vm-mlb02        IN    AAAA      2a0a:7180:81:a::22
ingress0.mgmt02-vm-mlb03        IN    AAAA      2a0a:7180:81:a::23

;
; k0s
;

; k0s cp
mgmt02-vm-k0s00                 IN    A         10.1.7.60
mgmt02-vm-k0s00                 IN    AAAA      2a0a:7180:81:7::3c
; edge k0s workers mgmt
mgmt02-vm-k0s01                 IN    A         10.1.7.61
mgmt02-vm-k0s02                 IN    A         10.1.7.62
mgmt02-vm-k0s03                 IN    A         10.1.7.63
mgmt02-vm-k0s01                 IN    AAAA      2a0a:7180:81:7::3d
mgmt02-vm-k0s02                 IN    AAAA      2a0a:7180:81:7::3e
mgmt02-vm-k0s03                 IN    AAAA      2a0a:7180:81:7::3f
; edge k0s workers ingress
ingress0.mgmt02-vm-k0s01        IN    A         10.1.10.61
ingress0.mgmt02-vm-k0s02        IN    A         10.1.10.62
ingress0.mgmt02-vm-k0s03        IN    A         10.1.10.63
ingress0.mgmt02-vm-k0s01        IN    AAAA      2a0a:7180:81:a::3d
ingress0.mgmt02-vm-k0s02        IN    AAAA      2a0a:7180:81:a::3e
ingress0.mgmt02-vm-k0s03        IN    AAAA      2a0a:7180:81:a::3f
; other k0s workers mgmt
mgmt02-vm-k0s04                 IN    A         10.1.7.64
mgmt02-vm-k0s05                 IN    A         10.1.7.65
mgmt02-vm-k0s06                 IN    A         10.1.7.66
mgmt02-vm-k0s04                 IN    AAAA      2a0a:7180:81:7::40
mgmt02-vm-k0s05                 IN    AAAA      2a0a:7180:81:7::41
mgmt02-vm-k0s06                 IN    AAAA      2a0a:7180:81:7::42
; other k0s workers ingress
ingress0.mgmt02-vm-k0s04        IN    A         10.1.10.64
ingress0.mgmt02-vm-k0s05        IN    A         10.1.10.65
ingress0.mgmt02-vm-k0s06        IN    A         10.1.10.66
ingress0.mgmt02-vm-k0s04        IN    AAAA      2a0a:7180:81:a::40
ingress0.mgmt02-vm-k0s05        IN    AAAA      2a0a:7180:81:a::41
ingress0.mgmt02-vm-k0s06        IN    AAAA      2a0a:7180:81:a::42

; dns
; shadow primary
mgmt02-vm-dns00                 IN    A         10.1.7.40
mgmt02-vm-dns00                 IN    AAAA      2a0a:7180:81:7::28

; dnsdist
; dnsdist mgmt
mgmt02-vm-dnsdist01             IN    A         10.1.7.41
mgmt02-vm-dnsdist02             IN    A         10.1.7.42
mgmt02-vm-dnsdist01             IN    AAAA      2a0a:7180:81:7::29
mgmt02-vm-dnsdist02             IN    AAAA      2a0a:7180:81:7::2a
; dnsdist ingress
ingress0.mgmt02-vm-dnsdist01    IN    A         10.1.10.41
ingress0.mgmt02-vm-dnsdist02    IN    A         10.1.10.42
ingress0.mgmt02-vm-dnsdist01    IN    AAAA      2a0a:7180:81:a::29
ingress0.mgmt02-vm-dnsdist02    IN    AAAA      2a0a:7180:81:a::2a

; dns
; dns mgmt
mgmt02-vm-dns01                 IN    A         10.1.7.43
mgmt02-vm-dns02                 IN    A         10.1.7.44
mgmt02-vm-dns03                 IN    A         10.1.7.45
mgmt02-vm-dns01                 IN    AAAA      2a0a:7180:81:7::2b
mgmt02-vm-dns02                 IN    AAAA      2a0a:7180:81:7::2c
mgmt02-vm-dns03                 IN    AAAA      2a0a:7180:81:7::2d
; dns ingress
ingress0.mgmt02-vm-dns01        IN    A         10.1.10.43
ingress0.mgmt02-vm-dns02        IN    A         10.1.10.44
ingress0.mgmt02-vm-dns03        IN    A         10.1.10.45
ingress0.mgmt02-vm-dns01        IN    AAAA      2a0a:7180:81:a::2b
ingress0.mgmt02-vm-dns02        IN    AAAA      2a0a:7180:81:a::2c
ingress0.mgmt02-vm-dns03        IN    AAAA      2a0a:7180:81:a::2d