phase1_telemetry/telemetry_pipeline.py
Phase 1 — The Telemetry & Observability Foundation
Phase 1 lays the groundwork for AI-enabled network operations by providing real-time monitoring across data silos. It simulates gNMI streaming telemetry from BGP routers configured with OpenConfig YANG models, exporting metrics on network statistics, routing updates, CPU usage and protocol health. A BMP collector listens for BGP route advertisements, withdrawals and peer state changes. All telemetry is streamed through a Kafka-style event bus and stored in an InfluxDB-style time-series database for later analysis and troubleshooting. Feature extraction modules convert the raw telemetry into low-dimensional, AI-ready datasets that feed ML tasks such as anomaly detection, predictive analytics, root-cause analysis and autonomous remediation. In production, the simulated agents are replaced by real technologies such as pygnmi / gnmi-py, GoBMP, Confluent Kafka and InfluxDB clients to provide scalable and reliable network observability.
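The feature extraction step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the article's pipeline: the `TelemetrySample` record and the `extract_features` helper are hypothetical, and only the output keys (`avg_latency_ms`, `avg_packet_loss_pct`, `peer_down_count`) are taken from what the Phase 3 agent code reads.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class TelemetrySample:
    """One hypothetical per-peer measurement read from the event bus."""
    router_id: str
    isp: str
    latency_ms: float
    packet_loss_pct: float   # expressed as a fraction, e.g. 0.005 == 0.5 %
    peer_established: bool


def extract_features(samples: List[TelemetrySample]) -> Dict[str, float]:
    """Collapse raw samples into a small, model-ready feature dictionary."""
    if not samples:
        return {"avg_latency_ms": 0.0, "avg_packet_loss_pct": 0.0, "peer_down_count": 0.0}
    return {
        "avg_latency_ms": round(mean(s.latency_ms for s in samples), 3),
        "avg_packet_loss_pct": round(mean(s.packet_loss_pct for s in samples), 6),
        "peer_down_count": float(sum(1 for s in samples if not s.peer_established)),
    }


samples = [
    TelemetrySample("R1", "ISP-A", 12.0, 0.001, True),
    TelemetrySample("R1", "ISP-B", 48.0, 0.020, False),
    TelemetrySample("R2", "ISP-C", 15.0, 0.003, True),
]
features = extract_features(samples)
print(features)
```

Keeping the output a flat dictionary of floats means the same structure can be handed to a heuristic scorer today and to a trained model later without changing the interface.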
Phase 2 (Digital Twin & Simulation Sandbox) creates a virtual copy of the production network so that service providers can test routing and configuration changes on a replica before deployment. The BGP infrastructure is modelled as a directed graph with a network graph toolkit, where routers and peers are nodes and routing paths are edges. Like Batfish, the Digital Twin lets engineers or AI agents run pre-flight simulations of configuration changes, policy manipulation, route advertisements and failover scenarios without impacting the live network. The platform continuously evaluates the simulated topology for routing loops, asymmetric paths, blackholes and policy conflicts before any change is pushed to production. It also calculates blast radius, an estimate of the number of devices, prefixes, customers and services affected by a proposed change. SLA impact prediction estimates whether latency, packet loss, convergence time or path stability would violate SLA rules once the change is implemented. In a production-grade deployment, the simulated Digital Twin is replaced by Batfish itself, accessed through the Batfish client, with network topology and device configurations synchronized to Batfish from production NMS systems through REST APIs or NETCONF. Reconciling the simulated topology against the live network every night flags configuration drift, undocumented changes and operational inconsistencies, which keeps the Digital Twin accurate and up to date as a reference for AI decision making and autonomous operations.
Abstract:
For enterprise and service provider production networks, BGP path manipulation is among the most important functions because it directly controls how traffic enters, leaves and flows through the network infrastructure. BGP attributes such as Local Preference, AS Path Prepending, MED (Multi-Exit Discriminator), Weight and Community values influence routing decisions for traffic engineering, redundancy, load balancing, latency-aware path selection and disaster recovery. In large production environments, wrong BGP path selection can create routing loops, asymmetric routing, congestion and packet loss, degrading service levels and leading to SLA violations or even complete outages that affect customers and business-critical applications. As a result, network engineers must evaluate every policy change carefully before applying it anywhere. This is where Agentic AI delivers the biggest speedup: it constantly analyses real-time telemetry, BGP updates, historical incidents and network topology data to make intelligent routing decisions automatically or semi-autonomously. Using Digital Twin simulation and AI insights, the agent predicts the effects of BGP policy changes before applying them, detects anomalous route behaviour and unstable peers, and proposes routing paths with maximum stability and minimum risk. This allows dynamic adjustment of Local Preference, intelligent traffic rerouting after congestion is detected, route leak prevention and a smaller blast radius in case of failure, all while continuing to meet SLAs. By combining observability, simulation and autonomous decision-making, Agentic AI turns legacy reactive BGP operations into proactive, self-healing and highly resilient network management.
Scenario: Brayan is working as a Network Optimization Engineer at Wipro Telecom in an enterprise network environment. Based on a customer requirement, BGP path manipulation needs to be implemented to optimize traffic flow across the network. Currently, one network path is experiencing continuously increasing traffic utilization, while another segment is impacted due to an OFC (Optical Fiber Cable) cut, resulting in congestion, instability, and potential SLA degradation. To provide a better resolution, the enterprise network must intelligently reroute traffic by modifying BGP attributes such as Local Preference, AS Path Prepending, MED, or Community values to balance traffic and ensure high availability. Traditionally, this process requires manual analysis and configuration by network engineers, which can increase response time during critical incidents. However, with Agentic AI, the system can automatically analyse real-time telemetry, detect congestion and link failures, simulate the impact of routing changes using a Digital Twin topology, and autonomously perform optimized BGP path manipulation. This enables faster convergence, reduced downtime, improved traffic engineering, SLA protection, and intelligent self-healing operations. Let us understand this concept using a simple network topology example.
Here R1, R2 = Edge routers of the network
R3 = Route reflector of the network
R4, R5 = DC spine devices
ISP-A = Jio
ISP-B = Idea
ISP-C = Vodafone
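To make the attribute ordering concrete, here is a simplified sketch of how a router compares candidate paths. It assumes only three tie-breakers (highest Local Preference, then shortest AS path, then lowest MED) and ignores the rest of the real decision process, including the rule that MED is normally compared only between paths learned from the same neighbouring AS. The `Path` record, the `best_path` helper and the AS numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Path:
    isp: str
    local_pref: int
    as_path: Tuple[int, ...]
    med: int


def best_path(paths: List[Path]) -> Path:
    # Higher LOCAL_PREF wins, then the shorter AS_PATH, then the lower MED.
    return min(paths, key=lambda p: (-p.local_pref, len(p.as_path), p.med))


paths = [
    Path("ISP-A", 100, (64501, 64510), 10),
    Path("ISP-B", 100, (64502, 64510), 20),
    Path("ISP-C", 100, (64503, 64520, 64510), 5),
]
default_choice = best_path(paths).isp

# Raising LOCAL_PREF on ISP-C overrides its longer AS path.
steered = [Path(p.isp, 150, p.as_path, p.med) if p.isp == "ISP-C" else p for p in paths]
steered_choice = best_path(steered).isp

print(default_choice, steered_choice)
```

With equal Local Preference, ISP-A wins on MED after tying with ISP-B on AS path length. Once ISP-C's Local Preference is raised to 150, it is selected even though its AS path is longer, which is why Local Preference is the usual lever for outbound steering.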
Phase 2 — Digital Twin & Simulation sandbox
This phase builds a virtual copy of the production network for simulating routing and configuration changes prior to deployment. The BGP infrastructure is modelled as a directed graph, where routers and peers are nodes and routing paths (connections between routers) are edges. As with Batfish, the Digital Twin can simulate configuration modifications, policy updates, route advertisements and failover scenarios without affecting the production environment. It continuously checks for routing loops, blackholes, asymmetric paths and policy conflicts, while calculating blast radius and predicting SLA impact (latency, packet loss, convergence time and path stability). In production environments, the simulated platform is replaced by Batfish, which is synchronized with live network topology and configurations from NMS systems through REST APIs or NETCONF. Continuously comparing the Digital Twin against the production network identifies configuration drift, undocumented changes and operational discrepancies, so that AI-informed decisions and automated operations rest on accurate data.
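A minimal sketch of the graph model and a blast-radius estimate, using only the standard library. The edge list is an assumed reading of the example topology (the article does not list the links), and the definition of blast radius used here, the fraction of internal routers that lose reachability to at least one ISP when a link is removed, is one possible simplification rather than the platform's actual metric.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]

# Hypothetical directed links of the twin (source -> destination).
EDGES: List[Edge] = [
    ("R4", "R1"), ("R4", "R2"), ("R5", "R2"),
    ("R3", "R1"), ("R3", "R2"),
    ("R1", "ISP-A"), ("R1", "ISP-B"), ("R2", "ISP-C"),
]
ROUTERS = ["R1", "R2", "R3", "R4", "R5"]


def build_adjacency(edges: List[Edge]) -> Dict[str, Set[str]]:
    adj: Dict[str, Set[str]] = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
    return adj


def reachable_isps(adj: Dict[str, Set[str]], start: str) -> FrozenSet[str]:
    """Breadth-first search; returns the ISP nodes reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return frozenset(n for n in seen if n.startswith("ISP-"))


def blast_radius(edges: List[Edge], removed: Edge) -> float:
    """Fraction of routers that lose at least one ISP if `removed` fails."""
    before = build_adjacency(edges)
    after = build_adjacency([e for e in edges if e != removed])
    affected = [r for r in ROUTERS
                if reachable_isps(after, r) < reachable_isps(before, r)]
    return len(affected) / len(ROUTERS)


print(blast_radius(EDGES, ("R2", "ISP-C")))  # uplink failure
print(blast_radius(EDGES, ("R4", "R1")))     # internal link failure
```

In this toy topology the single ISP-C uplink is a much riskier change target than an internal spine link, which is the kind of ranking a pre-flight check would surface before a change is approved.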
phase2_digital_twin/digital_twin.py
Phase 3 — AI Model Suite
This phase is the intelligence layer of the autonomous network system: AI models analyse the behaviour of each network component, make routing decisions and provide operational explainability. The Perception Module monitors BGP paths, continuously reports on their quality and detects anomalies and instability in the underlying network; it stands in for GNN/LSTM models by applying heuristic scoring and Isolation Forest-style logic to telemetry data and routing state. The RL Decision Agent is a multi-objective reinforcement learning engine that selects the best BGP optimization action, such as Local Preference adjustment, AS Path Prepending, failover or rerouting, for the current network conditions and SLA requirements. The LLM Orchestrator works with Large Language Models through the Claude API to translate operator intent into network actions and to turn the AI's decisions into human-readable explanations. In production, these simulated components can be upgraded to trained GNN frameworks such as PyG or DGL, PPO-based reinforcement learning models from stable-baselines3, and LLM tool-use pipelines with Retrieval-Augmented Generation (RAG) for fully autonomous and explainable network operations.
phase3_ai_models/ai_agent.py
import os, sys, time, random, json
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Tuple
from common.models import (
    AgentAction, ActionType, AutonomyTier, NetworkStateSnapshot
)
from common.logger import get_logger
log = get_logger("Phase3-AIModels", phase=3)
# ─────────────────────────────────────────────────────────────────────────────
# 3.1 Perception Module — anomaly detection + path quality scoring
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class PerceptionOutput:
    anomaly_score: float                      # 0.0 = normal, 1.0 = severe anomaly
    path_quality_scores: Dict[str, float]     # isp → quality (higher = better)
    degraded_isps: List[str]
    flapping_detected: bool
    congestion_detected: bool
    recommended_action_types: List[ActionType]
class PerceptionModule:
    """
    Network state perception and anomaly detection.
    In production this would be a trained Graph Neural Network or LSTM
    processing the RIB + telemetry time-series. Here we implement the
    same interface with hand-crafted heuristics + Isolation Forest logic.
    Production upgrade:
        import torch
        from torch_geometric.nn import GCNConv
        # Load pre-trained GNN checkpoint and run inference
    """
    # Thresholds that define 'normal' behaviour
    LATENCY_THRESHOLD_MS = 30.0
    LOSS_THRESHOLD_PCT = 0.005   # 0.5 %
    PEER_DOWN_THRESHOLD = 1
    ANOMALY_HIGH = 0.6
    ANOMALY_CRITICAL = 0.85

    def __init__(self):
        self._history: List[float] = []   # sliding window of anomaly scores

    def perceive(self, snapshot: NetworkStateSnapshot, features: Dict[str, float]) -> PerceptionOutput:
        """Analyse network state and return perception output."""
        # ── path quality per ISP ──────────────────────────────────
        isp_quality = {}
        for router in snapshot.routers:
            for isp, lat in router.latency_ms.items():
                loss = router.packet_loss_percent.get(isp, 0)
                # Quality score 0→1 (higher = better path)
                q = max(0.0, 1.0 - (lat / 100.0) - (loss * 20))
                if isp not in isp_quality:
                    isp_quality[isp] = []
                isp_quality[isp].append(q)
        path_quality = {isp: round(sum(qs)/len(qs), 3) for isp, qs in isp_quality.items()}
        # ── anomaly scoring ───────────────────────────────────────
        anomaly = 0.0
        anomaly += min(0.4, features.get("avg_latency_ms", 0) / 75.0)
        anomaly += min(0.3, features.get("avg_packet_loss_pct", 0) * 60)
        anomaly += min(0.3, features.get("peer_down_count", 0) * 0.15)
        anomaly = min(1.0, round(anomaly, 3))
        # Keep only the 20 most recent scores
        self._history.append(anomaly)
        if len(self._history) > 20:
            self._history.pop(0)
        # ── classify issues ───────────────────────────────────────
        degraded = [isp for isp, q in path_quality.items() if q < 0.5]
        flapping = features.get("peer_down_count", 0) > 2
        congestion = any(
            u > 85.0 for r in snapshot.routers
            for u in r.interface_utilisation.values()
        )
        # ── recommend action types ────────────────────────────────
        recommendations: List[ActionType] = []
        if degraded:
            recommendations.append(ActionType.LOCAL_PREF_ADJUST)
        if flapping:
            recommendations.append(ActionType.AS_PATH_PREPEND)
        if congestion:
            recommendations.append(ActionType.MED_ADJUST)
        if anomaly > self.ANOMALY_CRITICAL:
            recommendations.append(ActionType.ISP_FAILOVER)
        if not recommendations:
            recommendations.append(ActionType.NO_ACTION)
        out = PerceptionOutput(
            anomaly_score = anomaly,
            path_quality_scores = path_quality,
            degraded_isps = degraded,
            flapping_detected = flapping,
            congestion_detected = congestion,
            recommended_action_types = recommendations,
        )
        log.info(f"[Perception] anomaly={anomaly:.3f} degraded={degraded} "
                 f"recommend={[a.value for a in recommendations]}")
        return out
# ─────────────────────────────────────────────────────────────────────────────
# 3.2 RL Decision Agent (PPO-style policy)
# ─────────────────────────────────────────────────────────────────────────────
@dataclass
class RLState:
    """Feature vector passed to the RL policy."""
    avg_latency_ms: float
    avg_loss_pct: float
    peer_down_count: float
    anomaly_score: float
    isp_a_quality: float
    isp_b_quality: float
    isp_c_quality: float
    congestion: float   # 0 or 1
class RLDecisionAgent:
    """
    Reinforcement Learning decision agent.
    Selects the optimal BGP manipulation action given the current network state.
    Implements a heuristic policy that mimics a trained PPO agent.
    Production upgrade:
        from stable_baselines3 import PPO
        model = PPO.load("bgp_rl_policy.zip")
        obs = np.array([state.avg_latency_ms, ...])
        action, _ = model.predict(obs, deterministic=True)
    Reward function (multi-objective):
        R = -w1 * Δlatency - w2 * Δloss - w3 * Δcost + w4 * Δresilience
    """
    # Action type → (autonomy_tier, confidence_base, blast_radius_base)
    ACTION_PROFILES = {
        ActionType.LOCAL_PREF_ADJUST: (AutonomyTier.TIER_1_AUTO, 0.90, 0.10),
        ActionType.MED_ADJUST: (AutonomyTier.TIER_1_AUTO, 0.88, 0.08),
        ActionType.COMMUNITY_TAG: (AutonomyTier.TIER_1_AUTO, 0.95, 0.05),
        ActionType.AS_PATH_PREPEND: (AutonomyTier.TIER_2_APPROVE, 0.75, 0.35),
        ActionType.NEXT_HOP_CHANGE: (AutonomyTier.TIER_2_APPROVE, 0.70, 0.30),
        ActionType.PEER_ACTIVATE: (AutonomyTier.TIER_2_APPROVE, 0.72, 0.40),
        ActionType.PEER_DEACTIVATE: (AutonomyTier.TIER_3_MANUAL, 0.60, 0.60),
        ActionType.PREFIX_FILTER: (AutonomyTier.TIER_2_APPROVE, 0.80, 0.20),
        ActionType.ISP_FAILOVER: (AutonomyTier.TIER_3_MANUAL, 0.55, 0.80),
        ActionType.FULL_REROUTE: (AutonomyTier.TIER_3_MANUAL, 0.50, 0.95),
        ActionType.DEPEER: (AutonomyTier.TIER_3_MANUAL, 0.45, 0.90),
        ActionType.NO_ACTION: (AutonomyTier.TIER_1_AUTO, 1.00, 0.00),
    }

    def __init__(self):
        self._episode_count = 0
        self._reward_history: List[float] = []

    def select_action(self,
                      perception: PerceptionOutput,
                      snapshot: NetworkStateSnapshot,
                      features: Dict[str, float]) -> AgentAction:
        """
        Core RL policy: select best action given current state.
        """
        self._episode_count += 1
        # ── build RL state ────────────────────────────────────────
        pq = perception.path_quality_scores
        rl_state = RLState(
            avg_latency_ms = features.get("avg_latency_ms", 10.0),
            avg_loss_pct = features.get("avg_packet_loss_pct", 0.001),
            peer_down_count = features.get("peer_down_count", 0.0),
            anomaly_score = perception.anomaly_score,
            isp_a_quality = pq.get("ISP-A", 0.8),
            isp_b_quality = pq.get("ISP-B", 0.8),
            isp_c_quality = pq.get("ISP-C", 0.8),
            congestion = float(perception.congestion_detected),
        )
        # ── heuristic policy (replaces trained NN in production) ──
        chosen_type = self._policy(rl_state, perception)
        tier, conf_base, blast_base = self.ACTION_PROFILES[chosen_type]
        # Adjust confidence by anomaly severity. The Gaussian noise can push
        # either score below zero, so both are clamped at 0.0 as well.
        confidence = round(max(0.0, min(0.99, conf_base - perception.anomaly_score * 0.2 + random.gauss(0, 0.03))), 3)
        blast = round(max(0.0, min(1.0, blast_base + perception.anomaly_score * 0.1 + random.gauss(0, 0.02))), 3)
        # Choose target router (the one with the most non-established peers)
        target_router = self._choose_router(snapshot, perception)
        # Build parameters
        params = self._build_params(chosen_type, rl_state, perception)
        action = AgentAction(
            action_type = chosen_type,
            target_router = target_router,
            target_peer = perception.degraded_isps[0] if perception.degraded_isps else "ISP-A",
            parameters = params,
            confidence = confidence,
            blast_radius_score = blast,
            autonomy_tier = tier,
            rationale = self._build_rationale(chosen_type, rl_state, perception),
        )
        log.info(f"[RL-Agent] Action selected: {chosen_type.value} "
                 f"conf={confidence:.2f} tier=T{tier.value} blast={blast:.2f}")
        return action

    def _policy(self, state: RLState, perception: PerceptionOutput) -> ActionType:
        """
        Heuristic policy — maps state → action type.
        In production this is the NN forward pass from a trained PPO model.
        """
        # Rules are checked in priority order; the first match wins.
        if state.anomaly_score > 0.85:
            return ActionType.ISP_FAILOVER
        if state.peer_down_count >= 2:
            return ActionType.PEER_ACTIVATE
        if state.avg_loss_pct > 0.01:
            return ActionType.AS_PATH_PREPEND
        if state.congestion > 0.5:
            return ActionType.MED_ADJUST
        if perception.degraded_isps:
            return ActionType.LOCAL_PREF_ADJUST
        if state.avg_latency_ms > 25:
            return ActionType.NEXT_HOP_CHANGE
        return ActionType.NO_ACTION

    def _choose_router(self, snapshot: NetworkStateSnapshot,
                       perception: PerceptionOutput) -> str:
        if snapshot.routers:
            # Pick router with most peer issues
            worst = max(snapshot.routers,
                        key=lambda r: sum(1 for p in r.bgp_peers
                                          if p.state.value != "ESTABLISHED"))
            return worst.router_id
        return "R1"

    def _build_params(self, action_type: ActionType, state: RLState,
                      perception: PerceptionOutput) -> dict:
        if action_type == ActionType.LOCAL_PREF_ADJUST:
            best_isp = max(perception.path_quality_scores,
                           key=perception.path_quality_scores.get, default="ISP-A")
            return {"new_local_pref": 150, "target_isp": best_isp}
        if action_type == ActionType.MED_ADJUST:
            return {"new_med": 50, "direction": "decrease"}
        if action_type == ActionType.AS_PATH_PREPEND:
            return {"prepend_count": 2, "target_peer": "ISP-B"}
        if action_type == ActionType.ISP_FAILOVER:
            best_isp = max(perception.path_quality_scores,
                           key=perception.path_quality_scores.get, default="ISP-A")
            return {"failover_to": best_isp}
        return {}

    def _build_rationale(self, action_type: ActionType, state: RLState,
                         perception: PerceptionOutput) -> str:
        reasons = {
            ActionType.LOCAL_PREF_ADJUST: (
                f"Path quality degraded on {perception.degraded_isps}. "
                f"Increasing LOCAL_PREF to steer traffic toward better performing ISP."
            ),
            ActionType.MED_ADJUST: (
                f"Interface congestion detected (>85%). "
                f"Reducing MED to attract traffic from alternate entry points."
            ),
            ActionType.AS_PATH_PREPEND: (
                f"Packet loss {state.avg_loss_pct*100:.2f}% exceeds SLA. "
                f"AS_PATH prepend on degraded peer to shift traffic."
            ),
            ActionType.ISP_FAILOVER: (
                f"Critical anomaly score {state.anomaly_score:.2f}. "
                f"Triggering ISP failover to preserve SLA."
            ),
            ActionType.PEER_ACTIVATE: (
                f"{int(state.peer_down_count)} peers down. "
                f"Attempting peer reactivation to restore redundancy."
            ),
            ActionType.NEXT_HOP_CHANGE: (
                f"Average latency {state.avg_latency_ms:.1f}ms exceeds target. "
                f"Changing next-hop to lower latency path."
            ),
            ActionType.NO_ACTION: "Network operating within SLA parameters. No action required.",
        }
        return reasons.get(action_type, "Policy-driven action selection.")

    def record_reward(self, reward: float):
        # Keep a bounded history of the 1000 most recent rewards
        self._reward_history.append(reward)
        if len(self._reward_history) > 1000:
            self._reward_history.pop(0)

    @property
    def avg_reward(self) -> float:
        if not self._reward_history:
            return 0.0
        return round(sum(self._reward_history) / len(self._reward_history), 4)
# ─────────────────────────────────────────────────────────────────────────────
# 3.3 LLM Orchestration Layer (Claude API)
# ─────────────────────────────────────────────────────────────────────────────
class LLMOrchestrator:
    """
    Uses Claude API for:
      • Natural-language intent → structured AgentAction translation
      • Plain-English explanation of every agent decision
      • Root-cause synthesis from telemetry
      • Operator Q&A about network state
    Falls back to template-based explanations if no API key is set.
    Set ANTHROPIC_API_KEY env variable to enable real Claude calls.
    """
    SYSTEM_PROMPT = (
        "You are an autonomous BGP network management AI assistant.\n"
        "You analyse enterprise BGP telemetry and explain network routing decisions\n"
        "in clear, concise language suitable for NOC engineers.\n"
        "Always be precise about BGP attributes (LOCAL_PREF, MED, AS_PATH, communities).\n"
        "Keep explanations under 3 sentences unless asked for more detail."
    )

    def __init__(self):
        self._api_key = os.environ.get("ANTHROPIC_API_KEY", "")
        self._use_real_api = bool(self._api_key)
        if self._use_real_api:
            try:
                import anthropic
                self._client = anthropic.Anthropic(api_key=self._api_key)
                log.info("[LLM] Claude API connected (real mode)")
            except ImportError:
                self._use_real_api = False
                log.warning("[LLM] anthropic package not installed — using template mode")
        else:
            log.info("[LLM] No ANTHROPIC_API_KEY — using template explanations")

    def explain_action(self, action: AgentAction, simulation_result=None) -> str:
        """Generate a plain-English explanation of the proposed action."""
        sim_summary = ""
        if simulation_result:
            sim_summary = (
                f"Pre-flight simulation: blast_radius={simulation_result.blast_radius:.0%}, "
                f"Δlatency={simulation_result.predicted_latency_delta_ms:+.1f}ms, "
                f"Δloss={simulation_result.predicted_loss_delta_pct*100:+.3f}%, "
                f"passed={simulation_result.passed}"
            )
        prompt = (
            f"Explain this BGP action decision to a NOC engineer in 2-3 sentences:\n"
            f"Action: {action.action_type.value}\n"
            f"Router: {action.target_router}\n"
            f"Parameters: {json.dumps(action.parameters)}\n"
            f"Confidence: {action.confidence:.0%}\n"
            f"Agent rationale: {action.rationale}\n"
            f"Autonomy tier: T{action.autonomy_tier.value}\n"
            f"{sim_summary}"
        )
        if self._use_real_api:
            return self._call_claude(prompt)
        else:
            return self._template_explanation(action, simulation_result)

    def translate_intent(self, operator_intent: str, snapshot: NetworkStateSnapshot) -> Dict:
        """Translate natural-language operator intent into structured action parameters."""
        prompt = (
            f"Translate this operator intent into a JSON BGP action:\n"
            f"Intent: '{operator_intent}'\n"
            f"Current state: anomaly={snapshot.anomaly_score:.2f}, "
            f"peers_down={snapshot.peer_down_count}\n"
            f"Return only valid JSON with keys: action_type, target_router, parameters"
        )
        if self._use_real_api:
            raw = self._call_claude(prompt)
            # Extract the outermost {...} span; if the reply has no JSON or it
            # does not parse, fall through to the template fallback below.
            start = raw.find("{")
            end = raw.rfind("}") + 1
            if start >= 0 and end > start:
                try:
                    return json.loads(raw[start:end])
                except json.JSONDecodeError:
                    pass
        # Template fallback
        intent_lower = operator_intent.lower()
        if "failover" in intent_lower or "fail" in intent_lower:
            return {"action_type": "ISP_FAILOVER", "target_router": "R1", "parameters": {"failover_to": "ISP-B"}}
        if "maintenance" in intent_lower or "prepend" in intent_lower:
            return {"action_type": "AS_PATH_PREPEND", "target_router": "R2", "parameters": {"prepend_count": 3}}
        if "prefer" in intent_lower or "local_pref" in intent_lower:
            return {"action_type": "LOCAL_PREF_ADJUST", "target_router": "R1", "parameters": {"new_local_pref": 200}}
        return {"action_type": "NO_ACTION", "target_router": "R1", "parameters": {}}

    def synthesise_root_cause(self, snapshot: NetworkStateSnapshot,
                              features: Dict[str, float]) -> str:
        """Generate a root-cause summary of current network issues."""
        prompt = (
            f"Summarise the root cause of current BGP network issues:\n"
            f"Anomaly score: {snapshot.anomaly_score:.3f}\n"
            f"Avg latency: {features.get('avg_latency_ms',0):.1f}ms\n"
            f"Avg loss: {features.get('avg_packet_loss_pct',0)*100:.3f}%\n"
            f"Peers down: {snapshot.peer_down_count}\n"
            f"Total prefixes: {snapshot.total_prefixes}\n"
            f"Respond in 2 sentences."
        )
        if self._use_real_api:
            return self._call_claude(prompt)
        return self._template_root_cause(snapshot, features)

    def _call_claude(self, prompt: str) -> str:
        """Make a real Anthropic API call using the client created in __init__."""
        try:
            msg = self._client.messages.create(
                model = "claude-sonnet-4-20250514",
                max_tokens = 300,
                system = self.SYSTEM_PROMPT,
                messages = [{"role": "user", "content": prompt}],
            )
            return msg.content[0].text
        except Exception as e:
            log.warning(f"[LLM] API call failed: {e}; returning placeholder text")
            return "[LLM API unavailable]"

    def _template_explanation(self, action: AgentAction, sim=None) -> str:
        templates = {
            ActionType.LOCAL_PREF_ADJUST: (
                f"The agent is raising LOCAL_PREF to {action.parameters.get('new_local_pref',150)} "
                f"on {action.target_router} to steer traffic toward the higher-quality ISP path. "
                f"This change has confidence {action.confidence:.0%} and low blast radius — "
                f"it will apply autonomously within 30 seconds."
            ),
            ActionType.AS_PATH_PREPEND: (
                f"AS_PATH prepending by {action.parameters.get('prepend_count',2)} hops on "
                f"{action.target_router} will make the degraded path less preferred by remote ASes, "
                f"naturally shifting inbound traffic to healthier links. "
                f"NOC approval is requested before applying."
            ),
            ActionType.ISP_FAILOVER: (
                f"Critical anomaly detected — the agent recommends failing over to "
                f"{action.parameters.get('failover_to','ISP-B')}. "
                f"This is a Tier 3 action requiring explicit NOC sign-off due to high blast radius. "
                f"Simulation confirms the failover path is healthy."
            ),
            ActionType.NO_ACTION: (
                "All BGP paths are operating within SLA thresholds. "
                "No routing changes are required at this time."
            ),
        }
        return templates.get(action.action_type,
                             f"Agent recommends {action.action_type.value} "
                             f"on {action.target_router}. Rationale: {action.rationale}")

    def _template_root_cause(self, snapshot: NetworkStateSnapshot,
                             features: Dict[str, float]) -> str:
        issues = []
        if snapshot.peer_down_count > 0:
            issues.append(f"{snapshot.peer_down_count} BGP peer(s) are down")
        if features.get("avg_latency_ms", 0) > 20:
            issues.append(f"elevated latency ({features['avg_latency_ms']:.1f}ms)")
        if features.get("avg_packet_loss_pct", 0) > 0.003:
            issues.append(f"packet loss ({features['avg_packet_loss_pct']*100:.2f}%)")
        if not issues:
            return "Network is operating normally. All BGP sessions are established and metrics are within SLA."
        return (f"Root cause analysis: {'; '.join(issues)} contributing to anomaly score "
                f"{snapshot.anomaly_score:.3f}. "
                f"Primary recommendation is path steering via LOCAL_PREF adjustment.")
# ─────────────────────────────────────────────────────────────────────────────
# 3.4 AIAgentCore — combines all three model tiers
# ─────────────────────────────────────────────────────────────────────────────
class AIAgentCore:
    """
    The AI Agent Core — ties together perception, RL decision, and LLM layers.
    """
    def __init__(self):
        self.perception = PerceptionModule()
        self.rl_agent = RLDecisionAgent()
        self.llm = LLMOrchestrator()

    def decide(self, snapshot: NetworkStateSnapshot,
               features: Dict[str, float]) -> Tuple[AgentAction, str, str]:
        """
        Full decision pipeline:
          1. Perceive network state
          2. RL agent selects action
          3. LLM generates explanation + root-cause
        Returns (action, explanation, root_cause)
        """
        log.info("\n[AICore] ─── Decision cycle ───────────────────────")
        # Step 1: Perception
        perception = self.perception.perceive(snapshot, features)
        # Step 2: RL decision
        action = self.rl_agent.select_action(perception, snapshot, features)
        # Step 3: LLM explanation
        explanation = self.llm.explain_action(action)
        root_cause = self.llm.synthesise_root_cause(snapshot, features)
        log.info(f"[AICore] LLM: {explanation[:100]}...")
        log.info(f"[AICore] Root cause: {root_cause[:100]}...")
        return action, explanation, root_cause
# ─────────────────────────────────────────────────────────────────────────────
# Standalone entry point
# ─────────────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    from phase1_telemetry.telemetry_pipeline import TelemetryPipeline
    pipeline = TelemetryPipeline()
    snapshot = pipeline.run(cycles=2, interval=0.2)
    features = pipeline.get_features(snapshot)
    agent = AIAgentCore()
    action, explanation, root_cause = agent.decide(snapshot, features)
    print(f"\n{'='*60}")
    print(f"ACTION : {action.action_type.value}")
    print(f"ROUTER : {action.target_router}")
    print(f"CONFIDENCE : {action.confidence:.0%}")
    print(f"TIER : T{action.autonomy_tier.value}")
    print(f"EXPLANATION : {explanation}")
    print(f"ROOT CAUSE : {root_cause}")
    print(f"{'='*60}\n")
Phase 4 — Constraint Engine & Autonomy Tiers
This phase is the governance and safety control layer of autonomous BGP management. The engine checks each AI-generated routing action against predefined network policies, operational guardrails and SLA requirements before it is rolled out to production. It enforces guardrails such as bounded LOCAL_PREF ranges, a no-export community safeguard, a limit on the number of routing changes per minute, blast-radius limits that depend on the action's autonomy tier, and minimum confidence thresholds for fully automated actions. The engine continuously monitors network health after a change and automatically invokes rollback when SLA policies on latency, packet loss or stability are violated. In addition, every decision, approval, execution and rollback triggered by an AI action is logged in a full audit trail, which provides operational transparency and supports compliance and traceability across enterprise and service provider environments.
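The guardrail checks listed above can be sketched as a single validation function. This is an illustrative sketch only: the `ProposedAction` record, the `check_action` helper and every numeric limit are hypothetical placeholders, since the article does not state the actual thresholds.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical guardrail values; real limits would come from network policy.
LOCAL_PREF_RANGE = (50, 300)
MAX_CHANGES_PER_MINUTE = 5
MAX_BLAST_BY_TIER = {1: 0.15, 2: 0.45, 3: 1.00}
MIN_CONFIDENCE_FOR_AUTO = 0.85


@dataclass
class ProposedAction:
    action_type: str
    tier: int                 # 1 = auto, 2 = approve, 3 = manual
    confidence: float
    blast_radius: float
    new_local_pref: int = 100


def check_action(action: ProposedAction, changes_last_minute: int) -> List[str]:
    """Return the violated guardrails; an empty list means the action may proceed."""
    violations = []
    lo, hi = LOCAL_PREF_RANGE
    if action.action_type == "LOCAL_PREF_ADJUST" and not lo <= action.new_local_pref <= hi:
        violations.append("local_pref_out_of_range")
    if changes_last_minute >= MAX_CHANGES_PER_MINUTE:
        violations.append("rate_limit_exceeded")
    if action.blast_radius > MAX_BLAST_BY_TIER[action.tier]:
        violations.append("blast_radius_too_high_for_tier")
    if action.tier == 1 and action.confidence < MIN_CONFIDENCE_FOR_AUTO:
        violations.append("confidence_below_auto_threshold")
    return violations


ok = ProposedAction("LOCAL_PREF_ADJUST", tier=1, confidence=0.90,
                    blast_radius=0.10, new_local_pref=150)
bad = ProposedAction("LOCAL_PREF_ADJUST", tier=1, confidence=0.70,
                     blast_radius=0.30, new_local_pref=500)
print(check_action(ok, changes_last_minute=1))
print(check_action(bad, changes_last_minute=6))
```

Returning the full list of violations, rather than stopping at the first one, gives the audit trail a complete record of why an action was blocked.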
phase5_shadow/shadow_mode.py
phase4_guardrails/constraint_engine.py
Phase 5 — Shadow Mode & Human Feedback Loop
It is the controlled learning stage where the AI agent operates alongside NOC engineers without directly impacting the live production network. In this phase, the AI continuously analyses telemetry, generates BGP optimization decisions, and validates them through the Digital Twin simulation environment before presenting recommendations to operators. NOC engineers can approve, reject, or partially modify the suggested actions, and the system tracks the agreement rate between human decisions and AI recommendations to measure operational trust and readiness for higher autonomy. The collected human feedback is then converted into reinforcement learning reward signals to continuously retrain and improve the AI decision-making model. Additionally, Tier 2 actions can be automatically applied after a predefined approval TTL expires if no operator response is received, enabling gradual transition toward safe autonomous network operations.
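A rough sketch of the three mechanisms in this phase: the agreement rate, the conversion of operator feedback into a reward, and the Tier 2 approval TTL. The `Recommendation` record, the reward values and the 300-second TTL are assumptions chosen for illustration, not values from the platform.

```python
from dataclasses import dataclass
from typing import List, Optional

APPROVAL_TTL_S = 300.0   # hypothetical TTL for Tier 2 recommendations


@dataclass
class Recommendation:
    action_type: str
    tier: int
    created_at: float                       # seconds on any monotonic clock
    operator_verdict: Optional[str] = None  # "approve", "reject", "modify" or None


def agreement_rate(history: List[Recommendation]) -> float:
    """Share of operator-reviewed recommendations that were approved unchanged."""
    reviewed = [r for r in history if r.operator_verdict is not None]
    if not reviewed:
        return 0.0
    return sum(r.operator_verdict == "approve" for r in reviewed) / len(reviewed)


def feedback_reward(verdict: str) -> float:
    """Map operator feedback to an RL reward signal (illustrative values)."""
    return {"approve": 1.0, "modify": 0.3, "reject": -1.0}[verdict]


def should_auto_apply(rec: Recommendation, now: float) -> bool:
    """Tier 2 actions auto-apply once the TTL expires with no operator response."""
    return (rec.tier == 2 and rec.operator_verdict is None
            and now - rec.created_at >= APPROVAL_TTL_S)


history = [
    Recommendation("LOCAL_PREF_ADJUST", 1, 0.0, "approve"),
    Recommendation("AS_PATH_PREPEND", 2, 10.0, "reject"),
    Recommendation("MED_ADJUST", 1, 20.0, "approve"),
    Recommendation("PEER_ACTIVATE", 2, 30.0, "modify"),
    Recommendation("AS_PATH_PREPEND", 2, 40.0),   # still awaiting a verdict
]
print(agreement_rate(history))
print(should_auto_apply(history[-1], now=400.0))
```

Counting a modified recommendation as a disagreement is a conservative choice; a team could instead give it partial credit, as the reward mapping does.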
Phase 6 — Live Production Deployment
This phase represents the transition from supervised, AI-assisted networking to fully operational autonomous BGP management in production environments. Validated routing decisions are deployed directly to live routers using NETCONF or RESTCONF APIs, while continuous SLA monitoring verifies network stability after every change. If latency, packet loss or reachability degrades beyond policy thresholds, the system automatically triggers rollback procedures to restore the previous stable state. The platform also incorporates continuous learning pipelines in which operational feedback and rewards retrain the RL models, while MLflow tracks model experiments, performance metrics and version history. A/B shadow testing enables safe comparison of new AI policies against the existing production models before promotion, and a real-time operational dashboard provides visibility into network health, AI decisions, SLA compliance, rollback events and overall autonomous system performance.
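The post-change monitoring and rollback logic can be sketched as follows. This is a simplified in-memory model: the configuration is a plain dictionary rather than a NETCONF/RESTCONF push, the `HealthSample` record and `apply_with_rollback` helper are hypothetical, and the thresholds simply reuse the 30 ms and 0.5 % values from the Perception Module constants for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Illustrative SLA thresholds (assumed, not taken from a real policy).
MAX_LATENCY_MS = 30.0
MAX_LOSS_PCT = 0.005      # fraction, i.e. 0.5 %


@dataclass
class HealthSample:
    latency_ms: float
    loss_pct: float
    reachable: bool


def sla_violated(s: HealthSample) -> bool:
    return (not s.reachable) or s.latency_ms > MAX_LATENCY_MS or s.loss_pct > MAX_LOSS_PCT


def apply_with_rollback(config: Dict[str, int], change: Dict[str, int],
                        post_change_samples: List[HealthSample],
                        max_consecutive: int = 2) -> Dict[str, int]:
    """Apply `change`, then roll back if the SLA is violated on
    `max_consecutive` samples in a row after the change."""
    previous = dict(config)               # snapshot of the last stable state
    candidate = {**config, **change}
    consecutive = 0
    for sample in post_change_samples:
        consecutive = consecutive + 1 if sla_violated(sample) else 0
        if consecutive >= max_consecutive:
            return previous               # rollback to the stable state
    return candidate


stable = {"local_pref_isp_a": 100}
change = {"local_pref_isp_a": 150}
healthy = [HealthSample(12.0, 0.001, True)] * 3
degraded = [HealthSample(12.0, 0.001, True),
            HealthSample(45.0, 0.002, True),
            HealthSample(50.0, 0.010, True)]
print(apply_with_rollback(stable, change, healthy))
print(apply_with_rollback(stable, change, degraded))
```

Requiring consecutive violations avoids rolling back on a single noisy sample, at the cost of reacting one sampling interval later.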
phase6_production/production_agent.py
AGENTIC AI BASED BGP PATH MANIPULATION END-TO-END PROCESS
Summary:
The architecture above presents Agentic AI as a new paradigm: an intelligent, autonomous, self-healing network optimisation framework that modernises traditional BGP operations in enterprise and service provider environments. BGP path manipulation, through Local Preference, AS Path Prepending, MED, Weight and Community values, is at the heart of most traffic engineering, redundancy, load balancing, disaster recovery and SLA assurance strategies in modern production networks. However, manual BGP policy enforcement under congestion, route flapping or fiber failure can introduce routing loops and asymmetric routes, leading to packet loss and prolonged service interruptions. The solution therefore introduces an AI-based autonomous networking system that continuously assesses streaming telemetry, BGP updates, network topology and historical incident information to make optimized routing decisions in near real time with minimal human intervention.
In the enterprise use case, Brayan, a Network Optimization Engineer at Wipro Telecom, faces traffic congestion on one path and service degradation on another segment caused by an OFC cut. Traditionally, engineering teams assess traffic flows and simulate path changes manually against pre-defined routing policies, which increases operational risk and incident response time. With Agentic AI, the system detects congestion and instability from real-time telemetry, uses Digital Twin simulation to test alternative paths, and automatically manipulates key BGP attributes such as Local Preference, MED, AS Path Prepending and Community values to reroute traffic intelligently. This enables quicker convergence, balanced traffic distribution, high availability, less downtime and proactive SLA protection, and it reduces the blast radius of failures.
The sample topology has edge routers R1 and R2 connecting to multiple ISPs (ISP-A = Jio, ISP-B = Idea, ISP-C = Vodafone), R3 acting as the route reflector, and R4 and R5 as data centre spine devices. In Phase 1 (The Telemetry & Observability Foundation), the AI workflow begins with the collection of streaming telemetry from routers using simulated gNMI and OpenConfig YANG models. A BMP collector listens for BGP route advertisements and peer state changes, while the telemetry stream carries CPU usage, routing health and protocol statistics. Telemetry is streamed through a Kafka-style event bus and stored in an InfluxDB-style time-series database to support troubleshooting, anomaly detection and predictive analytics. Feature extraction pipelines transform the raw telemetry into machine-learning-ready datasets for models targeting anomaly detection, root-cause analysis, predictive failure detection and autonomous remediation. In real production deployments, pygnmi, GoBMP, Confluent Kafka and InfluxDB provide scalable telemetry collection for intelligent network automation.
