The last major phase was about honesty. Were controller decisions actually touching Kubernetes, and was the dashboard explaining the result without pretending to know more than it did?

The dual-cluster comparison gave me a strong local framework, but it also exposed a subtle gap. Agent A/B/C decisions were evaluated through the orchestrator path, while the dashboard read real Kubernetes metrics. The UI could show real cluster state, but it still had to distinguish recommendations from actions that actually mutated controlled resources.

I did not want a dashboard that looked alive while the controller was only narrating. So I added a narrow live Kubernetes action executor. Narrow is the important word here: controlled namespaces, labeled exercise deployments, bounded mutations, and every kubectl operation recorded.

Bounded action execution

The executor records the command, return code, stdout, and stderr for every operation. I wanted this because silent mutation is dangerous. If the controller claims it scaled or capped something, I need the action trail in the decision payload.

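from pathlib import Path
from typing import Any

def _record_operation(
    kubeconfig: str | Path,
    operations: list[dict[str, Any]],
    description: str,
    args: list[str],
) -> None:
    """Run one kubectl command and append its full outcome to the action trail."""
    completed = _run_kubectl(kubeconfig, args)
    # Appended regardless of the return code, so failures stay visible in the trail.
    operations.append({
        "description": description,
        "command": "kubectl " + " ".join(args),
        "returncode": completed.returncode,
        "stdout": completed.stdout.strip(),
        "stderr": completed.stderr.strip(),
    })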

Scaling and resource changes are similarly bounded. The executor discovers deployments only by the orchestrator exerciser label, then applies a small set of allowed changes. It can scale exercise deployments, cap a comparison load generator, or restart controlled work. It is not a general-purpose cluster automation tool, and I prefer it that way.

def execute_live_kubernetes_action(
    action: AgentAction,
    kubeconfig: str | Path,
    *,
    namespace: str = DEFAULT_EXERCISE_NAMESPACE,
    workload_namespace: str = DEFAULT_WORKLOAD_NAMESPACE,
) -> dict[str, Any]:
    # Discover targets only through the orchestrator exerciser label.
    names, discovery_error = _deployment_names(kubeconfig, namespace)
    operations: list[dict[str, Any]] = []
    # The result starts as "observed" and carries the operation trail by reference.
    result: dict[str, Any] = {
        "status": "observed",
        "namespace": namespace,
        "agent": action.agent_name,
        "kind": action.kind.value,
        "target": action.target,
        "payload": dict(action.payload),
        "matched_deployments": names,
        "workload_namespace": workload_namespace,
        "operations": operations,
    }
    # Discovery failed: report the error without mutating anything.
    if discovery_error:
        result["status"] = "error"
        result["error"] = discovery_error
        return result
    # Nothing to act on: no labeled deployments, or the agent chose a no-op.
    if not names or action.kind == ActionKind.NOOP:
        result["status"] = "no_targets" if not names else "noop"
        return result
    # The bounded scale, cap, and restart operations follow (omitted from this excerpt).
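
To make "bounded" concrete, here is a minimal sketch of how label-based discovery and a clamped scale request could be expressed as kubectl arguments. The namespace name, the label selector, the replica ceiling, and both function names are hypothetical values chosen for illustration; the project's real constants and helpers may differ.

```python
# All names and limits below are hypothetical, chosen only for illustration.
ALLOWED_NAMESPACES = frozenset({"orchestrator-exercise"})
EXERCISE_SELECTOR = "orchestrator/exerciser=true"
MAX_REPLICAS = 5


def discovery_args(namespace: str, selector: str = EXERCISE_SELECTOR) -> list[str]:
    """kubectl arguments that list only labeled deployments in one namespace."""
    return [
        "get", "deployments",
        "--namespace", namespace,
        "--selector", selector,
        "--output", "jsonpath={.items[*].metadata.name}",
    ]


def bounded_scale_args(namespace: str, deployment: str, requested: int) -> list[str]:
    """kubectl arguments for a scale request, refused or clamped at the boundary."""
    if namespace not in ALLOWED_NAMESPACES:
        raise PermissionError(f"namespace {namespace!r} is outside the controlled set")
    # Clamp instead of trusting the agent's number.
    replicas = max(0, min(MAX_REPLICAS, requested))
    return [
        "scale", "deployment", deployment,
        "--namespace", namespace,
        f"--replicas={replicas}",
    ]
```

Under these assumptions, a request for 50 replicas becomes `--replicas=5`, and any request aimed at an uncontrolled namespace such as `kube-system` is rejected before kubectl is ever invoked.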

Attaching execution to the dashboard state

The live loop attaches the execution result to the decision payload. This changed the meaning of the dashboard. A decision could now show not just AgentA:replicate, but also whether the bounded Kubernetes action was attempted and what kubectl returned.

if exercise_cluster:
    decision_payload["kubernetes_execution"] = execute_live_kubernetes_action(
        action,
        kubeconfig_path,
        namespace=exercise_namespace,
    )

state.decision(decision_payload)

This made the dashboard less clean, but more honest. Failed operations, no target matches, and no-op states are part of the experiment. I would rather show that mess than publish a controller story that hides whether anything happened.

Energy limits

Energy was the easiest metric to overclaim, so I kept the boundary explicit. The project uses a utilization-derived dynamic power estimate, not a physical wattmeter. It is useful for comparing controlled workload pressure under the same local model. It is not a claim about exact machine power consumption.

from dataclasses import dataclass

# slots=True requires Python 3.10 or newer.
@dataclass(frozen=True, slots=True)
class PowerCalibration:
    idle_watts: float = 80.0             # baseline draw at zero utilization
    cpu_full_scale_watts: float = 120.0  # added at 100% CPU utilization
    mem_full_scale_watts: float = 60.0   # added at 100% memory utilization
    source: str = "default_utilization_model"

def estimate_node_power_watts(
    cpu_util: float,
    mem_util: float,
    calibration: PowerCalibration | None = None,
) -> float:
    """Estimate node power from utilization. This is a model, not a measurement."""
    calibrated = calibration or DEFAULT_POWER_CALIBRATION
    return (
        calibrated.idle_watts
        + (calibrated.cpu_full_scale_watts * _bounded_ratio(cpu_util))
        + (calibrated.mem_full_scale_watts * _bounded_ratio(mem_util))
    )
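
A worked example makes the arithmetic easier to check. The sketch below restates the model in self-contained form with the default calibration values. `bounded_ratio` is an assumed stand-in for `_bounded_ratio`, taken here to clamp utilization to the range 0 to 1; the real helper is not shown in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Calibration:
    idle_watts: float = 80.0
    cpu_full_scale_watts: float = 120.0
    mem_full_scale_watts: float = 60.0


def bounded_ratio(value: float) -> float:
    """Assumed behavior of _bounded_ratio: clamp a utilization to [0, 1]."""
    return max(0.0, min(1.0, value))


def estimate_watts(
    cpu_util: float, mem_util: float, cal: Calibration = Calibration()
) -> float:
    """Idle power plus utilization-scaled CPU and memory terms."""
    return (
        cal.idle_watts
        + cal.cpu_full_scale_watts * bounded_ratio(cpu_util)
        + cal.mem_full_scale_watts * bounded_ratio(mem_util)
    )


print(estimate_watts(0.5, 0.25))  # 80 + 120*0.5 + 60*0.25 = 155.0
print(estimate_watts(0.0, 0.0))   # idle floor: 80.0
print(estimate_watts(1.7, 1.0))   # over-range CPU is clamped: 80 + 120 + 60 = 260.0
```

With the defaults, the estimate can only move between 80 W and 260 W per node, which shows why it works for relative comparisons and not as a reading of actual draw.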

The comparison dashboard also separates controlled dynamic power from whole-cluster background noise. Without that separation, unrelated observability or control-plane activity could make the experimental side look better or worse than it actually was.
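
One way to picture that separation is to take only the dynamic part of the estimate (the power above idle) and sum it per namespace into two buckets. This is a sketch under assumptions: the namespace name, the per-namespace utilization input, and both function names are hypothetical, and the dashboard's real attribution logic may work differently.

```python
CPU_FULL_SCALE_WATTS = 120.0  # default calibration values from the model above
MEM_FULL_SCALE_WATTS = 60.0
CONTROLLED_NAMESPACES = frozenset({"orchestrator-exercise"})  # hypothetical name


def dynamic_watts(cpu_util: float, mem_util: float) -> float:
    """Power above idle attributed to a slice of node utilization."""
    return CPU_FULL_SCALE_WATTS * cpu_util + MEM_FULL_SCALE_WATTS * mem_util


def split_dynamic_power(
    usage_by_namespace: dict[str, tuple[float, float]],
) -> dict[str, float]:
    """Sum dynamic power separately for controlled and background namespaces."""
    totals = {"controlled": 0.0, "background": 0.0}
    for namespace, (cpu_util, mem_util) in usage_by_namespace.items():
        bucket = "controlled" if namespace in CONTROLLED_NAMESPACES else "background"
        totals[bucket] += dynamic_watts(cpu_util, mem_util)
    return totals


usage = {
    "orchestrator-exercise": (0.25, 0.5),  # (cpu share, memory share) of the node
    "kube-system": (0.125, 0.25),
    "monitoring": (0.125, 0.0),
}
print(split_dynamic_power(usage))  # {'controlled': 60.0, 'background': 45.0}
```

Comparing only the `controlled` bucket across the two clusters keeps monitoring and control-plane load from leaking into the experimental result.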

What finally remained

The final local system has Borg-derived features, XGBoost risk and demand models, a six-layer orchestrator, Agent A/B/C proposals, a deterministic referee, optional Optuna and Ray/RLlib adaptation, live Kubernetes exercise loops, a dual-cluster comparison setup, and dashboards that show the moving parts. It is not a production autoscaler. It is a personal cloud-systems research rig that lets me ask more precise questions than I could ask by staring at HPA events alone.

The most useful thing I got from the project was not a single metric. It was a workflow: create pressure, collect live state, let the controller propose, show the conflict, compare against a baseline, and keep the interpretation boundary visible. That workflow is why the dashboards became the main part of the project.

If I continue this later, the next version should run longer repeated experiments, preserve more dashboard snapshots per run, and eventually move the comparison from Kind to EKS with real Karpenter. For now, the local version is complete enough for me to explain the whole path from Kubernetes frustration to a live orchestration experiment.

This final phase is what lets me explain the project without overclaiming. It showed where prediction ended, where decision began, where Kubernetes was actually touched, and where measurement had to stay humble. That boundary is the difference between a polished demo and an engineering artifact I can stand behind.