[Borg-Orchestrator 05] Designing the Six-Layer Orchestrator Stack

The six-layer orchestrator was the point where the project stopped being a model pipeline and became a control-plane experiment. That change was important because scores alone do not operate systems. Decisions do.

A risk score is useful only if something knows how to consume it. A demand estimate matters only if it changes how capacity, admission, or safety tradeoffs are evaluated. Once I accepted that, the project needed boundaries: where raw observations enter, where models speak, where agents propose, where conflict is resolved, and where the UI explains the decision.

This phase was my attempt to make that control flow explicit enough to criticize. If the system made a bad decision, I wanted to know which layer produced the mistake instead of blaming a vague black box.

Reading this dashboard

This screenshot is close to the mental picture I wanted from the beginning. At the top, the dashboard shows the active controller state instead of making me dig through logs. In this run the max risk is 0.950 and the selected decision is AgentA:replicate against borg-experimental-worker2. That means the safety agent won the referee decision: 0.950 is above the 0.83 cutoff at which Agent A proposes replication, its highest risk tier.

The live orchestration flow is also important. It shows that the system is not one model call. The path is: Kubernetes cluster snapshot, workload exerciser stimulus, XGBoost risk/demand inference, Agent A/B/C proposals, referee selection, reward scoreboard, and event emission. I wanted this displayed because otherwise it is too easy to talk about the orchestrator as if it were a single black box.
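The flow above can be sketched as an ordered list of stages, each one a function that takes the tick state and returns an enriched copy. The stage names mirror the dashboard, but everything else here is an assumption for illustration: the dict-based state, the made-up load values, the stand-in for XGBoost inference, and the `run_tick` helper are hypothetical and not the project's code.

```python
from typing import Callable

# Hypothetical convention: every stage takes the tick state and returns an enriched copy.
Stage = Callable[[dict], dict]

def cluster_snapshot(state: dict) -> dict:
    # Placeholder for reading node state from Kubernetes.
    return {**state, "nodes": ["worker1", "worker2"]}

def workload_stimulus(state: dict) -> dict:
    # Placeholder for the workload exerciser; the load values are made up.
    return {**state, "load": {"worker1": 0.20, "worker2": 0.95}}

def model_inference(state: dict) -> dict:
    # Stand-in for XGBoost risk inference: risk simply mirrors load here.
    return {**state, "p_fail": dict(state["load"])}

def agent_proposals(state: dict) -> dict:
    # Only Agent A is modelled; 0.83 is the replicate cutoff from the Agent A code.
    node, risk = max(state["p_fail"].items(), key=lambda kv: kv[1])
    kind = "replicate" if risk >= 0.83 else "noop"
    return {**state, "proposals": [("AgentA", kind, node)]}

def referee_selection(state: dict) -> dict:
    # With a single proposal in this sketch, it wins by default.
    return {**state, "decision": state["proposals"][0]}

def reward_scoreboard(state: dict) -> dict:
    # Reward scoring is not modelled in this sketch.
    return {**state, "reward": None}

def event_emission(state: dict) -> dict:
    agent, kind, node = state["decision"]
    return {**state, "events": [f"{agent}:{kind} -> {node}"]}

PIPELINE: list[Stage] = [
    cluster_snapshot,
    workload_stimulus,
    model_inference,
    agent_proposals,
    referee_selection,
    reward_scoreboard,
    event_emission,
]

def run_tick(pipeline: list[Stage]) -> tuple[dict, list[str]]:
    # Run every stage in order and record which stages executed.
    state: dict = {}
    trace: list[str] = []
    for stage in pipeline:
        state = stage(state)
        trace.append(stage.__name__)
    return state, trace
```

Calling `run_tick(PIPELINE)` returns the final state together with the list of stage names that ran, which is roughly the information the live flow panel exposes.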

Agent responsibilities

The agent split was intentionally blunt. Agent A is safety. Agent B is efficiency. Agent C is admission and queue pressure. I did not want one giant policy object that quietly mixed all objectives. Separate agents made conflict visible, and visible conflict made the dashboard more honest.

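from dataclasses import dataclass

# Observation, AgentAction, and ActionKind are defined elsewhere in the project.
# slots=True requires Python 3.10 or newer.
@dataclass(slots=True)
class AgentARiskMitigator:
    priority: int = 1

    def act(self, obs: Observation) -> AgentAction:
        # No risk scores yet: nothing to mitigate.
        if not obs.p_fail_scores:
            return AgentAction("AgentA", ActionKind.NOOP, score=0.0, priority=self.priority)
        # Act on the node with the highest predicted failure probability.
        node_id, score = max(obs.p_fail_scores.items(), key=lambda kv: kv[1])
        # Escalate with risk: throttle (>= 0.5), migrate (>= 0.7), replicate (>= 0.83).
        if score >= 0.83:
            return AgentAction("AgentA", ActionKind.REPLICATE, target=node_id, score=float(score), priority=self.priority)
        if score >= 0.7:
            return AgentAction("AgentA", ActionKind.MIGRATE, target=node_id, score=float(score), priority=self.priority)
        if score >= 0.5:
            return AgentAction("AgentA", ActionKind.THROTTLE, target=node_id, score=float(score), priority=self.priority)
        # Below every threshold: report the score but take no action.
        return AgentAction("AgentA", ActionKind.NOOP, score=float(score), priority=self.priority)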

Agent B and Agent C were equally direct. Agent B looked for low demand and proposed power-state, DVFS, or memory balloon actions. Agent C looked at queue length, SLA pressure, and overloaded nodes, then proposed admission or resource caps. I liked this split because it matched the operational tension I had seen around Kubernetes: safety, efficiency, and admission often pull in different directions.
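A minimal sketch of what Agents B and C might look like, following the same shape as Agent A. Everything specific here is an assumption made for illustration: the `demand` and `queue_length` observation fields, the `ADMISSION` action kind, the `decision` field, the priorities, and both thresholds are hypothetical and not taken from the project.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class ActionKind(Enum):
    NOOP = "noop"
    DVFS = "dvfs"
    ADMISSION = "admission"  # hypothetical member for this sketch

@dataclass
class AgentAction:
    agent: str
    kind: ActionKind
    target: Optional[str] = None
    score: float = 0.0
    priority: int = 0
    decision: Optional[str] = None  # hypothetical: "queue", "reject", ...

@dataclass
class Observation:
    demand: dict = field(default_factory=dict)  # hypothetical: node id -> demand estimate
    queue_length: int = 0                       # hypothetical: pending workloads

@dataclass
class AgentBEfficiency:
    priority: int = 3         # assumed: lowest priority of the three agents
    low_demand: float = 0.2   # assumed threshold

    def act(self, obs: Observation) -> AgentAction:
        if not obs.demand:
            return AgentAction("AgentB", ActionKind.NOOP, priority=self.priority)
        # Pick the least-loaded node as the candidate for frequency scaling.
        node_id, demand = min(obs.demand.items(), key=lambda kv: kv[1])
        if demand <= self.low_demand:
            return AgentAction("AgentB", ActionKind.DVFS, target=node_id,
                               score=1.0 - float(demand), priority=self.priority)
        return AgentAction("AgentB", ActionKind.NOOP, priority=self.priority)

@dataclass
class AgentCAdmission:
    priority: int = 2    # assumed: between safety and efficiency
    max_queue: int = 50  # assumed threshold

    def act(self, obs: Observation) -> AgentAction:
        # Queue pressure above the limit triggers protective admission control.
        if obs.queue_length > self.max_queue:
            return AgentAction("AgentC", ActionKind.ADMISSION,
                               score=float(obs.queue_length),
                               priority=self.priority, decision="queue")
        return AgentAction("AgentC", ActionKind.NOOP, priority=self.priority)
```

Keeping each agent to a single `act` method with one objective is what makes disagreement observable: both agents can return a non-NOOP action for the same observation, and something downstream has to choose.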

The referee was the uncomfortable part

Once the agents existed, I needed to decide what happened when they disagreed. I did not want silent winner-takes-all behavior. The referee had to produce a selected action and a rationale, plus an overridden map so the dashboard could show what lost and why.

SAFETY_ACTIONS = {ActionKind.MIGRATE, ActionKind.REPLICATE, ActionKind.THROTTLE}
EFFICIENCY_ACTIONS = {ActionKind.POWER_STATE, ActionKind.DVFS, ActionKind.MEMORY_BALLOON}
PROTECTIVE_ADMISSION_DECISIONS = {"queue", "reject", "deprioritize"}

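# Excerpt from the body of the referee function: agent_a_safety, safety_label,
# restrictive_admission, and overridden are computed earlier in that function.
if agent_a_safety is not None: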
    return RefereeDecision(
        action=agent_a_safety,
        rationale=f"agent-a {safety_label} preempts lower-priority actions",
        overridden=overridden,
    )

if restrictive_admission is not None:
    return RefereeDecision(
        action=restrictive_admission,
        rationale="agent-c admission protection preempts efficiency actions",
        overridden=overridden,
    )

The referee policy is conservative on purpose. Safety can preempt efficiency. Admission protection can preempt non-safety actions. Efficiency gets a chance when the system is not screaming. This made the controller less flashy, but much easier to reason about.
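The precedence described above (safety, then protective admission, then efficiency) can be sketched as one self-contained function. The three constant sets and the `RefereeDecision` fields follow the excerpt; the `resolve` function, the `ADMISSION` kind, the `decision` field, the shape of the `overridden` map, and the last two rationale strings are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class ActionKind(Enum):
    NOOP = "noop"
    MIGRATE = "migrate"
    REPLICATE = "replicate"
    THROTTLE = "throttle"
    POWER_STATE = "power_state"
    DVFS = "dvfs"
    MEMORY_BALLOON = "memory_balloon"
    ADMISSION = "admission"  # hypothetical member for this sketch

@dataclass(frozen=True)
class AgentAction:
    agent: str
    kind: ActionKind
    target: Optional[str] = None
    decision: Optional[str] = None  # hypothetical admission decision label

@dataclass
class RefereeDecision:
    action: Optional[AgentAction]
    rationale: str
    overridden: dict = field(default_factory=dict)

SAFETY_ACTIONS = {ActionKind.MIGRATE, ActionKind.REPLICATE, ActionKind.THROTTLE}
EFFICIENCY_ACTIONS = {ActionKind.POWER_STATE, ActionKind.DVFS, ActionKind.MEMORY_BALLOON}
PROTECTIVE_ADMISSION_DECISIONS = {"queue", "reject", "deprioritize"}

def _losers(proposals: list, winner: AgentAction) -> dict:
    # Assumed shape: agent name -> short reason, for every non-NOOP proposal that lost.
    return {
        p.agent: f"{p.kind.value} lost to {winner.agent}:{winner.kind.value}"
        for p in proposals
        if p is not winner and p.kind is not ActionKind.NOOP
    }

def resolve(proposals: list) -> RefereeDecision:
    def first(pred):
        return next((p for p in proposals if pred(p)), None)

    safety = first(lambda p: p.agent == "AgentA" and p.kind in SAFETY_ACTIONS)
    if safety is not None:
        return RefereeDecision(
            safety,
            f"agent-a {safety.kind.value} preempts lower-priority actions",
            _losers(proposals, safety),
        )
    admission = first(lambda p: p.agent == "AgentC"
                      and p.decision in PROTECTIVE_ADMISSION_DECISIONS)
    if admission is not None:
        return RefereeDecision(
            admission,
            "agent-c admission protection preempts efficiency actions",
            _losers(proposals, admission),
        )
    efficiency = first(lambda p: p.kind in EFFICIENCY_ACTIONS)
    if efficiency is not None:
        return RefereeDecision(
            efficiency,
            "no safety or admission pressure; efficiency action selected",
            _losers(proposals, efficiency),
        )
    return RefereeDecision(None, "no actionable proposals", {})
```

Because every branch returns both a rationale and the map of losing proposals, a dashboard built on this shape could show not only what won but what was suppressed on the same tick.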

Why the dashboard became central

At this point I realized the dashboard was not a final reporting layer. It was part of the debugging process. If Agent A was selected, I needed to see the risk score and target. If Agent B lost, I needed to see the demand estimate. If Agent C proposed admission control, I needed to see queue length. If the active stage was complete but the cluster was still unhealthy, I needed the dashboard to show that too.

That is why I kept adding state fields: current decision, proposal list, reward summary, stage progress, artifact list, Optuna history, Ray status, cluster snapshot, event stream. It made the UI dense, but the project needed density. A polished dashboard that hides the control path would not have served the experiment.
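As a rough illustration of that state surface, the fields listed above could be gathered into one serializable snapshot per tick. The class name, field names, and field types below are assumptions for the sketch; only the list of what the dashboard tracks comes from the project.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class DashboardState:
    # Field names and types are hypothetical; one field per item the dashboard tracks.
    current_decision: str = ""
    proposals: list = field(default_factory=list)
    reward_summary: dict = field(default_factory=dict)
    stage_progress: dict = field(default_factory=dict)
    artifacts: list = field(default_factory=list)
    optuna_history: list = field(default_factory=list)
    ray_status: str = "unknown"
    cluster_snapshot: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def to_json(self) -> str:
        # One JSON document per tick keeps the UI and the controller loosely coupled.
        return json.dumps(asdict(self), sort_keys=True)
```

A single flat snapshot like this is dense, but it means any field the UI shows can be traced back to one emitted document rather than reconstructed from logs.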

What this phase proved

By the end of this phase, I had the basic six-layer stack. It could train or load XGBoost brains, run agents, resolve conflicts, score rewards, and emit dashboard state. It was still mostly local and controlled, but it finally had the shape I wanted: a cloud-systems experiment where the model had to live inside a visible controller.

The stack was valuable because it made responsibility visible. It separated prediction, proposal, arbitration, execution, and explanation into parts that could be inspected. That separation is what turned the project from a model demo into an engineering system.