[Borg-Orchestrator 05] Designing the Six-Layer Orchestrator Stack

The six-layer orchestrator is where the project stopped being a model pipeline and started becoming a control-plane experiment. That shift mattered because a score alone cannot operate a system. Decisions are what move a system.

A risk score is only useful when something consumes it. A demand estimate means little if it cannot change a capacity, admission, or safety tradeoff. Once I accepted that, the project needed boundaries: where raw observations enter, where models speak, where agents propose, where conflicts are resolved, and where the UI explains the decision.

This stage was an attempt to make that control flow explicit enough to be criticized. If the system made a bad decision, I wanted to know which layer the problem came from instead of blaming a vague black box.

Reading the dashboard

This screenshot is close to what I had pictured from the beginning. At the top, the dashboard shows the active controller state without my having to dig through logs. In this run the max risk is 0.950 and the selected decision is AgentA:replicate targeting borg-experimental-worker2. In other words, the risk path crossed the high threshold, so the safety agent won the referee decision.

The live orchestration flow matters too. It shows that the system is not just a single model call. The path is: Kubernetes cluster snapshot, workload exerciser stimulus, XGBoost risk/demand inference, Agent A/B/C proposals, referee selection, reward scoreboard, and event emission. I wanted it on screen because otherwise it becomes far too easy to talk about the orchestrator as one black box.
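To make the "which layer failed" idea concrete, here is a minimal sketch of that path as an ordered list of named stages with a per-stage trace. Everything in it is an assumption for illustration: `run_tick`, the stage names, and the stub stage bodies are hypothetical and are not the project's actual code. The inference stub returns a fixed score where a real stage would call the XGBoost models.

```python
from typing import Any, Callable

State = dict[str, Any]
Stage = Callable[[State], State]


def run_tick(stages: list[tuple[str, Stage]]) -> tuple[State, list[tuple[str, str]]]:
    """Run the stages in order and record one trace entry per stage."""
    state: State = {}
    trace: list[tuple[str, str]] = []
    for name, stage in stages:
        try:
            state = stage(state)
        except Exception as exc:
            # Stop at the first failing stage so the trace names the layer.
            trace.append((name, f"failed: {exc}"))
            break
        trace.append((name, "ok"))
    return state, trace


# Placeholder stages (hypothetical); each stands in for one real layer.
def snapshot(state: State) -> State:
    return {**state, "nodes": ["borg-experimental-worker2"]}


def stimulus(state: State) -> State:
    return {**state, "stimulus": "none"}


def inference(state: State) -> State:
    # Stub: a real stage would run the risk/demand models here.
    return {**state, "p_fail_scores": {n: 0.95 for n in state["nodes"]}}


def propose(state: State) -> State:
    node, score = max(state["p_fail_scores"].items(), key=lambda kv: kv[1])
    return {**state, "proposals": [("AgentA", "replicate", node, score)]}


def referee(state: State) -> State:
    return {**state, "decision": state["proposals"][0]}


def reward(state: State) -> State:
    return {**state, "reward": None}  # reward scoring is left out of this sketch


def emit(state: State) -> State:
    agent, kind, _, _ = state["decision"]
    return {**state, "events": [f"selected {agent}:{kind}"]}


STAGES: list[tuple[str, Stage]] = [
    ("cluster_snapshot", snapshot),
    ("workload_stimulus", stimulus),
    ("inference", inference),
    ("agent_proposals", propose),
    ("referee", referee),
    ("reward", reward),
    ("event_emission", emit),
]

state, trace = run_tick(STAGES)
```

The point of the design is only that a failure shows up as a named stage in `trace` rather than as an unexplained bad decision.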

Agent responsibilities

The agent split was deliberately blunt. Agent A is safety. Agent B is efficiency. Agent C is admission and queue pressure. I did not want one giant policy object that quietly blended every objective. Splitting the agents made conflicts visible, and visible conflicts made the dashboard more honest.

from dataclasses import dataclass

# Observation, AgentAction, and ActionKind are defined elsewhere in the project.


@dataclass(slots=True)
class AgentARiskMitigator:
    priority: int = 1

    def act(self, obs: Observation) -> AgentAction:
        # No failure-risk scores this tick: nothing to mitigate.
        if not obs.p_fail_scores:
            return AgentAction("AgentA", ActionKind.NOOP, score=0.0, priority=self.priority)
        # Act on the single riskiest node.
        node_id, score = max(obs.p_fail_scores.items(), key=lambda kv: kv[1])
        # Escalate with risk: replicate at 0.83, migrate at 0.7, throttle at 0.5.
        if score >= 0.83:
            return AgentAction("AgentA", ActionKind.REPLICATE, target=node_id, score=float(score), priority=self.priority)
        if score >= 0.7:
            return AgentAction("AgentA", ActionKind.MIGRATE, target=node_id, score=float(score), priority=self.priority)
        if score >= 0.5:
            return AgentAction("AgentA", ActionKind.THROTTLE, target=node_id, score=float(score), priority=self.priority)
        # Below every threshold: report the score but take no action.
        return AgentAction("AgentA", ActionKind.NOOP, score=float(score), priority=self.priority)

Agent B and Agent C were just as direct. Agent B looked for low demand and proposed power-state, DVFS, and memory balloon actions. Agent C looked at queue length, SLA pressure, and overloaded nodes and proposed admission decisions or resource caps. The split matched the operational tension I had seen around Kubernetes: safety, efficiency, and admission are not goals that always move in the same direction.

The referee was the uncomfortable part

Once the agents existed, I had to decide what should happen when they disagreed. I did not want silent winner-takes-all behavior. The referee had to produce the selected action and a rationale, plus an overridden map so the dashboard could show what lost and why.

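# Action groups the referee uses to classify proposals.
SAFETY_ACTIONS = {ActionKind.MIGRATE, ActionKind.REPLICATE, ActionKind.THROTTLE}
EFFICIENCY_ACTIONS = {ActionKind.POWER_STATE, ActionKind.DVFS, ActionKind.MEMORY_BALLOON}
PROTECTIVE_ADMISSION_DECISIONS = {"queue", "reject", "deprioritize"}

# Excerpt from the referee. agent_a_safety, safety_label, restrictive_admission,
# and overridden are computed earlier (not shown).

# Rule 1: an Agent A safety action wins over everything else.
if agent_a_safety is not None:
    return RefereeDecision(
        action=agent_a_safety,
        rationale=f"agent-a {safety_label} preempts lower-priority actions",
        overridden=overridden,
    )

# Rule 2: otherwise, a protective admission decision wins over efficiency.
if restrictive_admission is not None:
    return RefereeDecision(
        action=restrictive_admission,
        rationale="agent-c admission protection preempts efficiency actions",
        overridden=overridden,
    )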

The referee policy is deliberately conservative. Safety can preempt efficiency. Admission protection can preempt non-safety actions. Efficiency gets its chance when the system is not loudly signaling danger. That made the controller less flashy but much easier to understand.

Why the dashboard became central

At this point I realized the dashboard was not a final reporting layer. It was part of the debugging process. If Agent A was selected, I needed to see the risk score and the target. If Agent B lost, I needed to see the demand estimate. If Agent C proposed admission control, I needed to see the queue length. And if the active stage was complete while the cluster was still unhealthy, the dashboard had to show that too.

So I kept adding state fields: current decision, proposal list, reward summary, stage progress, artifact list, Optuna history, Ray status, cluster snapshot, and event stream. The UI got dense, but this project needed that density. A polished dashboard that hid the control path would not have suited this experiment.

What this stage left behind

By the end of this stage there was a basic six-layer stack. It could train or load the XGBoost brains, run the agents, resolve conflicts, score rewards, and emit dashboard state. It was still mostly local and controlled, but it finally had the shape I wanted: a cloud-systems experiment in which the models have to live inside a visible controller.

The value of this stack was that it made responsibility visible. It separated prediction, proposal, arbitration, execution, and explanation into parts that could be reviewed. That separation turned the project from a model demo into an engineering system.