[Borg-Orchestrator 04] Interpreting a Multi-Horizon XGBoost Failure Model

Once the data no longer looked obviously suspicious, the next risk was trusting the model too much. A good validation number is tempting, but an orchestration system cannot consume a metric the way a notebook does.

The advanced XGBoost track had to answer a more practical question: does the model expose signals that a controller can reason about per time horizon? Short-horizon risk, long-horizon risk, resource pressure, priority, churn, and recent deltas all tell different stories. Fold them into one impressive number and the important part disappears.

So I treated the model as an input to system design. The goal was not simply to raise AUCPR. It was to understand which signals were stable enough to become part of the control path.

Thinking in multiple horizons

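The first baseline target was essentially whether a failure occurs within the configured horizon. The advanced track extended this idea. In a controller setting, the horizon matters: a risk that shows up in the very near term should not be interpreted the same way as a weaker long-horizon risk. I wanted the model artifacts and reports to keep that difference visible.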

python scripts/build_advanced_xgboost_dataset.py --clusters b,c,d,e,f,g
python scripts/train_advanced_xgboost.py --clusters b,c,d,e,f,g

The commands themselves were not the important part. What mattered was separating out an advanced workspace so that generated features, tuned models, and reports could evolve on their own without overwriting the baseline path.
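To make the multi-horizon target idea concrete, here is a minimal sketch of how per-horizon labels could be derived from failure timestamps. It is not the code from the dataset script: the function name `multi_horizon_labels`, the horizon names, and the horizon lengths are all assumptions chosen for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence

# Hypothetical horizons in seconds; the project's configured horizons are not shown in this post.
HORIZONS_S: Dict[str, int] = {"h_5m": 300, "h_1h": 3600}


def multi_horizon_labels(
    sample_times: Sequence[float],
    failure_times: Sequence[float],
    horizons: Dict[str, int] = HORIZONS_S,
) -> List[Dict[str, int]]:
    """For each sample time t, emit 1 per horizon if a failure occurs in (t, t + horizon]."""
    failures = sorted(failure_times)
    rows: List[Dict[str, int]] = []
    for t in sample_times:
        # Index of the first failure strictly after t.
        i = bisect_right(failures, t)
        next_failure = failures[i] if i < len(failures) else None
        rows.append(
            {
                name: int(next_failure is not None and next_failure - t <= h)
                for name, h in horizons.items()
            }
        )
    return rows


labels = multi_horizon_labels(sample_times=[0, 1000, 4000], failure_times=[1200, 9000])
```

Keeping one label column per horizon, rather than a single "will it fail" flag, is what lets a report show short-horizon and long-horizon risk side by side.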

The model code stayed deliberately simple

For the XGBoost brains on the orchestrator side, I kept the training functions compact. Risk was binary classification. Demand was regression. Both used histogram-based tree building and saved a model file that the live loop could load later.

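from pathlib import Path

import numpy as np


def train_safety_model(x: np.ndarray, y: np.ndarray, out_path: str | Path) -> Path:
    # _require_xgboost() is a project helper not shown here; presumably it returns the xgboost module.
    xgb = _require_xgboost()
    dtrain = xgb.DMatrix(x, label=y)
    params = {
        "max_depth": 6,
        "eta": 0.06,  # learning rate
        "subsample": 0.9,
        "colsample_bytree": 0.9,
        "objective": "binary:logistic",  # output is a failure probability
        "eval_metric": "aucpr",
        "tree_method": "hist",  # histogram-based tree building
    }
    booster = xgb.train(params=params, dtrain=dtrain, num_boost_round=300)
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create the output directory if missing
    booster.save_model(str(out))
    return out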

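This was not the most distinctive part of the project, but it was one of the most important. I needed a model path that was easy to reproduce and easy to inspect. The orchestration system was already going to be complicated enough; I did not want the model-loading layer to become one more mystery.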

Feature importance was useful but not sufficient

The feature importance report helped confirm that the model was paying attention to plausible signals: utilization, requests, rolling deltas, scheduling and priority fields, and machine-level context. But I tried not to treat feature importance as proof. It is an observation window for debugging, not a guarantee that the model will behave well inside a controller.

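The more useful question became this: can the model produce risk and demand values stable enough for an agent to reason about? Even with a decent validation metric, a model is hard to use inside a control loop if its output swings too widely or if the thresholds keep causing action flipping.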

Thresholds became part of the architecture

The agent thresholds were simple at first, but they gave the risk model behavioral meaning. A high risk score could become a replicate. A medium risk score could become a migrate or a throttle. A low risk score became a no-op. At this translation point, where a probability turns into an action, the ML experiment started to become an orchestration experiment.

node_id, score = max(obs.p_fail_scores.items(), key=lambda kv: kv[1])
if score >= 0.83:
    return AgentAction("AgentA", ActionKind.REPLICATE, target=node_id, score=float(score))
if score >= 0.7:
    return AgentAction("AgentA", ActionKind.MIGRATE, target=node_id, score=float(score))
if score >= 0.5:
    return AgentAction("AgentA", ActionKind.THROTTLE, target=node_id, score=float(score))
return AgentAction("AgentA", ActionKind.NOOP, score=float(score))

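I changed my mind about these thresholds several times. Set too low, they made the controller look overly dramatic and noisy. Set too high, they turned the risk model into decoration. In the end I came to treat these values as controller parameters, not as sacred ML output.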
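Treating the cutoffs as controller parameters can be as simple as lifting them out of the branch conditions into a small config object. The sketch below is one possible shape, not the project's code: `RiskThresholds` and `pick_action` are hypothetical names, the defaults copy the values from the snippet above, and actions are plain strings instead of the `ActionKind` enum to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskThresholds:
    """Hypothetical config object; defaults mirror the cutoffs in the agent snippet."""

    replicate: float = 0.83
    migrate: float = 0.7
    throttle: float = 0.5

    def __post_init__(self) -> None:
        # The cascade only makes sense if the bands are ordered.
        if not (0.0 <= self.throttle <= self.migrate <= self.replicate <= 1.0):
            raise ValueError("need 0 <= throttle <= migrate <= replicate <= 1")


def pick_action(score: float, t: RiskThresholds = RiskThresholds()) -> str:
    """Map a failure probability to an action name using the configured bands."""
    if score >= t.replicate:
        return "REPLICATE"
    if score >= t.migrate:
        return "MIGRATE"
    if score >= t.throttle:
        return "THROTTLE"
    return "NOOP"


default_action = pick_action(0.9)
stricter_action = pick_action(0.9, RiskThresholds(replicate=0.95))
```

With the thresholds in one place, retuning the controller becomes a config change that can be logged and compared across runs, and the same score can be shown to produce a different action under a stricter setting.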

Where interpreting the results got hard

The advanced model report was useful, but it could not answer the bigger systems question on its own. AUCPR, precision at top-k, calibration bins, and feature importance all helped judge whether the model was sane. None of them said whether the cluster would actually be better off once an agent acted on those scores.
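For readers unfamiliar with two of those report items, here is a minimal standard-library sketch of precision at top-k and equal-width calibration bins. It illustrates the metrics themselves, not the project's report code: the function names, the bin count, and the sample scores and labels are made up.

```python
from typing import List, Sequence, Tuple


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of true failures among the k highest-scored samples."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of samples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def calibration_bins(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted score, observed failure rate, count) per non-empty bin."""
    buckets: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of exactly 1.0 goes in the last bin
        buckets[idx].append((s, y))
    out: List[Tuple[float, float, int]] = []
    for bucket in buckets:
        if bucket:
            n = len(bucket)
            out.append((sum(s for s, _ in bucket) / n, sum(y for _, y in bucket) / n, n))
    return out


# Made-up scores and labels, for illustration only.
scores = [0.9, 0.85, 0.35, 0.3, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0]
p_at_2 = precision_at_k(scores, labels, k=2)
bins = calibration_bins(scores, labels, n_bins=5)
```

Both numbers describe how the scores rank and calibrate against labels. Neither one involves an action, which is why they cannot say what happens to the cluster once an agent starts acting on the scores.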

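That realization pushed the project toward the six-layer orchestrator. I needed a place where model output becomes an observation, observations become proposals, proposals collide, and a dashboard shows those collisions. Without it, all that remained was a pile of model files and a weak story.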

What this stage left behind

By the end of the advanced XGBoost stage, the project had enough model infrastructure to feed a controller. The risk model and the demand model were no longer isolated experiments. They were about to become the brains layer of a larger system.

Only in that larger system did the project start to resemble what I had first wanted when I looked at Kubernetes: not just prediction, but visible control behavior.

At this stage the project moved from model training to model interpretation. The difference matters. Anyone can read a metric, but explaining what that metric may and may not decide is the engineer's job.