[Borg-Orchestrator 04] Interpreting Multi-Horizon XGBoost Failure Models

Once the data stopped looking obviously suspicious, the next risk was overtrusting the model. A good validation number can be seductive, but an orchestration system cannot consume a metric the same way a notebook does.

The advanced XGBoost track had to answer a more practical question: does the model expose signals that a controller could reason about across time horizons? Short-horizon risk, long-horizon risk, resource pressure, priority, churn, and recent deltas all tell different stories. Collapsing them into one impressive number would have hidden the part that mattered.

So I treated the model as an input to system design. The goal was not simply to improve AUCPR. The goal was to understand which signals were stable enough to become part of a control path.

Multi-horizon thinking

The baseline target was a single binary label: failure within the configured horizon. The advanced track expanded that idea, because in a controller setting the horizon matters. A strong near-term risk should not be interpreted the same way as a weaker, longer-horizon risk. I wanted the model artifacts and reports to keep that distinction visible.

python scripts/build_advanced_xgboost_dataset.py --clusters b,c,d,e,f,g
python scripts/train_advanced_xgboost.py --clusters b,c,d,e,f,g

The important part was not the command itself. The important part was isolating the advanced workspace so the generated features, tuned models, and reports could evolve without overwriting the baseline path.
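
To make the multi-horizon idea concrete, here is a minimal sketch of how one failure timestamp could become several binary targets, one per horizon. The horizon values, the label names, and the function name are assumptions for illustration; they are not taken from the project's dataset builder.

```python
from typing import Optional

# Hypothetical horizons in seconds; the real builder's horizons are not shown here.
HORIZONS_S = (300, 900, 3600)


def multi_horizon_labels(obs_time: float, failure_time: Optional[float],
                         horizons=HORIZONS_S) -> dict:
    """Return one binary label per horizon for a single observation.

    A label is 1 when a failure occurs after the observation and no later
    than `horizon` seconds after it.
    """
    labels = {}
    for h in horizons:
        hit = failure_time is not None and 0 < failure_time - obs_time <= h
        labels[f"fail_within_{h}s"] = int(hit)
    return labels
```

With labels built this way, a positive short-horizon label always implies a positive label at every longer horizon, which is one reason the horizons are worth reporting separately rather than averaging together.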

The model code stayed intentionally plain

For the orchestrator-side XGBoost brains, I kept the training function compact. Risk was a binary classification problem; demand was a regression problem. Both used histogram tree building and saved model files that the live loop could load later.

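from pathlib import Path

import numpy as np

# _require_xgboost() is a project helper whose definition is not shown here;
# it returns the xgboost module (or raises if the dependency is missing).


def train_safety_model(x: np.ndarray, y: np.ndarray, out_path: str | Path) -> Path:
    """Train the binary failure-risk model and save it to out_path."""
    xgb = _require_xgboost()
    dtrain = xgb.DMatrix(x, label=y)
    params = {
        "max_depth": 6,
        "eta": 0.06,
        "subsample": 0.9,
        "colsample_bytree": 0.9,
        "objective": "binary:logistic",
        # eval_metric only takes effect when an `evals` list is passed to xgb.train.
        "eval_metric": "aucpr",
        "tree_method": "hist",
    }
    booster = xgb.train(params=params, dtrain=dtrain, num_boost_round=300)
    # Create the output directory if needed so the live loop can load the file later.
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    booster.save_model(str(out))
    return out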

This was not the most exotic part of the project, but it was one of the most important. I needed a model path that was easy to reproduce and easy to inspect. The orchestration system would already be complex enough. I did not want the model-loading layer to become another mystery.

Feature importance was useful, but not enough

Feature importance reports helped me check whether the model was paying attention to plausible signals: utilization, requests, rolling deltas, scheduling and priority fields, and machine-level context. But I was careful not to treat feature importance as proof. It is a debugging lens, not a guarantee that the model will behave well inside a controller.
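
One way to use importance as a debugging lens is to roll per-feature scores up into signal families and check that no implausible family dominates. The sketch below assumes a dict shaped like the output of XGBoost's `Booster.get_score(importance_type="gain")`; the feature-name prefixes and the family mapping are hypothetical, not the project's actual feature names.

```python
from collections import defaultdict

# Hypothetical prefix -> family mapping; real feature names will differ.
FAMILIES = {
    "util_": "utilization",
    "req_": "requests",
    "delta_": "rolling deltas",
    "prio_": "scheduling/priority",
    "machine_": "machine context",
}


def importance_by_family(scores: dict, families: dict = FAMILIES) -> dict:
    """Sum per-feature importance into families and normalize to shares."""
    totals = defaultdict(float)
    for name, value in scores.items():
        # Features matching no known prefix are collected under "other".
        family = next(
            (fam for prefix, fam in families.items() if name.startswith(prefix)),
            "other",
        )
        totals[family] += value
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {fam: total / grand_total for fam, total in ranked}
```

A large "other" share, or a family that should be irrelevant taking most of the gain, is a prompt to look for leakage or a broken feature. It still says nothing about how the scores behave once a controller acts on them.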

The more useful question became: can the model produce risk and demand values that are stable enough for an agent to reason about? A model can have an acceptable validation metric and still behave poorly inside a control loop if its output jumps too much or if the thresholds create constant action flipping.
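
A rough way to quantify that concern is to replay a score series through a threshold and count how often the decision changes. The sketch below is illustrative only: the function names, the smoothing factor, and the threshold are assumptions, and exponential smoothing is just one possible damping choice, not something this project is stated to use.

```python
def ema(scores, alpha=0.3):
    """Exponentially smooth a score series; lower alpha means heavier smoothing."""
    smoothed = []
    prev = None
    for s in scores:
        prev = s if prev is None else alpha * s + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def flip_count(scores, threshold=0.7):
    """Count how many times the above/below-threshold decision changes."""
    decisions = [s >= threshold for s in scores]
    return sum(1 for a, b in zip(decisions, decisions[1:]) if a != b)


# A score hovering around the threshold flips the decision on every step.
raw = [0.65, 0.75, 0.66, 0.74, 0.68, 0.73]
raw_flips = flip_count(raw)
smoothed_flips = flip_count(ema(raw))
```

Comparing `raw_flips` with `smoothed_flips` on recorded score traces gives a cheap signal for whether a threshold sits inside the model's noise band.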

Thresholds became part of the architecture

The agent thresholds were simple at first, but they gave the risk model a behavioral meaning. A high risk score could become replicate. A medium risk score could become migrate or throttle. Low risk became no-op. That translation from probability to action was where the ML experiment started becoming an orchestration experiment.

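# Excerpt from the body of the agent's decision function.
# Act on the node with the highest predicted failure probability.
node_id, score = max(obs.p_fail_scores.items(), key=lambda kv: kv[1])
# Check the most severe action first so each score maps to exactly one action.
if score >= 0.83:
    return AgentAction("AgentA", ActionKind.REPLICATE, target=node_id, score=float(score))
if score >= 0.7:
    return AgentAction("AgentA", ActionKind.MIGRATE, target=node_id, score=float(score))
if score >= 0.5:
    return AgentAction("AgentA", ActionKind.THROTTLE, target=node_id, score=float(score))
# Below every threshold: take no action, but still report the score.
return AgentAction("AgentA", ActionKind.NOOP, score=float(score))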

I changed my mind several times about these thresholds. If they were too low, the controller looked dramatic and noisy. If they were too high, the risk model became decorative. I eventually treated them as controller parameters rather than sacred ML outputs.
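
Treating the cut-offs as controller parameters can be as simple as moving them out of the decision code into a small config object. This is a sketch under assumptions: `RiskThresholds` and `classify` are hypothetical names, the returned strings stand in for the project's `ActionKind` values, and the defaults mirror the numbers in the snippet above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskThresholds:
    """Hypothetical container for the agent's risk cut-offs."""
    replicate: float = 0.83
    migrate: float = 0.70
    throttle: float = 0.50

    def __post_init__(self):
        # Thresholds must be ordered, otherwise a branch becomes unreachable.
        if not (0.0 <= self.throttle <= self.migrate <= self.replicate <= 1.0):
            raise ValueError("expected 0 <= throttle <= migrate <= replicate <= 1")


def classify(score: float, t: RiskThresholds = RiskThresholds()) -> str:
    """Map a failure probability to an action name, most severe first."""
    if score >= t.replicate:
        return "REPLICATE"
    if score >= t.migrate:
        return "MIGRATE"
    if score >= t.throttle:
        return "THROTTLE"
    return "NOOP"
```

With the thresholds in one place, a tuning run can sweep `RiskThresholds` values without touching the model or the decision logic.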

The awkward part of interpreting results

The advanced model reports were useful, but they did not answer the bigger systems question by themselves. AUCPR, precision at top-k, calibration bins, and feature importance all helped me decide whether the model was sane. They did not tell me whether a cluster would be better off when an agent used those scores.
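
For reference, two of those sanity checks can be computed in a few lines. The sketch below is a plain-Python illustration with hypothetical function names; it is not the project's reporting code, and it assumes labels are 0/1 and scores are probabilities in [0, 1].

```python
def precision_at_k(y_true, y_score, k):
    """Fraction of true failures among the k highest-scored examples."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    if not top:
        raise ValueError("no examples to rank")
    return sum(label for _, label in top) / len(top)


def calibration_bins(y_true, y_score, n_bins=5):
    """Return (mean predicted score, observed failure rate, count) per non-empty bin."""
    buckets = [[] for _ in range(n_bins)]
    for label, score in zip(y_true, y_score):
        idx = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        buckets[idx].append((score, label))
    rows = []
    for bucket in buckets:
        if bucket:
            n = len(bucket)
            mean_score = sum(s for s, _ in bucket) / n
            observed_rate = sum(l for _, l in bucket) / n
            rows.append((mean_score, observed_rate, n))
    return rows
```

In a well-calibrated bin the mean predicted score and the observed failure rate are close. Both functions describe the model on held-out data only; neither measures what happens to the cluster after an agent acts on the scores.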

That realization pushed the project toward the six-layer orchestrator. I needed a place where model outputs became observations, observations became proposals, proposals conflicted, and the dashboard showed the conflict. Without that, I would only have a pile of model files and a weak story.

What this phase proved

By the end of the advanced XGBoost phase, the project had enough model infrastructure to feed the controller. The risk model and demand model were no longer isolated experiments. They were about to become the brains layer in a larger system.

That larger system is where the project started to look like the thing I originally wanted while staring at Kubernetes: not just prediction, but visible control behavior.

This phase moved the project from model training toward model interpretation. That distinction matters. A recruiter can read a metric; an engineer has to explain what that metric should and should not be allowed to decide.