[Borg-Orchestrator 02] Scaling the Borg Dataset into an XGBoost Workspace

After the forecaster idea became concrete, the project stopped feeling like an ML problem and became a data-engineering problem. That was not a downgrade. It was the part that decided whether the later model would mean anything.

Borg traces are not friendly product analytics tables. Usage windows, task events, machine information, scheduling metadata, terminal states, and generated artifacts all carry different meanings. If those meanings drift while preparing features, the model can still train, the charts can still look polished, and the conclusion can still be wrong.

So this phase was about building a workspace that made the boring parts explicit: where raw data lives, where processed data lives, how joins are performed, how labels are preserved, and how each run leaves behind artifacts that can be inspected later.

The external processed tree

I kept raw and processed data under ~/Documents because the repository should not become a landfill of generated files. The baseline path and the advanced XGBoost workspace were also separated. That decision made later experimentation safer because I could rebuild one track without destroying the other.

~/Documents/borg_data
~/Documents/borg_processed

~/Documents/borg_xgboost_workspace/raw
~/Documents/borg_xgboost_workspace/processed
~/Documents/borg_xgboost_workspace/models
~/Documents/borg_xgboost_workspace/reports
~/Documents/borg_xgboost_workspace/runtime
~/Documents/borg_xgboost_workspace/config

This looks like housekeeping, but it changed how I worked. When I ran a repair script or a retraining command, I could tell whether the artifact belonged to the baseline track or the advanced model track. That mattered once I started producing multi-horizon XGBoost models and orchestrator traces in parallel.
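A minimal sketch of how that layout could be created and queried, using only the standard library. The subdirectory names come from the listing above; the helper names ensure_workspace and track_of are hypothetical and are not taken from the project's scripts.

```python
from __future__ import annotations

from pathlib import Path

# Subdirectory names taken from the workspace listing above.
WORKSPACE_SUBDIRS = ("raw", "processed", "models", "reports", "runtime", "config")


def ensure_workspace(root: Path) -> dict[str, Path]:
    """Create the workspace subdirectories if missing and return them by name."""
    paths = {name: root / name for name in WORKSPACE_SUBDIRS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


def track_of(artifact: Path, tracks: dict[str, Path]) -> str | None:
    """Return the name of the track whose root directory contains the artifact."""
    resolved = artifact.resolve()
    for name, root in tracks.items():
        # is_relative_to compares whole path components, so a sibling such as
        # borg_processed is never mistaken for borg_xgboost_workspace/processed.
        if resolved.is_relative_to(root.resolve()):
            return name
    return None
```

Calling track_of with a mapping such as {"baseline": ~/Documents/borg_processed, "advanced": ~/Documents/borg_xgboost_workspace} (paths expanded first) answers the question "which track produced this file?" without relying on memory.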

Joining usage, events, and machines

The first real dataset was a joined table. Usage rows carried the time-window behavior. Events carried terminal state and task lifecycle evidence. Machines added context. I used Polars because I wanted streaming scans over parquet instead of loading everything into memory and hoping for the best.

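import polars as pl

# Lazy scans: nothing is read from disk until collect() runs.
usage = pl.scan_parquet(str(usage_path))
events = pl.scan_parquet(str(events_path))
machines = pl.scan_parquet(str(machines_path))

dataset = (
    usage
    # Attach task lifecycle evidence to each usage window.
    .join(events, on=["collection_id", "instance_index"], how="left", suffix="_event")
    # Attach machine context; unmatched machine_id values yield null machine columns.
    .join(machines, on="machine_id", how="left", suffix="_machine")
    # Execute the plan on the streaming engine rather than loading every input at once.
    .collect(engine="streaming")
)

dataset.write_parquet(output_path)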

This join was one of those points where I had to stay suspicious. A successful parquet write did not mean the dataset was correct: a left join against events can silently duplicate usage rows whenever one task instance matches several event rows. I checked row counts, positive label rates, missing machine IDs, and whether terminal event timing still made sense after the join.
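The first three of those checks can be sketched without any dependencies. The example below uses plain dict rows as a stand-in for the Polars frames, and the label column name "label" is an assumption; the same counts translate directly into frame expressions.

```python
from collections import Counter


def join_report(usage_rows, joined_rows,
                keys=("collection_id", "instance_index"), label_col="label"):
    """Summarize a left join: row counts, fan-out, label rate, missing machines.

    label_col is a hypothetical column name used only for illustration.
    """
    usage_counts = Counter(tuple(r[k] for k in keys) for r in usage_rows)
    joined_counts = Counter(tuple(r[k] for k in keys) for r in joined_rows)
    # A key with more joined rows than usage rows was duplicated by the join.
    fanned_out = sorted(
        key for key, n in joined_counts.items() if n > usage_counts.get(key, 0)
    )
    labels = [r[label_col] for r in joined_rows if r.get(label_col) is not None]
    return {
        "usage_rows": len(usage_rows),
        "joined_rows": len(joined_rows),
        "fanned_out_keys": fanned_out,
        "positive_rate": sum(labels) / len(labels) if labels else None,
        "missing_machine_id": sum(
            1 for r in joined_rows if r.get("machine_id") is None
        ),
    }
```

A non-empty fanned_out_keys list is the signal to deduplicate or aggregate the event side before joining, since duplicated usage windows would otherwise shift the positive rate.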

Why I did not jump straight to advanced features

I was tempted to build the complicated feature set immediately. Rolling windows, deltas, request ratios, priority encodings, cluster-level pressure, all of that was more interesting than verifying joins. But doing that too early made every later error harder to isolate. So I kept the first track intentionally plain: produce the joined dataset, produce a forecaster frame, train a baseline, inspect the predictions.

The baseline was not meant to be the final model. It was a sanity instrument. If the baseline could not produce a plausible risk ranking, then the advanced model would only hide the failure behind more parameters.

The first baseline training loop

The baseline training script standardized the forecaster frames, split validation data, trained a model, and wrote the outputs I knew I would want later: validation predictions and top risk alerts. I cared about those files because they let me inspect actual rows instead of only reading aggregate metrics.

python scripts/train_forecaster_baseline.py --clusters b,c,d,e,f,g --feature-profile baseline
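For readers wondering how a command line like this might be parsed, here is a hedged argparse sketch. Only the two flag names come from the command above; the helper names, defaults, and help text are assumptions, and the real script may handle its arguments differently.

```python
from __future__ import annotations

import argparse


def parse_clusters(value: str) -> list[str]:
    """Split a comma-separated cluster list such as 'b,c,d' into ['b', 'c', 'd']."""
    clusters = [part.strip() for part in value.split(",") if part.strip()]
    if not clusters:
        raise argparse.ArgumentTypeError("expected at least one cluster name")
    return clusters


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the two flags shown in the command above."""
    parser = argparse.ArgumentParser(description="Baseline forecaster training (sketch).")
    parser.add_argument("--clusters", type=parse_clusters, required=True,
                        help="comma-separated cluster names, e.g. b,c,d")
    # The set of valid profiles is not known here, so no choices are enforced.
    parser.add_argument("--feature-profile", default="baseline",
                        help="name of the feature profile to build")
    return parser
```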

The useful outputs were not just the model file. The validation predictions parquet gave me a ranked surface to inspect. The top-risk alerts parquet gave me the first version of what later became the control-plane idea: risk is more useful when it can be attached to an action candidate.

validation_predictions.parquet: scored validation rows, sorted later by risk_score.
top_risk_alerts.parquet: high-risk rows kept for inspection.
cluster forecaster parquet files: per-cluster feature/label frames.
metrics text and reports: enough context to compare later runs.
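The step from scored validation rows to a top-risk alerts file can be sketched as a ranked selection. This is an illustration under assumptions: rows are plain dicts, the risk_score column name comes from the list above, and the cutoff k and the min_score threshold are hypothetical parameters.

```python
import heapq


def top_risk_alerts(rows, k=100, min_score=0.0, score_col="risk_score"):
    """Return up to k rows with the highest risk score, highest first.

    Rows scoring below min_score are dropped before ranking.
    """
    eligible = (row for row in rows if row[score_col] >= min_score)
    return heapq.nlargest(k, eligible, key=lambda row: row[score_col])
```

Keeping whole rows, rather than only the scores, is what makes the output inspectable: each alert still carries the identifiers needed to trace it back to a task and a machine.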

The first friction-heavy trial-and-error loop

The difficult part was that every small pipeline change could invalidate something downstream. I would fix an optional column, then a cluster would produce different feature coverage. I would change the label horizon, then the positive rate would move. I would repair a schema issue, then model results would look better, but I still had to check whether the improvement was real or just caused by a different filtered dataset.
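One way to tell a real improvement from a changed dataset is to summarize each run's input and compare the summaries before comparing metrics. The sketch below assumes dict rows, a hypothetical "label" column, and an arbitrary 0.01 tolerance; none of these names or values come from the project itself.

```python
def dataset_summary(rows, feature_cols, label_col="label"):
    """Record row count, positive rate, and per-feature non-null coverage."""
    n = len(rows)
    positives = sum(1 for row in rows if row[label_col] == 1)
    coverage = {
        col: (sum(1 for row in rows if row.get(col) is not None) / n if n else 0.0)
        for col in feature_cols
    }
    return {
        "rows": n,
        "positive_rate": positives / n if n else 0.0,
        "coverage": coverage,
    }


def drift_between(before, after, tolerance=0.01):
    """List the differences that make two runs not directly comparable."""
    notes = []
    if before["rows"] != after["rows"]:
        notes.append(f"row count {before['rows']} -> {after['rows']}")
    if abs(before["positive_rate"] - after["positive_rate"]) > tolerance:
        notes.append(
            f"positive rate {before['positive_rate']:.3f} -> {after['positive_rate']:.3f}"
        )
    for col in sorted(set(before["coverage"]) | set(after["coverage"])):
        old = before["coverage"].get(col, 0.0)
        new = after["coverage"].get(col, 0.0)
        if abs(new - old) > tolerance:
            notes.append(f"coverage of {col} {old:.3f} -> {new:.3f}")
    return notes
```

An empty list from drift_between means the two runs saw comparable data, so a metric change is more likely to reflect the model than the filter.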

This is where I started building the habit that carried through the rest of the project: every artifact needed to be inspectable. A model score alone was not enough. A dashboard alone was not enough. A parquet file without a local explanation was not enough. I wanted the project to be explicit enough that I could keep moving without fooling myself.
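A "local explanation" for an artifact can be as small as a JSON sidecar written next to it. The sketch below is one possible shape, not the project's actual format: the .meta.json suffix, the field names, and the write_sidecar helper are all assumptions.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquet files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(artifact: Path, *, command: str, inputs: list[str],
                  notes: str = "") -> Path:
    """Write <artifact>.meta.json describing how the artifact was produced."""
    sidecar = artifact.with_name(artifact.name + ".meta.json")
    payload = {
        "artifact": artifact.name,
        "sha256": file_sha256(artifact),
        "size_bytes": artifact.stat().st_size,
        "command": command,
        "inputs": inputs,
        "notes": notes,
    }
    sidecar.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return sidecar
```

The hash makes it possible to notice later that a file was regenerated, and the recorded command and inputs answer "where did this come from?" without rereading shell history.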

What this phase proved

By the end of this phase, the project had a stable baseline data path. It could read the joined Borg-derived datasets, build forecaster frames with a 15-minute-style failure target, train a baseline forecaster, and write predictions I could inspect. It was still not the orchestrator. It was the first reliable measuring tool.
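To make the idea of a 15-minute failure target concrete, here is a sketch of one plausible labeling rule for a single task instance: a usage window is positive if a failure event occurs after the window ends and within the horizon. The exact rule used in the project is not stated here, so the interval boundaries, the microsecond timestamp unit, and the function name are all assumptions.

```python
from bisect import bisect_right

# Assumes timestamps are integers in microseconds.
HORIZON_US = 15 * 60 * 1_000_000


def label_windows(window_ends, failure_times, horizon=HORIZON_US):
    """Label each window 1 if a failure falls in (window_end, window_end + horizon]."""
    failures = sorted(failure_times)
    labels = []
    for end in window_ends:
        # Index of the first failure strictly after the window end.
        i = bisect_right(failures, end)
        hit = i < len(failures) and failures[i] <= end + horizon
        labels.append(1 if hit else 0)
    return labels
```

Excluding failures at or before the window end keeps the label a forecast rather than a description of something the window already observed. Changing the horizon changes which windows are positive, and with them the positive rate.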

The next problem was more painful: schema drift and broken labels. That was the part where the pipeline stopped being a clean sequence and became a repair project.

The output of this stage was not a clever model. It was a workspace that made later experiments accountable. That is the kind of foundation that looks unglamorous in a demo, but quietly determines whether every later claim is defensible.