Most infrastructure projects that are worth finishing begin as a vague discomfort before they become architecture. This one started with a very specific kind of frustration: I could watch Kubernetes react, but I could not always explain the timing before the damage became visible.
At the company where I was working, I spent a lot of time around EKS, Kubernetes traces, GitHub changes, kubectl output, HPA behavior, and Karpenter behavior. From far away the cluster could look calm, but the operational story underneath was rarely calm. Workloads pushed CPU, HPA reacted later than I wanted, Karpenter added capacity on its own cadence, pods sat Pending, and the evidence lived across dashboards, terminal output, and small fragments of event history.
The first version of this project was an attempt to turn that discomfort into a measurable system. Traces would go in, failure or demand risk would come out, agents would propose actions, a referee would select one, Kubernetes would react, and a dashboard would show the whole chain without hiding the uncomfortable parts.
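The "agents propose, referee selects" step can be sketched in a few lines. This is a hypothetical illustration only: the `Proposal` fields, the agent names, and the highest-priority-wins rule are assumptions made for the example, not the selection logic the project ended up with.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Proposal:
    """One candidate action from one agent (all fields are illustrative)."""
    agent: str
    action: str
    priority: int  # assumed ranking: a higher number wins


def referee(proposals: Sequence[Proposal]) -> Optional[Proposal]:
    """Select exactly one proposal, or None when no agent proposed anything.

    Assumed rule: highest priority wins; on a tie the earliest proposal is kept,
    because max() returns the first maximal element it encounters.
    """
    if not proposals:
        return None
    return max(proposals, key=lambda p: p.priority)


proposals = [
    Proposal(agent="efficiency", action="scale_down", priority=1),
    Proposal(agent="safety", action="scale_up", priority=3),
    Proposal(agent="admission", action="hold_queue", priority=3),
]
selected = referee(proposals)
print(selected)
```

Keeping the rejected proposals alongside the selected one is what would let a dashboard show the whole chain rather than only the final action.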
The first shape of the idea
The earliest version of the project was much smaller than what it became. I wanted a failure forecaster. Given resource usage and terminal event information, could I score which tasks were likely to fail within a future window? That question was attractive because it was technical enough to be concrete, but close enough to Kubernetes operations to matter.
- Borg traces would provide task and machine behavior over time.
- Feature windows would summarize CPU, memory, request, priority, scheduling class, and recent deltas.
- A target label would mark terminal failure within a fixed horizon.
- A baseline model would expose the first obvious risk score.
- Later, that risk score could become an input to an orchestrator instead of staying as a notebook metric.
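As a rough illustration of the feature-window idea, the sketch below summarizes the most recent usage samples of one task and derives a recent delta. It uses only the standard library; the `UsageSample` fields, the window size, and the feature names are assumptions for the example and do not reflect the real trace schema or the project's actual feature set.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class UsageSample:
    """One usage record for a task; field names are illustrative."""
    end_time_us: int
    cpu: float
    memory: float


def window_features(samples: List[UsageSample], window: int = 3) -> Dict[str, float]:
    """Summarize the most recent `window` samples of a single task."""
    if not samples:
        raise ValueError("need at least one usage sample")
    # Order by time first so "recent" really means the latest samples.
    ordered = sorted(samples, key=lambda s: s.end_time_us)
    recent = ordered[-window:]
    first, last = recent[0], recent[-1]
    return {
        "cpu_mean": mean(s.cpu for s in recent),
        "cpu_max": max(s.cpu for s in recent),
        "memory_mean": mean(s.memory for s in recent),
        # Recent delta: change from the oldest to the newest sample in the window.
        "cpu_delta": last.cpu - first.cpu,
        "memory_delta": last.memory - first.memory,
    }


samples = [
    UsageSample(end_time_us=300, cpu=0.50, memory=0.25),
    UsageSample(end_time_us=100, cpu=0.25, memory=0.25),
    UsageSample(end_time_us=200, cpu=0.25, memory=0.50),
]
features = window_features(samples)
print(features)
```

Static attributes such as priority or scheduling class would be carried through unchanged next to these aggregates.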
The part I underestimated was the word "later". A model alone is very easy to overvalue. If I had only trained a classifier and printed AUCPR (area under the precision-recall curve), the project would have stopped at a static ML experiment. The more I worked on it, the more I cared about the path from model output to control decision. That is why the project eventually became dashboard-heavy. The dashboard was not decoration. It was the only way I could keep myself honest about what the system was actually doing.
Directory layout before modeling
I kept the large data outside the repository from the beginning. The Borg data was too large and too mechanical to belong in git, and I did not want generated Parquet files mixed into the source tree. The repo became the code and documentation layer. The external directories became the raw, processed, model, and report layer.
mkdir -p ~/Documents/borg_data
mkdir -p ~/Documents/borg_processed
mkdir -p ~/Documents/borg_xgboost_workspace/{raw,processed,models,reports,runtime,config}
export BORG_DATA_DIR="$HOME/Documents/borg_data"
export BORG_PROCESSED_DIR="$HOME/Documents/borg_processed"
export BORG_XGBOOST_WORKSPACE="$HOME/Documents/borg_xgboost_workspace"
That split was a small decision, but it saved me later. Once I had baseline forecaster data, advanced XGBoost features, orchestrator runtime traces, Optuna reports, and dashboard screenshots, I needed to know which artifact belonged to which track. Otherwise every failed run would have left behind files with unclear meaning.
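On the Python side, the same layout could be resolved from the environment variables exported above. This is a minimal sketch: the `resolve_dirs` helper and its dictionary keys are hypothetical, and the fallback paths simply mirror the `mkdir` commands.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

# Subdirectories created under the XGBoost workspace by the mkdir command above.
WORKSPACE_SUBDIRS = ("raw", "processed", "models", "reports", "runtime", "config")


def _from_env(env: Mapping[str, str], key: str, default_name: str) -> Path:
    """Read a directory from the environment, falling back to ~/Documents/<name>."""
    value = env.get(key)
    return Path(value) if value else Path.home() / "Documents" / default_name


def resolve_dirs(env: Optional[Mapping[str, str]] = None) -> Dict[str, Path]:
    """Return every external directory under a stable, explicit key."""
    env = os.environ if env is None else env
    workspace = _from_env(env, "BORG_XGBOOST_WORKSPACE", "borg_xgboost_workspace")
    dirs = {
        "data": _from_env(env, "BORG_DATA_DIR", "borg_data"),
        "processed": _from_env(env, "BORG_PROCESSED_DIR", "borg_processed"),
        "workspace": workspace,
    }
    for name in WORKSPACE_SUBDIRS:
        dirs[f"workspace_{name}"] = workspace / name
    return dirs


example_env = {
    "BORG_DATA_DIR": "/tmp/borg_data",
    "BORG_PROCESSED_DIR": "/tmp/borg_processed",
    "BORG_XGBOOST_WORKSPACE": "/tmp/borg_xgboost_workspace",
}
for key, path in resolve_dirs(example_env).items():
    print(f"{key:20s} {path}")
```

Routing every script through one helper like this keeps artifacts from different tracks from landing in the same directory by accident.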
Why Borg traces instead of only Kubernetes metrics
I could have started from live Kubernetes metrics only. That would have been more immediately familiar: pod CPU, memory, HPA desired replicas, Pending pods, node readiness. But I wanted more than a reactive demo. Borg traces gave me a dataset where machine/task behavior could be processed offline, repaired, and replayed. They also forced me to think in terms of terminal events and prediction horizons, which is exactly the thing I felt was missing when watching Kubernetes react after the fact.
The first baseline commands were intentionally boring. I wanted reproducible CLI entrypoints before I wanted clever architecture.
python scripts/make_dataset.py
python scripts/make_forecaster_dataset.py
python scripts/train_forecaster_baseline.py
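If the three entrypoints need to run as one reproducible sequence, a thin wrapper is enough. The sketch below is an assumption about how that could look, not part of the original scripts: `run_pipeline` is a hypothetical helper, and the `runner` parameter exists only so the sequence can be exercised without executing anything.

```python
import subprocess
import sys

# The three baseline entrypoints, in the order they are meant to run.
STEPS = (
    "scripts/make_dataset.py",
    "scripts/make_forecaster_dataset.py",
    "scripts/train_forecaster_baseline.py",
)


def run_pipeline(steps=STEPS, runner=subprocess.run):
    """Run each script with the current interpreter and stop at the first failure.

    With check=True, subprocess.run raises CalledProcessError on a non-zero
    exit code, so a later step never runs on top of a broken earlier one.
    """
    for script in steps:
        runner([sys.executable, script], check=True)
```

Calling `run_pipeline()` from the repository root would then be equivalent to typing the three commands in order.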
The first target label
The first important label was simple on paper: mark a row positive if the task has a failure terminal event within the prediction horizon. The implementation was where care was first needed. It had to exclude terminal events that had already happened before the usage window ended, and it had to keep the time arithmetic explicit.
import polars as pl

# Lazily scan the per-cluster dataset; nothing is read until the query is collected.
dataset = pl.scan_parquet(dataset_file(cluster_id))
frame = (
    dataset
    .sort(["collection_id", "instance_index", "end_time"])
    .with_columns([
        # Signed gap between the usage window end and the terminal event, in microseconds.
        (pl.col("last_event_time") - pl.col("end_time")).alias("time_to_terminal_event_us"),
        pl.col("final_event_type").is_in(failure_event_types).alias("is_failure_terminal_event"),
    ])
    .with_columns([
        (
            pl.col("is_failure_terminal_event")
            & pl.col("time_to_terminal_event_us").is_not_null()
            # A negative gap means the terminal event preceded the window end.
            & (pl.col("time_to_terminal_event_us") >= 0)
            & (pl.col("time_to_terminal_event_us") <= horizon_us)
        ).alias("failure_within_horizon")
    ])
)
This is the kind of code that looks plain, but it decides whether the entire project is meaningful. If the label leaks future state incorrectly, every later model and dashboard can look impressive while being nonsense. I had to keep reminding myself that the project was only as good as the boring label plumbing.
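The boundary conditions are easier to check in a scalar version of the same rule. The sketch below restates the label for a single row in plain Python. The 15-minute horizon and the sample timestamps are made-up values for illustration, and the function is a stand-in for the expression, not the project's code.

```python
from typing import Optional

HORIZON_US = 15 * 60 * 1_000_000  # hypothetical 15-minute horizon, in microseconds


def failure_within_horizon(
    end_time_us: int,
    last_event_time_us: Optional[int],
    is_failure_terminal_event: bool,
    horizon_us: int = HORIZON_US,
) -> bool:
    """Label one usage window: True only for a failure inside [0, horizon]."""
    if not is_failure_terminal_event or last_event_time_us is None:
        return False
    time_to_terminal_event_us = last_event_time_us - end_time_us
    # Negative: the event happened before the window ended, so it is not a forecast.
    # Larger than the horizon: the event is too far away to count for this window.
    return 0 <= time_to_terminal_event_us <= horizon_us


window_end = 1_000_000_000
cases = {
    "failure 5 min after window end": (window_end + 5 * 60 * 1_000_000, True),
    "failure before window end": (window_end - 1, True),
    "failure exactly at horizon": (window_end + HORIZON_US, True),
    "failure just past horizon": (window_end + HORIZON_US + 1, True),
    "non-failure inside horizon": (window_end + 1, False),
    "no terminal event recorded": (None, True),
}
labels = {
    name: failure_within_horizon(window_end, event_time, is_failure)
    for name, (event_time, is_failure) in cases.items()
}
for name, label in labels.items():
    print(f"{name:32s} {label}")
```

Both ends of the interval are inclusive, matching the `>= 0` and `<= horizon_us` comparisons in the expression above.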
What I wanted the project to become
At this point I was not thinking in terms of a final dashboard yet. I was thinking in terms of an experiment that could eventually answer questions like: if risk rises before a visible failure, what should the controller do? If demand is low, can efficiency actions happen without stepping on safety? If queue health degrades, should admission control override power savings? And if all of that happens, can I see the actual reasoning and not just the final number?
That is why the first entry in this series starts here, before the dashboard screenshots and before the multi-agent stack. The Kubernetes operating friction came first. The model was my first attempt to make that frustration measurable. The dashboard came later because I eventually stopped trusting invisible experiments.
This phase mattered because it turned an operational feeling into an engineering question. The project was no longer about whether I could train a model on a public trace dataset. It was about whether prediction could become a visible, inspectable control signal. That distinction shaped every later decision in the series.