[Borg-Orchestrator 07] Reward Tuning with Optuna, Ray, and RLlib

Once the live loop existed, the problem shifted from model performance to control behavior. A controller can be accurate in one narrow sense and still behave badly if the reward function teaches it the wrong priority.

That is why Optuna and Ray/RLlib entered the project. I did not want tuning to remain invisible. If reward weights changed, the dashboard needed to show the trial. If RL was too expensive to keep in the fast loop, that decision also needed to be visible. The experiment had to preserve the reasoning trail, not only the final chosen parameters.

This phase was about making optimization accountable. The question was not simply which configuration wins. It was whether the search process itself could be inspected well enough to trust the behavior it produced.

The reward problem

The orchestrator had three competing instincts: keep tasks alive, reduce waste, and protect the queue. I represented those as the Agent A, Agent B, and Agent C rewards, and combined them into one score weighted by alpha, beta, and gamma. At first that felt too simple, but it was useful because I could see how changing the weights changed the controller's personality.

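from dataclasses import dataclass


@dataclass(slots=True)
class Score:
    # Per-agent rewards keyed by agent name ("AgentA", "AgentB", "AgentC").
    raw_rewards: dict[str, float]
    # Weights applied to the Agent A, Agent B, and Agent C rewards.
    alpha: float = 1.0
    beta: float = 1.0
    gamma: float = 1.0

    @property
    def total(self) -> float:
        # Weighted sum; an agent with no reported reward contributes 0.0.
        return (
            self.alpha * self.raw_rewards.get("AgentA", 0.0)
            + self.beta * self.raw_rewards.get("AgentB", 0.0)
            + self.gamma * self.raw_rewards.get("AgentC", 0.0)
        )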

I did not treat this as a perfect reward design. It was a practical controller surface. If Agent A dominated everything, the system became safety-heavy. If Agent B was too attractive, the controller could chase efficiency while queue health was poor. If Agent C was too strong, admission behavior could become overly defensive.

Optuna as a visible search process

Optuna gave me a way to tune reward weights and policy parameters without pretending I had discovered perfect constants by hand. The important part was exporting trial history and reflecting it in the dashboard. I wanted to see completed trials, best values, and selected parameters as part of the runtime, not as a separate notebook I would forget to update.

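import optuna
from optuna.trial import TrialState

study = optuna.create_study(
    direction="maximize",
    storage=storage,
    study_name=study_name,
    load_if_exists=True,
)

# study.best_trial raises ValueError until at least one trial has completed,
# so check for completed trials before reading best_value or best_params.
completed = study.get_trials(deepcopy=False, states=(TrialState.COMPLETE,))

state.optuna_history(
    study_name,
    history,
    best_value=study.best_value if completed else None,
    best_params=dict(study.best_params) if completed else {},
    status="running",
)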

The trial values were less important than the workflow. A controller experiment should make its tuning state visible. If I publish a dashboard screenshot where Optuna is disabled, the reader should know it was a fast run. If I publish one where Optuna completed trials, the dashboard should show the best parameters and history.
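One way to build that dashboard payload is sketched below. It is not the project's actual exporter: export_history, the payload keys, and the "disabled" status label are assumptions made for illustration. The function only reads the attributes that Optuna's FrozenTrial objects expose (number, value, params, and state), so it could be fed study.trials; the demo uses stand-in objects so it runs without a study.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def export_history(trials: Iterable[Any], enabled: bool = True) -> dict[str, Any]:
    """Build a dashboard payload from Optuna-style trial objects (hypothetical helper)."""
    empty = {"history": [], "best_value": None, "best_params": {}}
    if not enabled:
        # A fast run says so explicitly instead of showing an empty panel.
        return {"status": "disabled", **empty}

    # Keep only completed trials that actually reported a value.
    history = [
        {"number": t.number, "value": t.value, "params": dict(t.params)}
        for t in trials
        if t.state.name == "COMPLETE" and t.value is not None
    ]
    if not history:
        return {"status": "running", **empty}

    # The study in this post maximizes, so the best trial has the largest value.
    best = max(history, key=lambda row: row["value"])
    return {
        "status": "running",
        "history": history,
        "best_value": best["value"],
        "best_params": best["params"],
    }


def fake_trial(number: int, value, state: str, **params: float) -> SimpleNamespace:
    """Stand-in with the same attribute names as an Optuna FrozenTrial."""
    return SimpleNamespace(
        number=number, value=value, params=params, state=SimpleNamespace(name=state)
    )


demo_trials = [
    fake_trial(0, 1.2, "COMPLETE", alpha=1.0, beta=0.5, gamma=0.8),
    fake_trial(1, 1.7, "COMPLETE", alpha=1.4, beta=0.3, gamma=1.1),
    fake_trial(2, None, "RUNNING", alpha=0.9, beta=0.9, gamma=0.9),
]

payload = export_history(demo_trials)
fast_run = export_history(demo_trials, enabled=False)
print(payload["best_value"], payload["best_params"], fast_run["status"])
```

Returning an explicit "disabled" payload is the design choice that matters here: an empty panel and a deliberately skipped search should not look the same in a screenshot.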

Ray/RLlib was useful but expensive to keep in the loop

Ray/RLlib gave me a more formal multi-agent policy path. The environment exposed AgentA, AgentB, and AgentC as separate agents with shared observation vectors and separate action spaces. That matched the architecture well, but it was costly in local development: bootstrapping a PPO policy slows iteration, so I often kept it disabled while debugging live Kubernetes behavior.

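# Assumes Gymnasium spaces; this fragment lives inside the environment's __init__.
from gymnasium.spaces import Box, Discrete

self.possible_agents = ["AgentA", "AgentB", "AgentC"]

# Every agent sees the same 6-value observation layout, normalized to [0, 1].
self.observation_spaces = {
    "AgentA": Box(low=0.0, high=1.0, shape=(6,), dtype=float),
    "AgentB": Box(low=0.0, high=1.0, shape=(6,), dtype=float),
    "AgentC": Box(low=0.0, high=1.0, shape=(6,), dtype=float),
}

# Each agent picks one discrete action from its own policy space.
self.action_spaces = {
    "AgentA": Discrete(POLICY_SPACES["AgentA"].action_count),
    "AgentB": Discrete(POLICY_SPACES["AgentB"].action_count),
    "AgentC": Discrete(POLICY_SPACES["AgentC"].action_count),
}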
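The snippet above only shows that each POLICY_SPACES entry has an action_count. As a rough sketch of what such a registry could look like, the version below stores named actions per agent and derives the count from them. The PolicySpace class, the decode helper, and every action name are hypothetical, chosen only to match the three instincts described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicySpace:
    """Hypothetical container for one agent's named discrete actions."""

    actions: tuple[str, ...]

    @property
    def action_count(self) -> int:
        # This is the value that would size Discrete(...) for the agent.
        return len(self.actions)

    def decode(self, index: int) -> str:
        """Map a discrete action index back to a readable action name."""
        if not 0 <= index < self.action_count:
            raise ValueError(f"action index {index} out of range")
        return self.actions[index]


# Illustrative action names only; the real project may define different ones.
POLICY_SPACES = {
    "AgentA": PolicySpace(("hold", "restart_task", "add_replica")),
    "AgentB": PolicySpace(("hold", "scale_down", "consolidate_nodes")),
    "AgentC": PolicySpace(("admit", "throttle", "defer")),
}

# A sampled joint action, one index per agent, decoded for display.
joint_action = {"AgentA": 2, "AgentB": 0, "AgentC": 1}
readable = {
    agent: POLICY_SPACES[agent].decode(index)
    for agent, index in joint_action.items()
}
print(readable)
```

Keeping names next to indices means a dashboard can show "throttle" rather than "1" when it displays agent proposals.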

Why the dashboard mattered here

Tuning without dashboard state felt like changing knobs in a dark room. The learning/reward view let me see whether reward history, current action, agent proposals, Optuna state, and Ray status agreed with the run mode. When something looked off, I could tell whether I had a tuning problem, a disabled-feature problem, or a live data problem.
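That triage can be written down as a small consistency check. The sketch below is an illustration under assumptions, not the project's dashboard code: RunSnapshot, its field names, and the mode and status labels are all hypothetical. It flags disabled-feature and live-data mismatches first, so that whatever remains is more likely a tuning problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSnapshot:
    """Hypothetical summary of what the dashboard reports for one run."""

    run_mode: str       # "deterministic", "optuna", or "rllib"
    optuna_status: str  # "disabled", "running", or "completed"
    ray_status: str     # "disabled", "bootstrapping", or "ready"
    reward_points: int  # number of reward samples received so far


def diagnose(snapshot: RunSnapshot) -> list[str]:
    """List mismatches between the declared run mode and the reported state."""
    problems: list[str] = []
    if snapshot.run_mode == "optuna" and snapshot.optuna_status == "disabled":
        problems.append("disabled feature: run mode is optuna but Optuna is off")
    if snapshot.run_mode == "rllib" and snapshot.ray_status == "disabled":
        problems.append("disabled feature: run mode is rllib but Ray is off")
    if snapshot.reward_points == 0:
        problems.append("live data: no reward history has arrived")
    return problems


healthy = RunSnapshot("optuna", "running", "disabled", reward_points=120)
broken = RunSnapshot("rllib", "disabled", "disabled", reward_points=0)

print(diagnose(healthy))
print(diagnose(broken))
```

An empty list does not prove the tuning is good. It only rules out the two cheaper explanations before time is spent on reward weights.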

This phase also made me less interested in claiming one magic policy. The project was more interesting as a visible control experiment. Sometimes I wanted the deterministic agents and referee because they were readable. Sometimes I wanted Optuna search. Sometimes I wanted RLlib policy bootstrapping. The dashboard needed to show which one was active.

What this phase proved

By the end of reward tuning, the system could expose reward trajectories, optimization state, and policy state alongside live controller decisions. The next step was harder: comparing the experimental controller against a baseline that Kubernetes people actually recognize: HPA plus Karpenter-like capacity behavior.

Reward tuning made the project feel closer to real control engineering. It forced every objective to compete with another objective, and it made the tradeoffs visible enough to discuss instead of hiding them inside a single score.