The first time the pipeline looked stable, I did not trust it. That instinct was useful. Stable output is not the same thing as correct output, especially when the data is large enough to hide small mistakes.

The dangerous failures were not the loud crashes. Loud crashes stop the run and force attention. The failures that worried me were quieter: nullable fields interpreted too casually, terminal events attached to rows in the wrong temporal direction, schema changes that still produced parquet files, and labels that looked plausible while poisoning the training set.

This phase was about treating data repair as part of the system, not as cleanup after the real work. If the project was going to make decisions from model output, label integrity had to become an explicit engineering concern.

Schema drift as a real error

The orchestrator side eventually had to ingest grouped traces, synthetic traces, and live Kubernetes snapshots. That made schema validation more important than I expected. I wrote validators that treated drift as a named problem instead of letting Python fail somewhere later with a vague KeyError or TypeError.

from typing import Any

def _parse_int(value: Any, *, field: str, row_index: int) -> int:
    # Convert to int, turning a low-level failure into a named schema-drift error.
    try:
        return int(value)
    except (TypeError, ValueError) as exc:
        raise ValueError(
            f"Schema drift: row[{row_index}] field {field!r} must be integer-like, got {value!r}."
        ) from exc

def _validate_grouped_row(row: dict[str, Any], row_index: int) -> None:
    # A missing key is reported as drift instead of surfacing later as a KeyError.
    if "timestamp" not in row:
        raise ValueError(f"Schema drift: grouped row[{row_index}] missing timestamp key.")
    _parse_int(row["timestamp"], field="timestamp", row_index=row_index)

That style made the code a little less pretty, but it made failures readable. When I was running long local experiments, readable failure messages mattered more than elegance.
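The same idea can be pushed one step further by collecting every drift message in a batch instead of stopping at the first one. The sketch below is an assumption, not code from the repository: collect_drift_errors and the sample rows are hypothetical, and the grouped-row validator is re-declared in compact form only so the block runs on its own.

```python
from typing import Any


def validate_grouped_row(row: dict[str, Any], row_index: int) -> None:
    """Compact stand-in for the grouped-row validator (hypothetical re-declaration)."""
    if "timestamp" not in row:
        raise ValueError(f"Schema drift: grouped row[{row_index}] missing timestamp key.")
    try:
        int(row["timestamp"])
    except (TypeError, ValueError) as exc:
        raise ValueError(
            f"Schema drift: row[{row_index}] field 'timestamp' must be integer-like, "
            f"got {row['timestamp']!r}."
        ) from exc


def collect_drift_errors(rows: list[dict[str, Any]]) -> list[str]:
    """Validate every row and return all drift messages, not just the first."""
    errors: list[str] = []
    for index, row in enumerate(rows):
        try:
            validate_grouped_row(row, index)
        except ValueError as exc:
            errors.append(str(exc))
    return errors


# Hypothetical sample: one good row, one renamed key, one non-numeric value.
sample_rows = [
    {"timestamp": "1700000000"},
    {"time_stamp": 1700000001},
    {"timestamp": "n/a"},
]
drift_errors = collect_drift_errors(sample_rows)
for message in drift_errors:
    print(message)
```

Whether to fail fast or collect is a judgment call. Failing fast suits a long run that should not continue on bad input, while collecting gives a fuller picture when inspecting a new trace source.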

The label bug I kept checking for

The most important label check was temporal. A row should not become positive because of a terminal event that had already happened before the usage window ended. That sounds obvious, but when the dataset is built from separate usage and event sources, it is easy to join them so that an outcome already known inside the window leaks into the row and its label.

The repair was to keep the time-to-terminal calculation visible and to preserve a separate flag for terminal events that had already occurred before the window end.

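import polars as pl

# Positive label: a failure terminal event with a known, non-negative
# time-to-terminal no larger than horizon_us (defined elsewhere in the pipeline).
(pl.col("is_failure_terminal_event")
 & pl.col("time_to_terminal_event_us").is_not_null()
 & (pl.col("time_to_terminal_event_us") >= 0)
 & (pl.col("time_to_terminal_event_us") <= horizon_us)
).alias("failure_within_horizon")

# Debug flag: a terminal event exists and its timestamp precedes the window end.
(pl.col("final_event_type").is_not_null()
 & (pl.col("last_event_time") < pl.col("end_time"))
).alias("terminal_event_before_window_end")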

I did not want to delete that second flag just because it was not the training target. It was useful during debugging because it gave me a way to see when terminal state was attached to a row in a suspicious way.
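To make the edge cases of the label rule concrete, here is a row-level mirror of the failure_within_horizon expression in plain Python. It is a sketch for reasoning and unit testing, not the pipeline's code. It assumes time_to_terminal_event_us is measured from the window end, so a negative value means the event came first, and the 60-second horizon is a hypothetical value.

```python
from typing import Optional


def failure_within_horizon(
    is_failure_terminal_event: bool,
    time_to_terminal_event_us: Optional[int],
    horizon_us: int,
) -> bool:
    """Positive only for a failure terminal event with a known,
    non-negative time-to-terminal that does not exceed the horizon."""
    if not is_failure_terminal_event:
        return False
    if time_to_terminal_event_us is None:
        return False
    return 0 <= time_to_terminal_event_us <= horizon_us


HORIZON_US = 60_000_000  # hypothetical 60-second horizon, in microseconds

cases = {
    "inside horizon": failure_within_horizon(True, 5_000_000, HORIZON_US),
    "already past": failure_within_horizon(True, -1, HORIZON_US),
    "beyond horizon": failure_within_horizon(True, HORIZON_US + 1, HORIZON_US),
    "no terminal time": failure_within_horizon(True, None, HORIZON_US),
    "non-failure terminal": failure_within_horizon(False, 5_000_000, HORIZON_US),
}
for name, label in cases.items():
    print(f"{name}: {label}")
```

Only the first case is positive. The "already past" case is the one the second flag exists to surface: the event is real, but it must not turn the row into a positive example.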

Repair reports, not just repaired code

Another thing I changed in this phase was how I recorded progress. If I only fixed scripts and moved on, I would forget why a repair existed. The reports in the repository became a lightweight lab notebook: what was broken, what was repaired, what remained risky, and which generated artifacts were expected after the next run.

That habit helped later when the project split into several tracks: baseline forecaster, advanced XGBoost, orchestrator stack, Optuna/Ray tuning, local dual-cluster comparison, and dashboard work. Without progress reports, those tracks would have blurred together.
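The four questions a report answered map naturally onto a small record type. The sketch below is one possible shape, not the format the repository uses: RepairReport, its field names, and the sample entry are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RepairReport:
    """One lab-notebook entry; field names are hypothetical."""

    title: str
    broken: list[str] = field(default_factory=list)
    repaired: list[str] = field(default_factory=list)
    still_risky: list[str] = field(default_factory=list)
    expected_artifacts: list[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Render the entry as plain text that can live next to the code."""
        sections = [
            ("Broken", self.broken),
            ("Repaired", self.repaired),
            ("Still risky", self.still_risky),
            ("Expected artifacts after next run", self.expected_artifacts),
        ]
        lines = [self.title]
        for heading, items in sections:
            lines.append(f"{heading}:")
            # An empty section is written out explicitly so gaps stay visible.
            lines.extend(f"  - {item}" for item in (items or ["(none)"]))
        return "\n".join(lines)


# Illustrative entry only; the wording is invented for this example.
report = RepairReport(
    title="Label repair pass",
    broken=["terminal events before the window end could produce positive rows"],
    repaired=["label now requires 0 <= time_to_terminal_event_us <= horizon_us"],
    expected_artifacts=["regenerated parquet", "retrained baseline model"],
)
print(report.to_text())
```

Printing "(none)" for an empty section is deliberate: a report that silently omits "still risky" reads as if nothing was risky.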

The repair loop was slow but necessary

The least fun part was re-running work after a repair. A schema fix could force regenerated parquet. A label fix could force retraining. A feature change could make older metrics no longer comparable. It felt slow, but it was better than building a dashboard on top of a rotten label definition.
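One way to make "no longer comparable" mechanical is to tag every metrics record with a fingerprint of the data contract it was produced under, and to compare runs only when the fingerprints match. This is a sketch under assumptions, not something the project is known to do: contract_fingerprint, comparable, and the field and feature names in the example are hypothetical.

```python
import hashlib
import json


def contract_fingerprint(
    schema_fields: list[str], label_rule: str, feature_names: list[str]
) -> str:
    """Hash the parts of the data contract that change what a metric means."""
    payload = json.dumps(
        {
            "schema_fields": sorted(schema_fields),
            "label_rule": label_rule,
            "feature_names": sorted(feature_names),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def comparable(run_a: dict, run_b: dict) -> bool:
    """Two metric records are comparable only under the same contract."""
    return run_a["contract"] == run_b["contract"]


# Hypothetical contract before and after a label repair.
schema = ["timestamp", "end_time", "last_event_time"]
features = ["cpu_mean", "mem_max"]  # illustrative feature names
old_contract = contract_fingerprint(schema, "time_to_terminal_event_us <= horizon_us", features)
new_contract = contract_fingerprint(
    schema, "0 <= time_to_terminal_event_us <= horizon_us", features
)

old_run = {"contract": old_contract, "metric_name": "auc"}
new_run = {"contract": new_contract, "metric_name": "auc"}
print(comparable(old_run, new_run))
```

The field and feature lists are sorted before hashing so that reordering columns does not look like a contract change, while any edit to the label rule does.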

I had a simple rule during this part: if I could not explain what a row meant, I did not want to train on it. That rule sounds a bit dramatic, but it kept the project grounded. Later, when I watched dashboards show risk, queue length, decisions, and reward traces, I knew those values came from contracts I had already fought with.

What changed after the repairs

After the repair pass, I was more comfortable moving to advanced modeling. The target label had a clearer temporal meaning, grouped trace ingestion had stricter validation, and artifact layout was less ambiguous. I still expected problems, but the project was no longer held together by hope and print statements.

This is also where the project started to feel less like a one-off classifier and more like a systems experiment. The data contract, not the model, became the center of gravity.

The repair pipeline did more than make the scripts pass. It changed the trust boundary of the project. From this point on, a successful run had to produce not only data, but evidence that the data deserved to be used.