Overview
Model evaluation is the most abused step in machine learning. Cross-validation on the training set is used to select hyperparameters, and then the same cross-validation score is reported as the model's expected performance. The test set is evaluated repeatedly until the model performs well on it. The evaluation metric is chosen because it produces a high number, not because it matches the business objective. The result is a model that appears to perform well in development and fails in production.
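To make the cost of test set reuse concrete, here is a minimal simulation sketch. It assumes a deliberately artificial setup: every candidate "model" is a coin-flip guesser with a true accuracy of 0.5, and the function name and parameters are hypothetical, chosen only for illustration. Picking the best of many such candidates on the same test set still yields a score above 0.5, even though none of them has any signal.

```python
import random


def simulate_test_set_reuse(n_models=50, n_test=200, seed=0):
    """Score many signal-free models on one test set and keep the best.

    Returns (selected_score, fresh_score, mean_score).
    """
    rng = random.Random(seed)

    def accuracy_of_random_guesser():
        # A model with no signal: each prediction is correct with probability 0.5.
        return sum(rng.random() < 0.5 for _ in range(n_test)) / n_test

    # Evaluate every candidate against the same held-out set.
    test_scores = [accuracy_of_random_guesser() for _ in range(n_models)]

    # "Reuse the test set until the model looks good": report the maximum.
    selected_score = max(test_scores)

    # The selected model has no signal, so on fresh data it is just another draw.
    fresh_score = accuracy_of_random_guesser()

    mean_score = sum(test_scores) / n_models
    return selected_score, fresh_score, mean_score


if __name__ == "__main__":
    selected, fresh, mean = simulate_test_set_reuse()
    print(f"best score on reused test set: {selected:.3f}")
    print(f"same model on fresh data:      {fresh:.3f}")
    print(f"average across all candidates: {mean:.3f}")
```

The gap between the selected score and the average across candidates is pure selection bias. Real models do carry signal, so the effect is usually smaller, but it has the same direction.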
A rigorous evaluation framework has three properties: the test set is used exactly once, the evaluation metric matches the business objective, and the validation strategy accounts for the data's structure (temporal, geographic, hierarchical). Without all three, the reported performance is optimistic by an unknown amount.
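As an illustration of the third property, the sketch below shows two structure-aware splitters written with the standard library only. The function names are hypothetical, and the temporal splitter assumes rows are already sorted by time. It is one possible way to respect temporal and group structure, not a prescribed implementation.

```python
from collections import defaultdict


def expanding_window_splits(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs where training always precedes testing.

    Assumes rows are already sorted chronologically.
    """
    fold_size = n_samples // (n_folds + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of folds")
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover rows.
        test_end = n_samples if k == n_folds else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


def group_splits(groups, n_folds):
    """Yield (train_idx, test_idx) pairs so no group lands on both sides.

    `groups` holds one label per row (for example a customer or region ID).
    """
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    unique_groups = sorted(members)
    if len(unique_groups) < n_folds:
        raise ValueError("need at least as many groups as folds")
    for k in range(n_folds):
        # Assign every n_folds-th group to the test side of fold k.
        test_groups = set(unique_groups[k::n_folds])
        test_idx = [i for g in sorted(test_groups) for i in members[g]]
        train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train_idx, test_idx


if __name__ == "__main__":
    for train, test in expanding_window_splits(10, 3):
        print("temporal:", train, test)
    for train, test in group_splits(["a", "a", "b", "c", "c", "d"], 2):
        print("grouped: ", train, test)
```

A random shuffle would let future rows or sibling rows from the same group leak into training, which is one source of the optimism described above.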
The Model Evaluation & Validation Framework Prompt generates a complete evaluation specification: validation strategy selection by data structure, metric selection by business objective, diagnostic protocol for overfitting and distribution shift, and a production monitoring plan that detects performance degradation after deployment.
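For a sense of what the distribution shift part of such a specification might contain, here is a minimal sketch of the Population Stability Index (PSI) for a single numeric feature. The choice of PSI, the equal-width binning, and the smoothing constant `eps` are assumptions made for illustration. They are not taken from the prompt's actual output.

```python
import math


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """Compare two samples of one numeric feature; larger values mean more shift.

    Bins are equal-width over the range of the reference sample.
    """
    lo, hi = min(reference), max(reference)
    # Fall back to a width of 1.0 if the reference sample is constant.
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # Smooth empty bins so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


if __name__ == "__main__":
    training_sample = [i / 100 for i in range(100)]
    production_sample = [0.5 + i / 200 for i in range(100)]
    print(population_stability_index(training_sample, training_sample))
    print(population_stability_index(training_sample, production_sample))
```

A commonly cited rule of thumb treats PSI above roughly 0.25 as a significant shift, but any alert threshold should be tuned to the feature and the cost of a false alarm.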
What you get:
- Validation strategy selection by data structure
- Business-objective-driven metric selection
- Overfitting and underfitting diagnostic protocol
- Distribution shift detection
- Production performance monitoring plan
Built for: ML engineers and data scientists who need evaluation frameworks that produce honest performance estimates and catch degradation in production.