Run Benchmark
Validation & Charter
Evaluate the production model on a certified slice. Compare baselines on Baseline Comparison.
Benchmark catalog (static)
Select a theme below or write your own. The YAML exam runner ships in Phase B.
- BEIR MS MARCO slice · tier basic
- Multi-hop bridge docs · tier advanced
- HyDE query augmentation · tier advanced
- Hybrid BM25 + dense · tier intermediate
Validation Pilot — live search
Production retrieval path. Feedback is recorded for Validation → R&D review.
Need more benchmark runs?
Evaluation credits from the storefront fund additional harness runs and certification submissions. Self-serve certification is available at /certify.
Buy evaluation credits →