Benchmark results demonstrate how CTGT's policy engine substantially improves AI accuracy and reduces hallucinations, outperforming RAG pipelines and prompt engineering approaches in nearly every model and configuration tested.
Our policy engine acts as a compiler for AI governance. It takes general goals and compiles them into specific, constrained reasoning processes for the LLM to execute. This is a fundamental layer of control that standard RAG pipelines and base model APIs lack, which is why our approach unlocks superior performance in accuracy and reliability.
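To make the idea concrete, here is a minimal sketch of what compiling a governance goal into per-call constraints can look like. It is illustrative only: the `CompiledPolicy`, `compile_policy`, and `constrained_answer` names and the hard-coded constraints are hypothetical, not CTGT's actual API.

```python
# Minimal sketch of the "policy as compiler" idea, assuming a generic
# chat-completion callable. All names and constraints here are
# illustrative placeholders, not CTGT's actual API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CompiledPolicy:
    """A high-level governance goal expanded into concrete constraints."""
    goal: str
    constraints: list[str] = field(default_factory=list)

def compile_policy(goal: str) -> CompiledPolicy:
    # A real engine would derive constraints per model and per task;
    # this example hard-codes rules for a "no fabrication" goal.
    return CompiledPolicy(
        goal=goal,
        constraints=[
            "Answer only with claims supported by the provided context.",
            "If support is missing, say so instead of guessing.",
            "State uncertainty explicitly rather than fabricating details.",
        ],
    )

def constrained_answer(llm: Callable[[str, str], str],
                       question: str, policy: CompiledPolicy) -> str:
    # The compiled constraints travel with every call, so the model
    # executes a constrained reasoning process rather than a loose prompt.
    system = policy.goal + "\n" + "\n".join(f"- {c}" for c in policy.constraints)
    return llm(system, question)

# Example usage with a stand-in model that simply echoes its inputs.
if __name__ == "__main__":
    policy = compile_policy("Do not state unverifiable facts.")
    stub_llm = lambda system, prompt: f"[system]\n{system}\n[user]\n{prompt}"
    print(constrained_answer(stub_llm, "Who founded CTGT?", policy))
```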
HaluEval measures a model's ability to identify and avoid generating false or fabricated information. We compare our policy engine against baseline models, a standard enterprise RAG pipeline, and Anthropic's Constitutional AI system prompt, and find consistent gains over each alternative; a minimal sketch of the scoring loop follows the table.
| Model | Base | + CTGT Policy | + RAG | + Constitutional |
|---|---|---|---|---|
| GPT-120B-OSS (OSS) | 92.68% | 96.50% | 92.31% | 87.50% |
| Gemini 2.5 Flash-Lite | 91.96% | 93.77% | 79.18% | 82.14% |
| Claude 4.5 Sonnet (Frontier) | 93.77% | 94.46% | 84.88% | 67.57% |
| Claude 4.5 Opus (Frontier) | 95.08% | 95.30% | 90.87% | 77.92% |
| Gemini 3 Pro Preview (Frontier) | 95.94% | 96.44% | 86.63% | 56.10% |
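The scoring loop for a comparison like this can be sketched as follows. The example assumes a HaluEval-style dataset where each item carries a question, a candidate answer, and a ground-truth label ("yes" = hallucinated), and treats each configuration as a callable returning the model's yes/no judgment; the data format and stand-in judge are assumptions, not our internal harness.

```python
# Sketch of a HaluEval-style scoring loop. The dataset format and the
# stand-in judge below are assumptions, not CTGT's internal harness.
from typing import Callable, Iterable

def accuracy(examples: Iterable[dict], judge: Callable[[str, str], str]) -> float:
    """Fraction of examples where the judged label matches the ground truth."""
    total = 0
    correct = 0
    for ex in examples:
        pred = judge(ex["question"], ex["answer"]).strip().lower()
        correct += int(pred == ex["label"])  # label is "yes" or "no"
        total += 1
    return correct / total if total else 0.0

# Tiny in-memory sample to show the shape of the comparison.
sample = [
    {"question": "What is the capital of France?", "answer": "Lyon", "label": "yes"},
    {"question": "What is the capital of France?", "answer": "Paris", "label": "no"},
]
baseline_judge = lambda q, a: "no"  # stand-in for an un-governed base model
print(f"baseline accuracy: {accuracy(sample, baseline_judge):.2%}")  # 50.00%
```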
TruthfulQA is a "closed-book" test measuring a model's ability to provide truthful answers even when common misconceptions are prevalent. In most configurations, our methodology outperforms both standard enterprise RAG pipelines and Anthropic's Constitutional AI system prompt in guiding the model to prioritize accuracy over popular belief.
| Model | Base | + CTGT Policy | + RAG | + Constitutional |
|---|---|---|---|---|
| GPT-120B-OSS (OSS) | 21.30% | 70.62% | 63.40% | 43.70% |
| Gemini 2.5 Flash-Lite | 60.34% | 66.46% | 64.63% | 56.06% |
| Claude 4.5 Sonnet (Frontier) | 81.27% | 87.76% | 84.33% | 77.72% |
| Claude 4.5 Opus (Frontier) | 75.52% | 78.12% | 82.37% | 79.66% |
| Gemini 3 Pro Preview (Frontier) | 72.04% | 78.20% | 83.61% | 37.46%* |
Note: *Gemini 3 Pro Preview exhibited an elevated refusal rate in the Constitutional configuration, making direct accuracy comparisons unreliable for this specific mode. The reported 37.46% reflects accuracy among answered questions only.
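To clarify what the asterisk means in practice: accuracy among answered questions conditions on non-refusals, so it can diverge sharply from overall accuracy when refusals are frequent. The counts below are made up purely to illustrate the arithmetic; they are not benchmark data.

```python
# Illustrative (made-up) counts showing how refusal-adjusted accuracy
# differs from overall accuracy when the refusal rate is high.
answered = 300   # questions the model actually answered
correct = 112    # of those, how many were truthful
refused = 500    # questions the model declined to answer
total = answered + refused

accuracy_among_answered = correct / answered  # the kind of figure reported in the table
overall_accuracy = correct / total            # refusals counted as incorrect

print(f"among answered: {accuracy_among_answered:.2%}")  # 37.33%
print(f"overall:        {overall_accuracy:.2%}")         # 14.00%
```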
Beyond improving frontier models, our policy engine enables smaller, cost-efficient models to match or exceed the base performance of the most expensive systems on benchmarks such as HaluEval, opening new possibilities for enterprise deployment at scale.
This means organizations can achieve frontier-level reliability with significantly reduced compute costs, a critical advantage for enterprises deploying AI at scale across regulated industries.
See how CTGT's policy engine transforms model responses from unreliable to enterprise-ready across challenging scenarios.
Our method represents a more advanced, programmatic approach to AI reliability, delivering accuracy beyond what fine-tuning and RAG provide, without the associated cost and complexity.
Request Demo