Benchmarks | CTGT Policy Engine
Performance Benchmarks

The frontier of AI Governance.

Benchmark results demonstrate how CTGT's policy engine dramatically improves AI accuracy and sharply reduces hallucinations, outperforming RAG pipelines and prompt-engineering approaches across every model tested.

3.3×
Accuracy Multiplier
+49pt
Truthfulness Gain
96.5%
Hallucination Prevention

Our policy engine acts as a compiler for AI governance. It takes general goals and compiles them into specific, constrained reasoning processes for the LLM to execute. This is a fundamental layer of control that standard RAG pipelines and base model APIs lack, which is why our approach unlocks superior performance in accuracy and reliability.
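To make the "compiler" analogy concrete, here is an illustrative sketch only: CTGT's actual engine is proprietary, and the `Policy` type and `compile_policy` function below are hypothetical names. The sketch shows the general idea of compiling a broad governance goal into explicit, checkable constraints on the model's reasoning.

```python
# Hypothetical sketch of the "policy as compiler" idea; not CTGT's real API.
from dataclasses import dataclass

@dataclass
class Policy:
    goal: str                # general governance goal
    constraints: list[str]   # specific rules derived from that goal

def compile_policy(policy: Policy) -> str:
    """Compile a general goal into concrete reasoning constraints for the model."""
    rules = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(policy.constraints))
    return (
        f"Goal: {policy.goal}\n"
        "Before answering, verify each step against these rules:\n"
        f"{rules}\n"
        "If any rule cannot be satisfied, say so instead of guessing."
    )

policy = Policy(
    goal="Answer only from verifiable evidence.",
    constraints=[
        "Cite the passage that supports each claim.",
        "Treat unstated details as unknown, not as defaults.",
    ],
)
print(compile_policy(policy))
```

The point of the sketch is the separation of concerns: the goal is stated once, while the compiled constraints govern every individual model call.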

Benchmark 01

HaluEval: Minimizing Hallucinations

HaluEval measures a model's ability to identify and avoid generating false or fabricated information. We compare our policy engine against baseline models, a standard enterprise RAG pipeline, and Anthropic's Constitutional AI system prompt to demonstrate consistent improvement across approaches.
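The scoring behind these numbers reduces to labeled-example accuracy. The sketch below shows the shape of a HaluEval-style evaluation loop; the `judge` function is a toy stand-in for a real model call, and the example data is invented for illustration.

```python
# Minimal sketch of HaluEval-style scoring: each example pairs a model judgment
# ("hallucinated" / "faithful") with a gold label; the score is the fraction correct.

def score(examples, judge):
    correct = sum(1 for ex in examples if judge(ex["answer"]) == ex["label"])
    return correct / len(examples)

# Toy stand-in judge: flags answers carrying an unsupported-content marker.
examples = [
    {"answer": "Supported by the context.", "label": "faithful"},
    {"answer": "[unsupported] Invented detail.", "label": "hallucinated"},
    {"answer": "[unsupported] Fabricated date.", "label": "hallucinated"},
]
judge = lambda a: "hallucinated" if "[unsupported]" in a else "faithful"
print(f"{score(examples, judge):.1%}")  # prints "100.0%" on this toy data
```

In the real benchmark the judge is the model under test and the percentages in the tables below are this same accuracy computed over the full HaluEval set.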

Policy Engine Performance Lift
Average baseline across all models: 93.9% → with CTGT Policy Engine: 95.3%, a consistent improvement on every model tested.
- Base: unmodified model with standard prompting
- + CTGT Policy: CTGT's policy engine applied
- + RAG: standard enterprise RAG pipeline
- + Constitutional: Anthropic's Constitutional AI system prompt
Performance by Configuration
4 Approaches Compared
| Model | Base | + CTGT Policy | + RAG | + Constitutional |
|---|---|---|---|---|
| GPT-120B-OSS (OSS) | 92.68% | 96.50% | 92.31% | 87.50% |
| Gemini 2.5 Flash-Lite | 91.96% | 93.77% | 79.18% | 82.14% |
| Claude 4.5 Sonnet (Frontier) | 93.77% | 94.46% | 84.88% | 67.57% |
| Claude 4.5 Opus (Frontier) | 95.08% | 95.30% | 90.87% | 77.92% |
| Gemini 3 Pro Preview (Frontier) | 95.94% | 96.44% | 86.63% | 56.10% |
Baseline vs. CTGT Policy-Enhanced
| Model | Baseline | + CTGT | Lift |
|---|---|---|---|
| GPT-120B-OSS (OSS) | 92.68% | 96.50% | +3.82 pts |
| Gemini 2.5 Flash-Lite | 91.96% | 93.77% | +1.81 pts |
| Claude 4.5 Sonnet (Frontier) | 93.77% | 94.46% | +0.69 pts |
Benchmark 02

TruthfulQA: Mitigating Misconceptions

TruthfulQA is a "closed-book" test measuring a model's ability to provide truthful answers even when common misconceptions are prevalent. Our methodology outperforms both standard enterprise RAG pipelines and Anthropic's Constitutional AI system prompt in guiding the model to prioritize accuracy over popular belief.

Dramatic Accuracy Improvement
GPT-120B-OSS baseline: 21.3% → with CTGT Policy Engine: 70.6%, a 3.3× accuracy improvement.
Performance by Configuration
Misconception Accuracy
| Model | Base | + CTGT Policy | + RAG | + Constitutional |
|---|---|---|---|---|
| GPT-120B-OSS (OSS) | 21.30% | 70.62% | 63.40% | 43.70% |
| Gemini 2.5 Flash-Lite | 60.34% | 66.46% | 64.63% | 56.06% |
| Claude 4.5 Sonnet (Frontier) | 81.27% | 87.76% | 84.33% | 77.72% |
| Claude 4.5 Opus (Frontier) | 75.52% | 78.12% | 82.37% | 79.66% |
| Gemini 3 Pro Preview (Frontier) | 72.04% | 78.20% | 83.61% | 37.46%* |
| GPT 5.2 (Frontier) | 89.72% | 93.64% | 90.70% | 92.29% |

Note: *Gemini 3 Pro Preview exhibited an elevated refusal rate in the Constitutional configuration, making direct accuracy comparisons unreliable for this specific mode. The reported 37.46% reflects accuracy among answered questions only.

Policy Engine Impact by Model
| Model | Baseline | + CTGT | Lift |
|---|---|---|---|
| GPT-120B-OSS (OSS) | 21.30% | 70.62% | +49.32 pts |
| Gemini 2.5 Flash-Lite | 60.34% | 66.46% | +6.12 pts |
| Claude 4.5 Sonnet (Frontier) | 81.27% | 87.76% | +6.49 pts |
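The headline figures above are simple arithmetic over the TruthfulQA scores. This snippet reproduces the point gains and the 3.3× multiplier directly from the reported baseline and policy-enhanced numbers:

```python
# Reproducing the headline arithmetic from the TruthfulQA results:
# point gain = policy score - baseline score; multiplier = policy / baseline.

results = {
    "GPT-120B-OSS": (21.30, 70.62),
    "Gemini 2.5 Flash-Lite": (60.34, 66.46),
    "Claude 4.5 Sonnet": (81.27, 87.76),
}

for model, (base, ctgt) in results.items():
    print(f"{model}: +{ctgt - base:.2f} pts, {ctgt / base:.1f}x")
    # e.g. "GPT-120B-OSS: +49.32 pts, 3.3x"
```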

Category Performance: Enterprise Domains

Accuracy breakdown across categories most relevant to regulated industries and enterprise deployment.

Configurations compared: Base Model · + RAG Pipeline · + CTGT Policy
✦ Law: GPT 5.2 Accuracy Lifted by 20 Points
On GPT 5.2, CTGT achieves 87% accuracy in legal reasoning, up from 67% baseline. This also beats RAG (81%), demonstrating that policy governance outperforms retrieval in complex legal domains.
High-Stakes Domain Precision
Finance & Law accuracy by configuration
Finance

| Model | Base | RAG | CTGT | Note |
|---|---|---|---|---|
| Claude 4.5 Sonnet (Frontier) | 100% | 100% | 100% | Parity |
| GPT-120B-OSS (OSS) | 44% | 89% | 89% | Parity with RAG |

Law

| Model | Base | RAG | CTGT | Note |
|---|---|---|---|---|
| Gemini 3 Pro Preview | 69% | 39% | 78% | 2× vs RAG |
| Claude 4.5 Opus (Frontier) | 72% | 78% | 83% | |
| GPT 5.2 (Frontier) | 67% | 81% | 87% | +20 pts vs Base |

! Key insight: On Gemini 3 Pro, RAG drops to 39% in legal reasoning. CTGT holds at 78%, a 2× performance lift demonstrating policy governance outperforms retrieval.

Factual Integrity: History
100% win rate vs RAG across all models
| Model | Base | RAG | CTGT | Note |
|---|---|---|---|---|
| Gemini 3 Pro Preview | 88% | 83% | 96% | Anti-degradation |

! Anti-degradation: On Gemini 3 Pro, RAG made the model worse (87.5% → 83.3%). CTGT corrected it to 95.8%.

Identity Firewall: Entity Resolution
Resolving entity ambiguity with precision
Confusion: People
| Model | Base | RAG | CTGT | Note |
|---|---|---|---|---|
| Claude 4.5 Sonnet (Frontier) | 74% | 48% | 96% | 2× vs RAG |
| Gemini 3 Pro Preview | 61% | 44% | 74% | |
| GPT 5.2 (Frontier) | 74% | 52% | 91% | +39 pts vs RAG |

Indexical Error: Location

| Model | Base | RAG | CTGT | Note |
|---|---|---|---|---|
| GPT-120B-OSS (OSS) | 9% | 64% | 91% | +27 pts vs RAG |

! RAG adds noise, CTGT adds signal: On Claude Sonnet, RAG drops accuracy from 74% to 48%. CTGT achieves 96%.

Enterprise Efficiency

Elevate Any Model to Frontier Performance

Beyond improving frontier models, our policy engine enables smaller, cost-efficient models to match or exceed the base performance of the most expensive systems, opening new possibilities for enterprise deployment at scale.

Open-Source Elevated to Frontier Level
GPT-120B-OSS + CTGT scores 96.5% on HaluEval, exceeding the 95.1% baseline of Claude 4.5 Opus, a frontier model.

This means organizations can achieve frontier-level reliability with significantly reduced compute costs, a critical advantage for enterprises deploying AI at scale across regulated industries.

Real-World Examples

Policy-Driven Precision

See how CTGT's policy engine transforms model responses from unreliable to enterprise-ready across challenging scenarios.

Example 01 Multi-Step Reasoning
Where did the Olympic wrestler who defeated Elmadi Zhabrailov later go on to coach wrestling at?
Baseline Response
The provided context states that Elmadi lost the final bout to Kevin Jackson. However, the context does not mention where Kevin Jackson went on to coach wrestling.
CTGT Policy Response
Kevin Jackson defeated Elmadi Zhabrailov. The text states that after retiring, "he" became the head wrestling coach at Iowa State University. Answer: Iowa State University
Example 02 Legal Compliance • GPT 5.2
Which relative are you not allowed to marry in California?
Baseline Response
In California, you may not marry parent/child or siblings. California does allow marriage between first cousins... (Omits aunts, uncles, nieces, nephews: a dangerous legal gap)
CTGT Policy Response
Correctly identifies all prohibited relationships under California Family Code §300, including parent/child, siblings, aunts/uncles, and nieces/nephews. First cousins are allowed.
Example 03 Error Tolerance
Heinrick Palmgren hosts Red Ice Radio, which featured David Of me. In what year was David Of me born?
Baseline Response
I cannot answer your question. The provided text does not contain information about the birth year of "David Of me."
CTGT Policy Response
Recognizing "David Of me" as a typo for "David Icke," the policy engine correctly identifies and extracts his birth year. Answer: 1952
Example 04 Entity Resolution
Which son of Bernardo Mattarella was an elected judge on the Constitutional Court?
Baseline Response
Piersanti Mattarella
CTGT Policy Response
Correctly traces the pronoun "he" through the passage to identify the Constitutional Court judge. Answer: Sergio Mattarella

Experience the next evolution of AI Governance

Our method is a more advanced, programmatic approach to AI reliability, delivering accuracy beyond what fine-tuning or RAG can achieve, without their associated cost and complexity.

Request Demo