Benchmarks | CTGT Policy Engine
Performance Benchmarks

The frontier of AI Governance.

Benchmark results demonstrate how CTGT's policy engine dramatically improves AI accuracy and sharply reduces hallucinations, outperforming RAG pipelines and prompt-engineering approaches across every model tested.

3.3×
Accuracy Multiplier
+49pt
Truthfulness Gain
96.5%
Hallucination Prevention

Our policy engine acts as a compiler for AI governance. It takes general goals and compiles them into specific, constrained reasoning processes for the LLM to execute. This is a fundamental layer of control that standard RAG pipelines and base model APIs lack, which is why our approach unlocks superior performance in accuracy and reliability.
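To make the "compiler" analogy concrete, here is an illustrative sketch only: CTGT's actual engine is proprietary, and the `Policy` type and `compile_policy` function below are hypothetical names. The sketch shows the general idea of compiling a broad governance goal into explicit, checkable constraints on the model's reasoning.

```python
# Hypothetical sketch of the "policy as compiler" idea; not CTGT's real API.
from dataclasses import dataclass

@dataclass
class Policy:
    goal: str                # general governance goal
    constraints: list[str]   # specific rules derived from that goal

def compile_policy(policy: Policy) -> str:
    """Compile a general goal into concrete reasoning constraints for the model."""
    rules = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(policy.constraints))
    return (
        f"Goal: {policy.goal}\n"
        "Before answering, verify each step against these rules:\n"
        f"{rules}\n"
        "If any rule cannot be satisfied, say so instead of guessing."
    )

policy = Policy(
    goal="Answer only from verifiable evidence.",
    constraints=[
        "Cite the passage that supports each claim.",
        "Treat unstated details as unknown, not as defaults.",
    ],
)
print(compile_policy(policy))
```

The point of the sketch is the separation of concerns: the goal is stated once, while the compiled constraints govern every individual model call.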

Benchmark 01

HaluEval: Minimizing Hallucinations

HaluEval measures a model's ability to identify and avoid generating false or fabricated information. We compare our policy engine against baseline models, a standard enterprise RAG pipeline, and Anthropic's Constitutional AI system prompt to demonstrate consistent improvement across approaches.
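The scoring behind these numbers reduces to labeled-example accuracy. The sketch below shows the shape of a HaluEval-style evaluation loop; the `judge` function is a toy stand-in for a real model call, and the example data is invented for illustration.

```python
# Minimal sketch of HaluEval-style scoring: each example pairs a model judgment
# ("hallucinated" / "faithful") with a gold label; the score is the fraction correct.

def score(examples, judge):
    correct = sum(1 for ex in examples if judge(ex["answer"]) == ex["label"])
    return correct / len(examples)

# Toy stand-in judge: flags answers carrying an unsupported-content marker.
examples = [
    {"answer": "Supported by the context.", "label": "faithful"},
    {"answer": "[unsupported] Invented detail.", "label": "hallucinated"},
    {"answer": "[unsupported] Fabricated date.", "label": "hallucinated"},
]
judge = lambda a: "hallucinated" if "[unsupported]" in a else "faithful"
print(f"{score(examples, judge):.1%}")  # prints "100.0%" on this toy data
```

In the real benchmark the judge is the model under test and the percentages in the tables below are this same accuracy computed over the full HaluEval set.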

Policy Engine Performance Lift
Average baseline across all models: 93.9% → with CTGT Policy Engine: 95.3%, a consistent improvement on every model tested.
- Base: unmodified model with standard prompting
- + CTGT Policy: CTGT's policy engine applied
- + RAG: standard enterprise RAG pipeline
- + Constitutional: Anthropic's Constitutional AI system prompt
Performance by Configuration
4 Approaches Compared
| Model | Base | + CTGT Policy | + RAG | + Constitutional |
|---|---|---|---|---|
| GPT-120B-OSS (OSS) | 92.68% | 96.50% | 92.31% | 87.50% |
| Gemini 2.5 Flash-Lite | 91.96% | 93.77% | 79.18% | 82.14% |
| Claude 4.5 Sonnet (Frontier) | 93.77% | 94.46% | 84.88% | 67.57% |
| Claude 4.5 Opus (Frontier) | 95.08% | 95.30% | 90.87% | 77.92% |
| Gemini 3 Pro Preview (Frontier) | 95.94% | 96.44% | 86.63% | 56.10% |
Baseline vs. CTGT Policy-Enhanced
| Model | Baseline | + CTGT | Lift |
|---|---|---|---|
| GPT-120B-OSS (OSS) | 92.68% | 96.50% | +3.82 pts |
| Gemini 2.5 Flash-Lite | 91.96% | 93.77% | +1.81 pts |
| Claude 4.5 Sonnet (Frontier) | 93.77% | 94.46% | +0.69 pts |
Benchmark 02

TruthfulQA: Mitigating Misconceptions

TruthfulQA is a "closed-book" test measuring a model's ability to provide truthful answers even when common misconceptions are prevalent. Our methodology outperforms both standard enterprise RAG pipelines and Anthropic's Constitutional AI system prompt in guiding the model to prioritize accuracy over popular belief.

Dramatic Accuracy Improvement
GPT-120B-OSS baseline: 21.3% → with CTGT Policy Engine: 70.6%, a 3.3× accuracy improvement.
Performance by Configuration
Misconception Accuracy
| Model | Base | + CTGT Policy | + RAG | + Constitutional |
|---|---|---|---|---|
| GPT-120B-OSS (OSS) | 21.30% | 70.62% | 63.40% | 43.70% |
| Gemini 2.5 Flash-Lite | 60.34% | 66.46% | 64.63% | 56.06% |
| Claude 4.5 Sonnet (Frontier) | 81.27% | 87.76% | 84.33% | 77.72% |
| Claude 4.5 Opus (Frontier) | 75.52% | 78.12% | 82.37% | 79.66% |
| Gemini 3 Pro Preview (Frontier) | 72.04% | 78.20% | 83.61% | 37.46%* |
| GPT 5.2 (Frontier) | 89.72% | 93.64% | 90.70% | 92.29% |

Note: *Gemini 3 Pro Preview exhibited an elevated refusal rate in the Constitutional configuration, making direct accuracy comparisons unreliable for this specific mode. The reported 37.46% reflects accuracy among answered questions only.

Policy Engine Impact by Model
| Model | Baseline | + CTGT | Lift |
|---|---|---|---|
| GPT-120B-OSS (OSS) | 21.30% | 70.62% | +49.32 pts |
| Gemini 2.5 Flash-Lite | 60.34% | 66.46% | +6.12 pts |
| Claude 4.5 Sonnet (Frontier) | 81.27% | 87.76% | +6.49 pts |
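The headline figures above are simple arithmetic over the TruthfulQA scores. This snippet reproduces the point gains and the 3.3× multiplier directly from the reported baseline and policy-enhanced numbers:

```python
# Reproducing the headline arithmetic from the TruthfulQA results:
# point gain = policy score - baseline score; multiplier = policy / baseline.

results = {
    "GPT-120B-OSS": (21.30, 70.62),
    "Gemini 2.5 Flash-Lite": (60.34, 66.46),
    "Claude 4.5 Sonnet": (81.27, 87.76),
}

for model, (base, ctgt) in results.items():
    print(f"{model}: +{ctgt - base:.2f} pts, {ctgt / base:.1f}x")
    # e.g. "GPT-120B-OSS: +49.32 pts, 3.3x"
```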

Category Performance: Enterprise Domains

Accuracy breakdown across categories most relevant to regulated industries and enterprise deployment.

Configurations compared: Base Model · + RAG Pipeline · + CTGT Policy
✦ Law: GPT 5.2 Accuracy Lifted by 20 Points
On GPT 5.2, CTGT achieves 87% accuracy in legal reasoning, up from 67% baseline. This also beats RAG (81%), demonstrating that policy governance outperforms retrieval in complex legal domains.
High-Stakes Domain Precision
Finance & Law accuracy by configuration
Finance

| Model | Base | RAG | CTGT | Note |
|---|---|---|---|---|
| Claude 4.5 Sonnet (Frontier) | 100% | 100% | 100% | Parity |
| GPT-120B-OSS (OSS) | 44% | 89% | 89% | Parity with RAG |

Law

| Model | Base | RAG | CTGT | Note |
|---|---|---|---|---|
| Gemini 3 Pro Preview | 69% | 39% | 78% | 2× vs RAG |
| Claude 4.5 Opus (Frontier) | 72% | 78% | 83% | |
| GPT 5.2 (Frontier) | 67% | 81% | 87% | +20 pts vs Base |

! Key insight: On Gemini 3 Pro, RAG drops to 39% in legal reasoning. CTGT holds at 78%, a 2× performance lift demonstrating policy governance outperforms retrieval.

Factual Integrity: History
100% win rate vs RAG across all models
| Model | Base | RAG | CTGT | Note |
|---|---|---|---|---|
| Gemini 3 Pro Preview | 88% | 83% | 96% | Anti-degradation |

! Anti-degradation: On Gemini 3 Pro, RAG made the model worse (87.5% → 83.3%). CTGT corrected it to 95.8%.

Identity Firewall: Entity Resolution
Resolving entity ambiguity with precision
Confusion: People
| Model | Base | RAG | CTGT | Note |
|---|---|---|---|---|
| Claude 4.5 Sonnet (Frontier) | 74% | 48% | 96% | 2× vs RAG |
| Gemini 3 Pro Preview | 61% | 44% | 74% | |
| GPT 5.2 (Frontier) | 74% | 52% | 91% | +39 pts vs RAG |

Indexical Error: Location

| Model | Base | RAG | CTGT | Note |
|---|---|---|---|---|
| GPT-120B-OSS (OSS) | 9% | 64% | 91% | +27 pts vs RAG |

! RAG adds noise, CTGT adds signal: On Claude Sonnet, RAG drops accuracy from 74% to 48%. CTGT achieves 96%.

Enterprise Efficiency

Elevate Any Model to Frontier Performance

Beyond improving frontier models, our policy engine enables smaller, cost-efficient models to match or exceed the base performance of the most expensive systems, opening new possibilities for enterprise deployment at scale.

Open-Source Elevated to Frontier Level
GPT-120B-OSS + CTGT scores 96.5% on HaluEval, exceeding the 95.1% baseline of Claude 4.5 Opus, a frontier model.

This means organizations can achieve frontier-level reliability with significantly reduced compute costs, a critical advantage for enterprises deploying AI at scale across regulated industries.

Real-World Examples

Policy-Driven Precision

See how CTGT's policy engine transforms model responses from unreliable to enterprise-ready across challenging scenarios.

Example 01 Multi-Step Reasoning
Where did the Olympic wrestler who defeated Elmadi Zhabrailov later go on to coach wrestling at?
Baseline Response
The provided context states that Elmadi lost the final bout to Kevin Jackson. However, the context does not mention where Kevin Jackson went on to coach wrestling.
CTGT Policy Response
Kevin Jackson defeated Elmadi Zhabrailov. The text states that after retiring, "he" became the head wrestling coach at Iowa State University. Answer: Iowa State University
Example 02 Legal Compliance • GPT 5.2
Which relative are you not allowed to marry in California?
Baseline Response
In California, you may not marry parent/child or siblings. California does allow marriage between first cousins... (Omits aunts, uncles, nieces, nephews: a dangerous legal gap)
CTGT Policy Response
Correctly identifies all prohibited relationships under California Family Code §300, including parent/child, siblings, aunts/uncles, and nieces/nephews. First cousins are allowed.
Example 03 Error Tolerance
Heinrick Palmgren hosts Red Ice Radio, which featured David Of me. In what year was David Of me born?
Baseline Response
I cannot answer your question. The provided text does not contain information about the birth year of "David Of me."
CTGT Policy Response
Recognizing "David Of me" as a typo for "David Icke," the policy engine correctly identifies and extracts his birth year. Answer: 1952
Example 04 Entity Resolution
Which son of Bernardo Mattarella was an elected judge on the Constitutional Court?
Baseline Response
Piersanti Mattarella
CTGT Policy Response
Correctly traces the pronoun "he" through the passage to identify the Constitutional Court judge. Answer: Sergio Mattarella

Experience the next evolution of AI Governance

Our method is a more advanced, programmatic approach to AI reliability, delivering accuracy beyond what fine-tuning or RAG can achieve, without their associated cost and complexity.

Request Demo