InceptBench UAE Mathematics

A rigorous benchmark methodology, with specialized evaluators, for validating the quality of mathematics questions aligned with the UAE K-12 Mathematics curriculum. Each question is a fully structured item (question, options, correct answer, explanation, difficulty, grade alignment, pedagogy).

Evaluations emphasize Direct Instruction compliance, pedagogical appropriateness, and UAE localization. Redundant correctness checks across two independent evaluators help ensure that questions are correct, grade-appropriate, well-explained, culturally localized for the UAE, and production-ready.

Evaluators

InceptBench UAE Mathematics uses two independent evaluation frameworks to ensure comprehensive quality assessment from multiple perspectives.

Evaluator 1: InceptBench UAE Math Questions Evaluator

A two-submodule framework designed specifically for UAE K-12 Mathematics questions. Each submodule operates autonomously, so the two provide redundant validation of every question.

What It Evaluates:

Evaluates complete educational questions - including question text, answer options, correct answer, explanations, difficulty level, grade alignment, and pedagogical quality. It does NOT evaluate student responses. Think of it as quality control for the questions themselves before they reach students.
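
As a rough illustration of what such a structured item might look like, the sketch below models one as a Python dataclass. The class name `QuestionItem`, the field names, the difficulty labels, and the sample question are all assumptions made for illustration; the source does not specify a schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuestionItem:
    """Hypothetical container for one structured question under evaluation."""
    question: str
    options: List[str]
    correct_answer: str
    explanation: str
    difficulty: str   # assumed labels, e.g. "easy", "medium", "hard"
    grade: int        # target grade level
    pedagogy: str     # assumed free-text tag, e.g. "direct_instruction"

    def answer_in_options(self) -> bool:
        # Basic structural check: the declared answer must be one of the options.
        return self.correct_answer in self.options


# Sample item using UAE-localized context (name, food, currency).
item = QuestionItem(
    question="Fatima buys 3 boxes of dates for AED 12 each. How much does she pay in total?",
    options=["AED 15", "AED 24", "AED 36", "AED 48"],
    correct_answer="AED 36",
    explanation="Multiply the number of boxes by the price of one box: 3 x 12.",
    difficulty="easy",
    grade=3,
    pedagogy="direct_instruction",
)
```

Note that the sample explanation describes the method without repeating the exact option text.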

Submodule 1: Pedagogical Evaluator

The Pedagogical Evaluator is an automated, LLM-based assessment system powered by GPT-4 that evaluates educational questions across 10 distinct dimensions in a single evaluation pass. Unlike iterative approaches, this evaluator simultaneously assesses all dimensions, identifies issues and strengths, and generates actionable improvement suggestions.

10 Evaluation Dimensions

Educational Quality (70% weight):

  1. Correctness & Factual Accuracy - Mathematical accuracy, factual correctness, answer key validity. Auto-reject if answer key invalid.
  2. Grade Level Appropriateness - Complexity and content match target grade level.
  3. Difficulty Consistency - Actual difficulty matches declared level.
  4. Language & Clarity - Grammar, clarity, language appropriateness for grade.
  5. Educational Impact - Learning potential and educational value.
  6. Explanation Quality - Explanations guide learning rather than merely stating answers. Flag “leakage” if the explanation reveals the exact option text.
  7. Instruction Adherence - Adherence to specified requirements and format.
  8. Format & Structure - Structural correctness (MCQ options, answer format).

Direct Instruction Compliance (20% weight):

  9. DI Compliance - Adherence to Direct Instruction principles, scaffolding formats, grade-level language. Weighted: Principles (40%) + Format (35%) + Language (25%). DI is a research-based teaching methodology emphasizing clear, explicit instruction and systematic skill development.

Cultural Relevance (10% weight):

  10. Cultural & Localization Relevance - Names, locations, objects, and contexts culturally appropriate for UAE students. Arabic/GCC names (Ahmed, Fatima, Khalid), UAE locations (Burj Khalifa, Dubai Mall), regional food (dates, kunafa), AED currency, cultural events (Ramadan, Eid), Friday-Saturday weekend.

Scoring Formula

Overall Score = (Educational Quality, Dimensions 1-8 × 70%) + (DI Compliance × 20%) + (Cultural Relevance × 10%)

All dimensions use 0-10 scale, normalized to 0.0-1.0.
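
The weighting above can be sketched in a few lines of Python. This is a minimal sketch, not the evaluator's actual code: the function names are hypothetical, and the source does not say how Dimensions 1-8 are combined, so an unweighted mean is assumed here. The DI sub-weights (40/35/25) come from the DI Compliance dimension.

```python
from statistics import mean
from typing import Sequence


def di_score(principles: float, fmt: float, language: float) -> float:
    """DI Compliance on a 0-10 scale: Principles 40% + Format 35% + Language 25%."""
    return 0.40 * principles + 0.35 * fmt + 0.25 * language


def overall_score(academic: Sequence[float], di: float, cultural: float) -> float:
    """Overall score in 0.0-1.0 from 0-10 inputs.

    ASSUMPTION: the eight academic dimensions are averaged with equal weight.
    """
    if len(academic) != 8:
        raise ValueError("expected scores for dimensions 1-8")
    academic_norm = mean(academic) / 10   # normalize 0-10 to 0.0-1.0
    return 0.70 * academic_norm + 0.20 * (di / 10) + 0.10 * (cultural / 10)


# Illustrative (made-up) scores for a single question.
academic = [9, 8, 8, 9, 7, 8, 9, 10]          # dimensions 1-8
di = di_score(principles=8, fmt=7, language=9)
score = overall_score(academic, di, cultural=10)
print(round(score, 3))  # 0.853
```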

Quality Recommendations

  • ACCEPT: Correctness ≥ 0.6, Format ≥ 0.6, DI ≥ 0.7, Query Relevance ≥ 0.7, Overall ≥ 0.7, No critical issues → Ready for production.
  • REVISE: Structure OK and answer correct, minor issues (topic drift, weak explanation, minor format flaws) → Needs refinement.
  • REJECT: Answer mapping error, correct answer not present, Query Relevance < 0.4, Correctness < 0.4, Format < 0.4, DI < 0.3 → Requires complete regeneration.
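
One way to read the three recommendations above is as a threshold cascade. The sketch below uses the thresholds listed, but the `Scores` container, its field names, the order of the checks (REJECT first), and the choice to treat everything between the two extremes as REVISE are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical bundle of normalized (0.0-1.0) scores and flags."""
    correctness: float
    fmt: float
    di: float
    query_relevance: float
    overall: float
    answer_mapping_error: bool = False
    correct_answer_present: bool = True
    critical_issues: bool = False


def recommend(s: Scores) -> str:
    # Any single REJECT condition forces regeneration.
    if (s.answer_mapping_error or not s.correct_answer_present
            or s.query_relevance < 0.4 or s.correctness < 0.4
            or s.fmt < 0.4 or s.di < 0.3):
        return "REJECT"
    # All ACCEPT conditions must hold together.
    if (s.correctness >= 0.6 and s.fmt >= 0.6 and s.di >= 0.7
            and s.query_relevance >= 0.7 and s.overall >= 0.7
            and not s.critical_issues):
        return "ACCEPT"
    # ASSUMPTION: anything in between is treated as needing refinement.
    return "REVISE"


print(recommend(Scores(0.9, 0.8, 0.75, 0.9, 0.82)))  # ACCEPT
print(recommend(Scores(0.9, 0.8, 0.50, 0.9, 0.78)))  # REVISE
print(recommend(Scores(0.3, 0.8, 0.75, 0.9, 0.60)))  # REJECT
```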

Submodule 2: Answer Verification

Answer Verification is an independent correctness validation submodule powered by GPT-5. It operates separately from the Pedagogical Evaluator to provide redundant correctness checking, so that an incorrect question can still be caught even if the Pedagogical Evaluator misses the error.

What It Validates

  • Mathematical Accuracy: Verifies calculations, formulas, and mathematical logic
  • Answer Key Validity: Confirms the declared correct answer is actually correct
  • Calculation Correctness: Validates all computational steps
  • Logic Soundness: Ensures reasoning and problem-solving approaches are valid

Output

Returns is_correct (true/false) and confidence (0-10) per question.

Redundant Safety Net

Both Answer Verification and the Pedagogical Evaluator’s Correctness dimension must confirm correctness before accepting a question. This dual-validation approach significantly reduces the risk of incorrect content reaching students.
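
The dual check might be expressed as follows. This is a sketch under stated assumptions: `VerificationResult` mirrors the is_correct and confidence outputs of Answer Verification, while the 0.6 cutoff for the Correctness dimension is borrowed from the ACCEPT criteria and the `min_confidence` parameter is a hypothetical knob that the source does not mention.

```python
from dataclasses import dataclass


@dataclass
class VerificationResult:
    is_correct: bool
    confidence: int  # 0-10, as returned by Answer Verification


def correctness_confirmed(
    verification: VerificationResult,
    pedagogical_correctness: float,   # Correctness dimension, normalized 0.0-1.0
    min_correctness: float = 0.6,     # ASSUMPTION: reuse the ACCEPT threshold
    min_confidence: int = 0,          # ASSUMPTION: optional confidence floor
) -> bool:
    """True only when both independent checks agree the question is correct."""
    verifier_ok = verification.is_correct and verification.confidence >= min_confidence
    pedagogical_ok = pedagogical_correctness >= min_correctness
    return verifier_ok and pedagogical_ok


# A question the Pedagogical Evaluator scored highly but the verifier rejected.
print(correctness_confirmed(VerificationResult(False, 9), 0.95))  # False
```

Requiring agreement means either check can block a question on its own, which is what makes the second check a safety net rather than a tiebreaker.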

Quality Decision Logic

Decision Criteria
  • ACCEPT if Pedagogical ≥ 85%
  • REVISE for minor issues
  • REJECT for critical failures

Both submodules must align for production deployment.

Evaluator 2: EduBench

A comprehensive educational content evaluator that assesses questions across 5 key metrics, providing detailed scores and rationales. Powered by GPT-5 with curriculum-aligned validation.

What It Evaluates:

EduBench automatically classifies content type (Question, Quiz, Reading Passage, or Other) and applies specialized evaluation criteria for each. For mathematics questions, it focuses on curriculum alignment, pedagogical quality, and educational effectiveness.

The 5 Evaluation Metrics

  1. Curriculum Alignment (0-5) - Evaluates how well the question aligns with educational standards and learning objectives, and whether it stays within assessment boundaries.

  2. Pedagogical Quality (0-5) - Assesses instructional effectiveness, conceptual depth, and learning support. Evaluates whether the question promotes understanding, uses appropriate instructional strategies, and supports student learning.

  3. Clarity & Accessibility (0-5) - Evaluates question clarity, language appropriateness, and accessibility. Ensures questions are free from ambiguity, use grade-appropriate vocabulary, and are accessible to diverse learners.

  4. Accuracy & Rigor (0-5) - Verifies factual correctness, mathematical accuracy, and intellectual rigor. Checks for errors, validates answer keys, and ensures appropriate challenge level.

  5. Engagement & Relevance (0-5) - Measures student engagement potential and real-world relevance. Evaluates whether questions are interesting, motivating, and connected to meaningful contexts.

Overall Rating

EXEMPLARY (4.0-5.0): Exceptional quality across all metrics

ADEQUATE (3.0-3.9): Meets standards with minor areas for improvement

INFERIOR (0-2.9): Significant quality issues requiring revision
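
The rating bands can be illustrated with a short sketch. The source does not state how the five metric scores are combined into an overall value, so an unweighted mean is assumed here; the metric keys and the function name are likewise hypothetical. Because the published bands leave a gap between 3.9 and 4.0, this sketch treats any average below 4.0 (and at least 3.0) as ADEQUATE.

```python
from statistics import mean
from typing import Dict, Tuple

# Hypothetical keys for the five EduBench metrics (each scored 0-5).
METRICS = (
    "curriculum_alignment",
    "pedagogical_quality",
    "clarity_accessibility",
    "accuracy_rigor",
    "engagement_relevance",
)


def edubench_rating(scores: Dict[str, float]) -> Tuple[float, str]:
    """Return (average, rating). ASSUMPTION: overall = unweighted mean."""
    avg = mean(scores[m] for m in METRICS)
    if avg >= 4.0:
        return avg, "EXEMPLARY"
    if avg >= 3.0:
        return avg, "ADEQUATE"
    return avg, "INFERIOR"


# Illustrative (made-up) metric scores for one question.
sample = dict(zip(METRICS, [4, 4, 5, 5, 3]))
print(edubench_rating(sample))  # (4.2, 'EXEMPLARY')
```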

Why Two Evaluators?

Using two independent evaluation frameworks provides:

  • Redundant Validation: Critical issues are caught by multiple systems
  • Diverse Perspectives: UAE-specific evaluation + general educational best practices
  • Comprehensive Coverage: 10 dimensions (InceptBench) + 5 metrics (EduBench) = 15 unique quality checks
  • Higher Confidence: Questions must pass both evaluators for production deployment

Benchmark Results

Model | EduBench Score | Pedagogical Score | Answer Correctness | Latency
Coming Soon | - | - | - | -