InceptBench

InceptBench aims to be a complete, comprehensive method for benchmarking and evaluating educational content generation systems. It is designed to be target-system agnostic and to span the common curricula for K-12 subjects.

While InceptBench is primarily an opinionated, homegrown benchmarking and evaluation framework, we have also included an open-source, commonly accepted methodology, EduBench, to introduce diversity into the benchmark methodologies and evaluations.

To fulfill this, we’ve created a stack of tools and datasets, all rolled into a single SDK.

What is a Benchmark Methodology?

A benchmark methodology is a comprehensive evaluation framework that uses specialized evaluators to assess educational content quality across multiple dimensions. These methodologies define the evaluation approach and criteria, while benchmark results represent the actual scores and outcomes from running these evaluations.

Key Characteristics

  • Target-system agnostic - designed to work with any educational content generator
  • Opinionated - rooted in the pillars of Incept, it evaluates educational content against subject- and grade-specific dimensions
  • Globally applicable - designed to benchmark and evaluate learning content for common subjects in any country, with localization and cultural nuance built into the evaluation

Core Components

  1. Target System - InceptBench is system agnostic and works with any K-12 educational content generation system
  2. Evaluators - for each subject and grade, the InceptBench SDK offers an evaluator that grades the outputs of the target system
  3. EduBench - to offer diversity and an external baseline for comparison, InceptBench also bundles the open-source EduBench benchmark dataset and evaluator
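As a sketch of how these components fit together, an evaluator registry keyed by subject and grade might look like the following. Note that every name here (`EvaluationResult`, `register`, `evaluate`) is illustrative only, not the actual InceptBench SDK API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# NOTE: illustrative sketch only -- these names are assumptions,
# not the real InceptBench SDK API.

@dataclass
class EvaluationResult:
    score: float   # 0.0 - 1.0 quality score
    passed: bool   # success/failure status
    details: str   # detailed evaluator output

# Registry mapping (subject, grade) to a subject/grade-specific evaluator.
Evaluator = Callable[[str], EvaluationResult]
_registry: Dict[Tuple[str, int], Evaluator] = {}

def register(subject: str, grade: int, evaluator: Evaluator) -> None:
    _registry[(subject, grade)] = evaluator

def evaluate(subject: str, grade: int, content: str) -> EvaluationResult:
    """Route generated content to the matching evaluator."""
    return _registry[(subject, grade)](content)

# A toy evaluator: fails empty content, otherwise passes with a fixed score.
register("reading", 9, lambda text: EvaluationResult(
    score=0.9 if text.strip() else 0.0,
    passed=bool(text.strip()),
    details="non-empty reading passage" if text.strip() else "empty content",
))
```

Calling `evaluate("reading", 9, "Sample passage")` returns an `EvaluationResult`; the real SDK dispatches to much richer subject- and grade-specific evaluators.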

API-Based Evaluation

Evaluate educational content using the InceptBench API:

curl -X POST "https://api.inceptapi.com/evaluate" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $INCEPT_API_KEY" \
  -d @qs.json
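Here, `qs.json` holds the questions to evaluate. Its exact schema is not shown on this page, so the field names below are assumptions; a minimal Python sketch of building and submitting the same request with only the standard library might look like:

```python
import json
import os
import urllib.request

# Hypothetical request payload -- the field names are assumptions,
# not a documented InceptBench schema.
payload = {
    "questions": [
        {"id": "q1", "subject": "reading", "grade": 9,
         "prompt": "Which sentence best states the main idea?"},
    ],
}

# Build the POST request; the API key is read from the environment.
request = urllib.request.Request(
    "https://api.inceptapi.com/evaluate",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('INCEPT_API_KEY', '')}",
    },
    method="POST",
)

# response = urllib.request.urlopen(request)  # requires a valid API key
```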

Response includes:

  • Evaluation scores for each question
  • Success/failure status
  • Detailed evaluator outputs
  • Overall quality metrics
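Assuming a response shaped like the fields above (the exact JSON keys are not documented here, so the names below are assumptions), aggregating per-question scores into overall metrics could be sketched as:

```python
# Hypothetical response body -- key names are assumptions based on the
# fields listed above, not a documented InceptBench response schema.
response = {
    "results": [
        {"question_id": "q1", "score": 0.85, "passed": True,
         "evaluator_output": "clear stem, plausible distractors"},
        {"question_id": "q2", "score": 0.60, "passed": False,
         "evaluator_output": "answer key mismatch"},
    ],
}

def summarize(body: dict) -> dict:
    """Compute overall quality metrics from per-question results."""
    results = body["results"]
    scores = [r["score"] for r in results]
    return {
        "mean_score": sum(scores) / len(scores),
        "pass_rate": sum(r["passed"] for r in results) / len(results),
    }
```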

Learn more about evaluators →

Benchmark Results

Results from evaluating various content generation systems using InceptBench v1.3.0.

| Benchmark | Final Score | TI Question QA | Reading QC | Answer Accuracy | Questions | Version |
|---|---|---|---|---|---|---|
| SAT Reading Generation | 0.663 (66%) | 0.619 | 0.712 | 98.7% | 82 (3 sets) | 1.1.5 |
| APLIT Quiz Generation | 0.661 (66%) | 0.832 | 0.612 | 68.3% | 60 (3 sets) | 1.1.7 |
| Reading Question QC | 0.735 (74%) | 0.850 | 0.654 | 92.2% | 51 (1 set) | 1.1.5 |
| Incept Multilingual | 0.828 (83%) | 0.784 | 0.821 | 92.4% | 291 (3 sets) | 1.1.7 |
| ELA Mock FE | 0.835 (83%) | 0.852 | 0.818 | 98.1% | 580 (4 sets) | 1.2.0 |

View full test results and data →