InceptBench
InceptBench aims to be the most complete and comprehensive framework for benchmarking and evaluating educational content generation systems. It is designed to be target-system agnostic and to span the common curricula for all subjects in K-12 education.
While InceptBench is primarily an opinionated, homegrown benchmarking and evaluation framework, we have also included an open source and commonly accepted methodology, EduBench, to introduce diversity into the benchmark methodologies and evaluations.
To support this, we have built a stack of tools and datasets, all rolled into a single SDK.
What is a Benchmark Methodology?
A benchmark methodology is a comprehensive evaluation framework that uses specialized evaluators to assess educational content quality across multiple dimensions. These methodologies define the evaluation approach and criteria, while benchmark results represent the actual scores and outcomes from running these evaluations.
Key Characteristics
- Target-system agnostic - Designed to work with any educational content generator
- Opinionated - Rooted strongly in the pillars of Incept; evaluates educational content against a list of subject- and grade-specific dimensions
- Globally applicable - Designed to benchmark and evaluate learning content for common subjects in any country, while strongly incorporating localization and cultural nuances into the evaluation
Core Components
- Target System - InceptBench is system agnostic and works against any K-12 educational content generation system
- Evaluators - For each subject and grade, the InceptBench SDK offers an evaluator to grade the outputs of the target system (see the sketch after this list)
- EduBench - To offer diversity and an external baseline for comparison, InceptBench also bundles the open source EduBench benchmark dataset and evaluator
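As an illustration of how these components fit together, here is a minimal sketch of what an SDK session could look like. The module path `inceptbench`, the `get_evaluator` factory, and the result fields are hypothetical names chosen for illustration, not the documented SDK surface.

```python
# Hypothetical sketch of an InceptBench SDK session.
# The import path, factory, and result fields below are assumptions
# for illustration, not the documented API.
from inceptbench import get_evaluator  # assumed import path

# Select an evaluator for a specific subject and grade (assumed factory).
evaluator = get_evaluator(subject="reading", grade=8)

# Output from any target system; InceptBench is target-system agnostic.
generated_question = {
    "stem": "Which sentence best states the main idea of the passage?",
    "choices": ["A ...", "B ...", "C ...", "D ..."],
    "answer": "B",
}

# Grade the output against subject- and grade-specific dimensions.
result = evaluator.evaluate(generated_question)
print(result.score, result.passed)  # assumed result fields
```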
API-Based Evaluation
Evaluate educational content using the InceptBench API:
```bash
curl -X POST "https://api.inceptapi.com/evaluate" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $INCEPT_API_KEY" \
  -d @qs.json
```
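For convenience, here is a minimal Python equivalent of the curl call above, assuming the `requests` library. The endpoint and Bearer authorization come from the example; the contents of `qs.json` are not specified here, so the payload is loaded as-is.

```python
import json
import os

import requests

API_URL = "https://api.inceptapi.com/evaluate"
API_KEY = os.environ["INCEPT_API_KEY"]  # same key the curl example uses

# Load the payload the curl example sends with `-d @qs.json`.
with open("qs.json") as f:
    payload = json.load(f)

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,  # sets Content-Type: application/json
)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))
```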
Response includes:
- Evaluation scores for each question
- Success/failure status
- Detailed evaluator outputs
- Overall quality metrics
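Continuing from the request above, here is a sketch of how a client might consume those fields. The key names (`results`, `score`, `passed`, `evaluator_output`, `overall`) are assumptions chosen to mirror the bullets, not a published response schema.

```python
# Key names below are assumptions mirroring the documented bullets,
# not a published response schema.
data = response.json()

for item in data.get("results", []):   # per-question results (assumed key)
    print(
        item.get("score"),             # evaluation score
        item.get("passed"),            # success/failure status
        item.get("evaluator_output"),  # detailed evaluator output
    )

print(data.get("overall"))             # overall quality metrics
```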
Benchmark Results
Results from evaluating various content generation systems using InceptBench (current release v1.3.0); the Version column records the InceptBench version used for each run.
| Benchmark | Final Score | TI Question QA | Reading QC | Answer Accuracy | Questions | Version |
|---|---|---|---|---|---|---|
| SAT Reading Generation | 0.663 (66%) | 0.619 | 0.712 | 98.7% | 82 (3 sets) | 1.1.5 |
| APLIT Quiz Generation | 0.661 (66%) | 0.832 | 0.612 | 68.3% | 60 (3 sets) | 1.1.7 |
| Reading Question QC | 0.735 (74%) | 0.850 | 0.654 | 92.2% | 51 (1 set) | 1.1.5 |
| Incept Multilingual | 0.828 (83%) | 0.784 | 0.821 | 92.4% | 291 (3 sets) | 1.1.7 |
| ELA Mock FE | 0.835 (83%) | 0.852 | 0.818 | 98.1% | 580 (4 sets) | 1.2.0 |