InceptBench
InceptBench aims to be the most complete and comprehensive framework for benchmarking and evaluating educational content generation systems. It is designed to be target-system agnostic and to span the common curricula for all subjects in K-12 education.
While InceptBench is primarily an opinionated, homegrown benchmarking and evaluation framework, we have also included an open source and commonly accepted methodology, EduBench, to introduce diversity into the benchmark methodologies and evaluations.
To support this, we have built a stack of tools and datasets, all rolled into a single SDK.
What is a Benchmark Methodology?
A benchmark methodology is a comprehensive evaluation framework that uses specialized evaluators to assess educational content quality across multiple dimensions. These methodologies define the evaluation approach and criteria, while benchmark results represent the actual scores and outcomes from running these evaluations.
Key Characteristics
- Target-system agnostic - Designed to work with any educational content generator
- Opinionated - Rooted strongly in the pillars of Incept; evaluates educational content against a list of subject- and grade-specific dimensions
- Globally applicable - Designed to benchmark and evaluate learning content for common subjects in any country, while strongly incorporating localization and cultural nuances into the evaluation
Core Components
- Target System - InceptBench is system agnostic and works against any K-12 educational content generation system
- Evaluators - For each subject and grade, the InceptBench SDK offers an evaluator to grade the outputs of the target system (see the sketch after this list)
- EduBench - To offer diversity and an external baseline for comparison, InceptBench also bundles the open source EduBench benchmark dataset and evaluator
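As an illustration of how these components fit together, here is a minimal sketch of what an SDK session could look like. The module path `inceptbench`, the `get_evaluator` factory, and the result fields are hypothetical names chosen for illustration, not the documented SDK surface.

```python
# Hypothetical sketch of an InceptBench SDK session.
# The import path, factory, and result fields below are assumptions
# for illustration, not the documented API.
from inceptbench import get_evaluator  # assumed import path

# Select an evaluator for a specific subject and grade (assumed factory).
evaluator = get_evaluator(subject="reading", grade=8)

# Output from any target system; InceptBench is target-system agnostic.
generated_question = {
    "stem": "Which sentence best states the main idea of the passage?",
    "choices": ["A ...", "B ...", "C ...", "D ..."],
    "answer": "B",
}

# Grade the output against subject- and grade-specific dimensions.
result = evaluator.evaluate(generated_question)
print(result.score, result.passed)  # assumed result fields
```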
API-Based Evaluation
Evaluate educational content using the InceptBench API:
```bash
curl -X POST "https://api.inceptapi.com/evaluate" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $INCEPT_API_KEY" \
  -d @qs.json
```
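For convenience, here is a minimal Python equivalent of the curl call above, assuming the `requests` library. The endpoint and Bearer authorization come from the example; the contents of `qs.json` are not specified here, so the payload is loaded as-is.

```python
import json
import os

import requests

API_URL = "https://api.inceptapi.com/evaluate"
API_KEY = os.environ["INCEPT_API_KEY"]  # same key the curl example uses

# Load the payload the curl example sends with `-d @qs.json`.
with open("qs.json") as f:
    payload = json.load(f)

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,  # sets Content-Type: application/json
)
response.raise_for_status()
print(json.dumps(response.json(), indent=2))
```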
Response includes:
- Evaluation scores for each question
- Success/failure status
- Detailed evaluator outputs
- Overall quality metrics
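Continuing from the request above, here is a sketch of how a client might consume those fields. The key names (`results`, `score`, `passed`, `evaluator_output`, `overall`) are assumptions chosen to mirror the bullets, not a published response schema.

```python
# Key names below are assumptions mirroring the documented bullets,
# not a published response schema.
data = response.json()

for item in data.get("results", []):   # per-question results (assumed key)
    print(
        item.get("score"),             # evaluation score
        item.get("passed"),            # success/failure status
        item.get("evaluator_output"),  # detailed evaluator output
    )

print(data.get("overall"))             # overall quality metrics
```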
Benchmark Results
Results from evaluating various content generation systems using InceptBench (current release v1.3.0); the Version column records the InceptBench version used for each run.
| Benchmark | Final Score | TI Question QA | Reading QC | Answer Accuracy | Questions | Version |
|---|---|---|---|---|---|---|
| SAT Reading Generation | 0.663 (66%) | 0.619 | 0.712 | 98.7% | 82 (3 sets) | 1.1.5 |
| APLIT Quiz Generation | 0.661 (66%) | 0.832 | 0.612 | 68.3% | 60 (3 sets) | 1.1.7 |
| Reading Question QC | 0.735 (74%) | 0.850 | 0.654 | 92.2% | 51 (1 set) | 1.1.5 |
| Incept Multilingual | 0.828 (83%) | 0.784 | 0.821 | 92.4% | 291 (3 sets) | 1.1.7 |
| ELA Mock FE | 0.835 (83%) | 0.852 | 0.818 | 98.1% | 580 (4 sets) | 1.2.0 |