InceptBench U.S. Reading Comprehension

A comprehensive benchmark methodology for validating reading comprehension assessment items aligned with U.S. K-12 Reading standards and Common Core State Standards (CCSS).

The QC pipeline evaluates existing reading comprehension questions in multiple-choice (MCQ) and matching-pair (MP) formats. A pipeline of 10-11 checks verifies distractor quality, standard alignment, passage accuracy, and explanation quality.

Quality Control Pipeline

A comprehensive quality control system for reading comprehension assessment items (MCQ and MP question types) that validates questions before they reach students.

Pipeline Architecture

The QC pipeline evaluates questions through two sequential stages:

Input: CSV with questions (with or without explanations)

Stage 1: Question Quality Control

Distractor Checks (5-6):
- Grammatical Parallelism
- Plausibility
- Homogeneity
- Specificity Balance
- Too-Close Detection
- Length Balance (MCQ only)

Question Checks (5):
- Standard Alignment
- Clarity & Precision
- Single Correct Answer
- Passage Reference Accuracy
- Difficulty Assessment

Stage 2: Explanation Quality Control (if explanations present)

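- Correctness checks (3), applied to correct-answer explanations
- Distractor checks (6), applied to distractor explanations
- Universal checks (3), applied to every explanation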

Output: JSON results and a summary report
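The stage selection described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the CSV column names (`question_id`, `question_type`, `explanation`) and the check identifiers are assumptions chosen for the example.

```python
import csv
import io

# Check identifiers are illustrative names for the checks listed above.
DISTRACTOR_CHECKS = [
    "grammatical_parallelism",
    "plausibility",
    "homogeneity",
    "specificity_balance",
    "too_close",
]
QUESTION_CHECKS = [
    "standard_alignment",
    "clarity_precision",
    "single_correct_answer",
    "passage_reference",
    "difficulty_assessment",
]


def plan_question_checks(question_type: str) -> list[str]:
    """Return the Stage 1 checks for one question (11 for MCQ, 10 for MP)."""
    checks = list(DISTRACTOR_CHECKS)
    if question_type.upper() == "MCQ":
        checks.append("length_balance")  # Length Balance applies to MCQ only
    return checks + QUESTION_CHECKS


def plan_stages(row: dict) -> dict:
    """Decide which stages apply to one CSV row (column names are assumed)."""
    has_explanation = bool((row.get("explanation") or "").strip())
    return {
        "question_checks": plan_question_checks(row["question_type"]),
        "run_explanation_qc": has_explanation,  # Stage 2 only if present
    }


# Hypothetical two-row input: one MCQ with an explanation, one MP without.
sample = io.StringIO(
    "question_id,question_type,explanation\n"
    "q1,MCQ,Because the passage says so\n"
    "q2,MP,\n"
)
plans = {row["question_id"]: plan_stages(row) for row in csv.DictReader(sample)}
```

Here `plans["q1"]` lists 11 question checks and enables Stage 2, while `plans["q2"]` lists 10 and skips it.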

Question Quality Control

Evaluates reading comprehension questions across 10-11 quality checks (11 for MCQ items, 10 for MP items, because Length Balance applies to MCQ only), divided into distractor validation and question validation.

Distractor Checks (5-6)

1. Grammatical Parallelism

Ensures consistent grammatical structure across all answer options
All options must follow the same grammatical pattern (e.g., all complete sentences, all noun phrases)

2. Plausibility

All incorrect options (distractors) must be believable to students who haven’t mastered the skill
Avoids obviously wrong or implausible distractors

3. Homogeneity

All options belong to the same conceptual category
Prevents “odd one out” scenarios where one option is fundamentally different

4. Specificity Balance

Similar detail levels across all options
Prevents correct answer from being notably more specific or vague than distractors

5. Too-Close Detection

No distractors semantically too similar to the correct answer
Prevents unintended ambiguity where multiple answers could be considered correct

6. Length Balance (MCQ only)

Word counts appropriately balanced across options
Prevents patterns where correct answer is consistently longest/shortest
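A length check of this kind could look roughly like the sketch below. It is an assumption-laden illustration: the function name, the per-item comparison against the mean distractor length, and the `max_ratio` threshold of 1.5 are all hypothetical, and the pipeline's real criterion may differ (for example, it may look at patterns across many items).

```python
def length_balance(options: list[str], correct_index: int,
                   max_ratio: float = 1.5) -> dict:
    """Flag an item whose correct answer is much longer or shorter
    than its distractors, measured in words.

    max_ratio is an assumed threshold, not a value from the pipeline.
    """
    counts = [len(option.split()) for option in options]
    correct = counts[correct_index]
    others = [c for i, c in enumerate(counts) if i != correct_index]
    if not others:
        raise ValueError("an MCQ item needs at least one distractor")
    mean_other = sum(others) / len(others)
    too_long = correct > max_ratio * mean_other
    too_short = correct * max_ratio < mean_other
    return {"word_counts": counts, "passed": not (too_long or too_short)}


# The correct answer (index 3) is far longer than the distractors.
unbalanced = length_balance(
    ["A dog", "A cat", "A bird",
     "A very large animal that lives in the forest"],
    correct_index=3,
)

# All options have similar word counts.
balanced = length_balance(
    ["The dog barked", "The cat slept", "The bird sang",
     "The fish swam away"],
    correct_index=3,
)
```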

Question Checks (5)

7. Standard Alignment

Question assesses the assigned Common Core State Standard (CCSS)
Content and cognitive demands match the standard’s requirements

8. Clarity & Precision

Clear, unambiguous wording with no vague language
Grade-appropriate vocabulary and sentence structure

9. Single Correct Answer

Exactly one defensible correct answer
No ambiguity that could justify alternative answers

10. Passage Reference Accuracy

All references to the reading passage are valid and accurate
Questions accurately reflect passage content

11. Difficulty Assessment

Question difficulty is appropriate for the target grade level
Assessed against benchmark questions from the same grade

Explanation Quality Control

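Validates explanations for both correct answers and distractors. Each correct-answer explanation goes through 6 checks and each distractor explanation through 9, drawn from 12 distinct checks (3 correctness, 6 distractor, 3 universal).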

For Correct Answers (6 checks)

1. Correctness Explanation

Clearly explains why the answer is correct
Connects to the reading passage evidence

2. Textual Evidence

Provides specific references to passage content
Shows where in the text the answer is supported

3. Skill Reinforcement

Reinforces the reading comprehension skill being assessed
Helps students understand the thinking process

4. Tone

Supportive and encouraging tone appropriate for the grade level
Avoids condescension or overly technical language

5. Conciseness

Clear and to the point without unnecessary verbosity
Respects student attention span

6. Grade Appropriateness

Vocabulary and complexity match target grade level
Developmentally appropriate explanations

For Distractors (9 checks)

1. Specific Error Identification

Identifies the specific error or misconception that led to the incorrect choice
Not just “this is wrong” but explains the thinking error

2. Misconception Diagnosis

Diagnoses the underlying misconception or reasoning flaw
Helps students understand their thinking process

3. Textual Refutation

Uses passage evidence to explain why the option is incorrect
Shows where the reasoning breaks down

4. Correct Guidance

Redirects students toward correct understanding
Provides a path to the right answer without giving it away directly

5. Actionable Strategy

Offers a concrete strategy students can use to avoid similar errors
Teaches transferable comprehension skills

6. Reasoning Model

Models expert reading comprehension thinking
Shows how proficient readers approach the question

7-9. Tone, Conciseness, Grade Appropriateness

Same criteria as correct answer explanations
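The way the check sets combine for each answer option can be sketched as below. The identifiers are illustrative names for the checks listed in this section, not the pipeline's actual keys.

```python
# Illustrative identifiers for the explanation checks described above.
CORRECTNESS_CHECKS = [
    "correctness_explanation",
    "textual_evidence",
    "skill_reinforcement",
]
EXPLANATION_DISTRACTOR_CHECKS = [
    "specific_error_identification",
    "misconception_diagnosis",
    "textual_refutation",
    "correct_guidance",
    "actionable_strategy",
    "reasoning_model",
]
UNIVERSAL_CHECKS = ["tone", "conciseness", "grade_appropriateness"]


def explanation_checks(is_correct: bool) -> list[str]:
    """Return the checks for one option's explanation:
    role-specific checks first, then the three universal checks."""
    specific = CORRECTNESS_CHECKS if is_correct else EXPLANATION_DISTRACTOR_CHECKS
    return specific + UNIVERSAL_CHECKS


correct_answer_checks = explanation_checks(is_correct=True)
distractor_checks = explanation_checks(is_correct=False)
```

A correct-answer explanation gets 3 + 3 = 6 checks and a distractor explanation gets 6 + 3 = 9, matching the counts in the headings above.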

API Requirements

Anthropic Claude API:

Required for question QC
Runs eight of the question checks: parallelism, plausibility, homogeneity, specificity, standard alignment, clarity, single correct answer, and passage reference

OpenAI API:

Required for explanation QC
Optional for question QC (enables too-close detection and difficulty assessment)
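The key requirements above imply a simple gating rule, sketched below. The environment variable names `ANTHROPIC_API_KEY` and `OPENAI_API_KEY` are assumed here (they are the conventional names, but this document does not say how the pipeline reads its keys), and the function names are hypothetical. Length Balance is left out because the text above does not assign it to either API.

```python
import os

# Checks the text above attributes to the Anthropic Claude API.
CLAUDE_CHECKS = [
    "grammatical_parallelism",
    "plausibility",
    "homogeneity",
    "specificity_balance",
    "standard_alignment",
    "clarity_precision",
    "single_correct_answer",
    "passage_reference",
]
# Checks enabled only when an OpenAI key is available.
OPENAI_CHECKS = ["too_close", "difficulty_assessment"]


def available_question_checks(env=None) -> list[str]:
    """Return the API-backed question checks that can run with the given keys."""
    env = os.environ if env is None else env
    if not env.get("ANTHROPIC_API_KEY"):  # assumed variable name
        raise RuntimeError("Question QC requires an Anthropic API key")
    checks = list(CLAUDE_CHECKS)
    if env.get("OPENAI_API_KEY"):  # assumed variable name
        checks += OPENAI_CHECKS
    return checks


def can_run_explanation_qc(env=None) -> bool:
    """Explanation QC requires an OpenAI key."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY"))


# Example with explicit dictionaries instead of the real environment.
claude_only = available_question_checks({"ANTHROPIC_API_KEY": "placeholder"})
both_keys = available_question_checks(
    {"ANTHROPIC_API_KEY": "placeholder", "OPENAI_API_KEY": "placeholder"}
)
```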

Performance:

Question QC: ~5-10 questions/minute
Explanation QC: ~20-40 options/minute
Sequential processing ensures consistency

Output Files

Each QC run writes the following files:

Question QC Results (question_qc_YYYYMMDD_HHMMSS.json)

Per-question QC results with individual check scores and responses
Overall quality score per question
Total checks passed/run

Explanation QC Results (explanation_qc_YYYYMMDD_HHMMSS.json)

Per-explanation QC results for each answer option
Detailed feedback on correct answer and distractor explanations

Summary Report (summary_report.json)

Consolidated statistics across all questions
Question QC pass rates and average scores
Explanation QC pass rates and average scores
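As a rough illustration of the file naming and the summary statistics, the sketch below builds a timestamped filename and aggregates a pass rate and average score. The per-result field names (`passed`, `score`) and the shape of the summary are assumptions; the actual JSON schema is not specified here.

```python
from datetime import datetime


def results_filename(prefix: str, now: datetime) -> str:
    """Build a name following the <prefix>_YYYYMMDD_HHMMSS.json pattern."""
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.json"


def summarize(results: list[dict]) -> dict:
    """Aggregate pass rate and average score over per-item results.

    Each result is assumed to carry a boolean 'passed' and a numeric 'score'.
    """
    total = len(results)
    if total == 0:
        return {"total": 0, "pass_rate": None, "average_score": None}
    passed = sum(1 for r in results if r["passed"])
    average = sum(r["score"] for r in results) / total
    return {"total": total, "pass_rate": passed / total, "average_score": average}


# Hypothetical results for two questions.
name = results_filename("question_qc", datetime(2024, 1, 2, 3, 4, 5))
summary = summarize([
    {"passed": True, "score": 1.0},
    {"passed": False, "score": 0.5},
])
```

The same `summarize` helper could be applied separately to question results and explanation results to fill the two halves of the summary report.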

Benchmark Results

Model	Question QC Pass Rate	Explanation QC Pass Rate	Avg Quality Score	Latency
Coming Soon	-	-	-	-