SpecBench

Turning Intent into Specifications: A Benchmark and an Interactive User-Assistant Agent

¹Red Hat AI Innovation, ²MIT-IBM Watson AI Lab, ³IBM Core AI

Abstract

Today's agents are highly effective at implementing well-scoped software design plans, but user intent is often vague and admits multiple equally valid solutions. In this paper, we introduce SpecBench, a new benchmark for evaluating an agent's ability to translate user intent into a structured, executable specification that aligns with user preferences. The agent is given access to past user conversations and may interact with the user for a fixed number of rounds to ask clarifying questions. We find that existing agents exhibit two extreme behaviors: they either (i) struggle to collaborate proactively with users, entering implementation mode too quickly while overestimating their understanding of user preferences, or (ii) exhaust their question budget by asking about every ambiguous design choice. To address these limitations, we introduce Buddy, a user-assistant agent. Buddy follows a workflow inspired by classical morphological analysis: it decomposes user intent into a structured space of design dimensions and candidate choices, creates simulated users to evaluate these choices, and then engages the real user to resolve the remaining ambiguities and finalize the specification. By shifting the focus from execution to specification, SpecBench and Buddy emphasize agent-user collaboration (not just code generation) as a key frontier in future agent design.

SpecBench Pipeline

[Figure: SpecBench pipeline overview]

SpecBench consists of 50 software design tasks across diverse domains with 48 simulated user personas. It evaluates agents on two complementary tasks: (1) preference elicitation, where agents predict user design choices after asking clarification queries, and (2) collaborative spec drafting, where agents work with users to produce structured specification documents rated on coverage, precision, consistency, insight, and readability.
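The two tasks above can be pictured with a small scoring sketch. This is only an illustration under stated assumptions: the names `ElicitationResult`, `elicitation_accuracy`, and `spec_score` are hypothetical, and the page does not say how predictions are scored or how the five rubric ratings are aggregated. Exact-match accuracy and an unweighted mean are assumed here for concreteness.

```python
from dataclasses import dataclass
from typing import Dict

# The five rubric criteria named on this page for collaborative spec drafting.
RUBRIC = ("coverage", "precision", "consistency", "insight", "readability")


@dataclass
class ElicitationResult:
    """Hypothetical record of one preference-elicitation episode."""
    predicted: Dict[str, str]  # design dimension -> choice predicted by the agent
    actual: Dict[str, str]     # design dimension -> choice the persona prefers
    queries_used: int          # clarification queries the agent spent


def elicitation_accuracy(result: ElicitationResult) -> float:
    """Fraction of design dimensions where the prediction matches the persona.

    Assumption: exact-match accuracy; the benchmark's real metric may differ.
    """
    if not result.actual:
        return 0.0
    hits = sum(
        result.predicted.get(dim) == choice
        for dim, choice in result.actual.items()
    )
    return hits / len(result.actual)


def spec_score(ratings: Dict[str, float]) -> float:
    """Unweighted mean over the five rubric criteria (aggregation is assumed)."""
    missing = [c for c in RUBRIC if c not in ratings]
    if missing:
        raise ValueError(f"missing rubric ratings: {missing}")
    return sum(ratings[c] for c in RUBRIC) / len(RUBRIC)
```

Keeping `queries_used` next to the predictions makes it possible to compare agents on both accuracy and the number of questions they asked, which is the trade-off the benchmark highlights.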

Buddy Agent

[Figure: Buddy agent architecture]

Buddy decomposes a spec sheet into independent design dimensions using morphological analysis, then explores alternatives along each axis. It constructs a user profile and conversation summary from prior interactions, deploys simulated users to resolve predictable decisions, and strategically queries the real user only on uncertain choices. The result is higher-quality specs with fewer user queries than existing agents.
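The workflow described above can be sketched in a few lines. This is a minimal illustration, not Buddy's actual implementation: the function names, the majority-vote rule, and the `agreement_threshold` parameter are all assumptions, and simulated users are modeled as plain callables rather than profile-conditioned agents.

```python
from collections import Counter
from itertools import product


def design_space(dimensions):
    """Enumerate the morphological box: every combination of one choice per dimension."""
    names = list(dimensions)
    return [
        dict(zip(names, combo))
        for combo in product(*(dimensions[n] for n in names))
    ]


def plan_queries(dimensions, simulated_users, budget, agreement_threshold=0.8):
    """Split dimensions into ones the simulated users settle and ones to ask about.

    Assumptions (not from the source): each simulated user is a callable
    (dimension, choices) -> choice, and a dimension counts as "predictable"
    when the share of users backing the top choice reaches the threshold.
    """
    resolved, uncertain = {}, []
    for name, choices in dimensions.items():
        votes = Counter(user(name, choices) for user in simulated_users)
        top_choice, top_count = votes.most_common(1)[0]
        agreement = top_count / len(simulated_users)
        if agreement >= agreement_threshold:
            resolved[name] = top_choice
        else:
            uncertain.append((agreement, name, top_choice))

    # Spend the limited question budget on the most contested dimensions first.
    uncertain.sort(key=lambda item: item[0])
    to_ask = [name for _, name, _ in uncertain[:budget]]

    # Dimensions beyond the budget fall back to the simulated majority choice.
    for _, name, top_choice in uncertain[budget:]:
        resolved[name] = top_choice
    return resolved, to_ask
```

Ranking contested dimensions by agreement is one simple way to avoid both extremes noted in the abstract: nothing is asked when the simulated users agree, and the question budget is never exceeded when they do not.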

Results

[Figure: Preference elicitation results]
[Figure: Spec sheet quality comparison]

Buddy achieves higher-quality specifications with significantly fewer user queries than Claude Code, Gemini CLI, and Cursor CLI. Existing agents either ask too few questions (Gemini CLI) or ask too many without prioritizing them effectively (Claude Code, Cursor CLI).

Citation

@article{wang2026specbench,
    title={Turning Intent into Specifications: A Benchmark and an Interactive User-Assistant Agent},
    author={Wang, Hao and Han, Ligong and Xu, Kai and Srivastava, Akash},
    year={2026}
}