Turning Intent into Specifications: A Benchmark and an Interactive User-Assistant Agent
Abstract
Today's agents are highly effective at implementing well-scoped software design plans, but user intent is often vague and admits multiple equally valid solutions. In this paper, we introduce SpecBench, a new benchmark for evaluating an agent's ability to translate user intent into a structured, executable specification that aligns with user preferences. The agent is given access to past user conversations and may interact with the user for a fixed number of rounds to ask clarifying questions. We find that existing agents exhibit two extreme behaviors: they either (i) struggle to collaborate proactively with users, entering implementation mode too quickly while overestimating their understanding of user preferences, or (ii) exhaust their question budget by asking about every ambiguous design choice. To address these limitations, we introduce Buddy, a user-assistant agent. Buddy follows a workflow inspired by classical morphological analysis, decomposing user intent into a structured space of design dimensions and candidate choices. It then creates simulated users to evaluate these choices before engaging the real user to resolve the remaining ambiguities and finalize the specification. By shifting the focus from execution to specification, SpecBench and Buddy emphasize agent-user collaboration (not just code generation) as a key frontier in future agent design.
SpecBench Pipeline
SpecBench consists of 50 software design tasks across diverse domains with 48 simulated user personas. It evaluates agents on two complementary tasks: (1) preference elicitation, where agents predict user design choices after asking clarification queries, and (2) collaborative spec drafting, where agents work with users to produce structured specification documents rated on coverage, precision, consistency, insight, and readability.
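As a rough illustration of how the two tasks could be scored, the sketch below computes a match rate for preference elicitation and an aggregate rating for a drafted specification. It rests on assumptions not stated above: the function names, the use of plain accuracy for elicitation, the numeric rating scale, and the unweighted mean over the five criteria are all hypothetical and may differ from the benchmark's actual scoring.

```python
from statistics import mean

# The five rating criteria named in the text.
RUBRIC = ("coverage", "precision", "consistency", "insight", "readability")

def elicitation_accuracy(predicted, actual):
    """Fraction of design dimensions where the agent's prediction matches
    the user's actual choice (assumed metric, for illustration only)."""
    if not actual:
        raise ValueError("actual choices must not be empty")
    hits = sum(predicted.get(dim) == choice for dim, choice in actual.items())
    return hits / len(actual)

def spec_score(ratings):
    """Unweighted mean over the five criteria (assumed aggregation)."""
    missing = [c for c in RUBRIC if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return mean(ratings[c] for c in RUBRIC)

# Hypothetical example data.
predicted = {"storage": "sqlite", "auth": "oauth"}
actual = {"storage": "sqlite", "auth": "password"}
accuracy = elicitation_accuracy(predicted, actual)

ratings = {"coverage": 4, "precision": 3, "consistency": 5, "insight": 3, "readability": 5}
overall = spec_score(ratings)
```

Keeping the two scores separate mirrors the text's framing of the tasks as complementary: an agent can predict choices well yet still draft a weak document, or the reverse.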
Buddy Agent
Buddy decomposes a spec sheet into independent design dimensions using morphological analysis, then explores alternatives along each dimension. It constructs a user profile and conversation summary from prior interactions, deploys simulated users to resolve predictable decisions, and queries the real user only on the choices that remain uncertain. The result is higher-quality specifications with fewer user queries than existing agents.
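The workflow described above can be sketched in a few lines: a design space maps each dimension to its candidate choices, simulated users vote on each dimension, and only dimensions without clear agreement are escalated to the real user. This is a minimal sketch under stated assumptions, not Buddy's implementation: the dimension names, the vote data, the majority-vote rule, and the 0.75 agreement threshold are all hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical design space: each dimension maps to its candidate choices.
design_space = {
    "storage": ["sqlite", "postgres"],
    "auth": ["password", "oauth"],
    "ui": ["cli", "web"],
}

def enumerate_designs(space):
    """Yield every combination of one choice per dimension
    (the full grid of a morphological analysis)."""
    dims = list(space)
    for combo in product(*(space[d] for d in dims)):
        yield dict(zip(dims, combo))

def triage(votes, threshold=0.75):
    """Split dimensions into those settled by simulated-user agreement
    and those left for the real user.

    votes maps a dimension to the choices picked by simulated users.
    The threshold is an assumed cutoff, not a value from the source.
    """
    resolved, ask_user = {}, []
    for dim, picks in votes.items():
        choice, count = Counter(picks).most_common(1)[0]
        if count / len(picks) >= threshold:
            resolved[dim] = choice
        else:
            ask_user.append(dim)
    return resolved, ask_user

# Hypothetical votes from four simulated users.
votes = {
    "storage": ["sqlite", "sqlite", "sqlite", "sqlite"],
    "auth": ["oauth", "password", "oauth", "password"],
    "ui": ["web", "web", "web", "cli"],
}
resolved, ask_user = triage(votes)
```

With these votes, `storage` and `ui` are settled by the simulated users and only `auth` is left for the real user, so one question replaces three. The full grid from `enumerate_designs` grows multiplicatively with the number of dimensions, which is one reason to resolve dimensions independently rather than rank whole designs.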
Results
Buddy achieves higher-quality specifications while querying users significantly fewer times than Claude Code, Gemini CLI, and Cursor CLI. Existing agents ask either too few questions (Gemini CLI) or too many without effective prioritization (Claude Code, Cursor CLI).
Citation
@article{wang2026specbench,
title={Turning Intent into Specifications: A Benchmark and an Interactive User-Assistant Agent},
author={Wang, Hao and Han, Ligong and Xu, Kai and Srivastava, Akash},
year={2026}
}