Research · April 2026 · 6 min read
How prompt structure changes model behaviour.
Three findings from the prompting literature, plus one recent survey that catalogues everything since. We summarise what each measured, what it implies for the prompts you write, and where the evidence stops.
01 · The cost of vague
Vague prompts make the model guess.
Every round-trip to a language model costs time and tokens. When the input is ambiguous, the model fills in the gaps with whatever the average request looks like — not yours. The fix isn't a longer back-and-forth. It's a better first message.
Here's the difference, made visible. Same model, same task. Each prompt below was used as the entire input, and the note under "Model output" describes what came back.
Write a marketing email for our new feature.
Model output
Generic, padded, ends with a non-CTA. Twenty minutes from a usable email.
You are a B2B SaaS copywriter who specialises in product launches.
Context: We are launching a collaborative prompt library for teams.
The audience is growth marketers at 50–500 person software companies.
They care about saving time and repeatable results.
Task: Write a launch announcement email for our new shared prompt library.
Constraints:
- Subject line under 50 characters
- Body under 180 words
- No jargon ("synergy", "leverage", "game-changer")
- End with exactly one CTA linking to the product page
Output format: Plain text. Subject line first, then body.
Model output
Specific subject, clear audience, ends with one CTA. Send-ready.
Outputs are illustrative — produced by Claude Sonnet on default settings. Yours will differ.
The structured version takes about thirty seconds longer to write. It typically eliminates two to four follow-up messages. On any task you do more than once (every task, eventually), that math gets very good very fast.
02 · Anatomy
Five things every good prompt has.
Structure isn't about being long for its own sake. It's about removing whole categories of failure before the model ever sees the question. You don't need every block every time — but you should know which one you're leaving out, and why.
- 01
Role / persona
Sets the model's frame of reference and the knowledge it draws on.
"You are a senior TypeScript engineer reviewing a pull request."
Without a role, the model defaults to a generic helpful assistant. With one, it draws on a much narrower slice of its training and writes accordingly.
- 02
Context
The facts the model needs to avoid guessing them itself.
"The codebase uses strict ESLint and targets Node 20."
Missing context is the most common cause of wrong-but-confident answers. Spell out the facts so the model doesn't fill in plausible-sounding ones.
- 03
Task
A single unambiguous instruction. One task per prompt.
"Refactor the function below to eliminate the nested conditionals."
Two tasks in one prompt usually means one of them gets a half-effort answer. Split them. Ask one thing at a time, well.
- 04
Constraints
Boundaries that prevent unwanted outputs before they happen.
"Do not change the function signature or return type."
Constraints upfront beat corrections after the fact. Every "don't" you list saves one round-trip you would have spent fixing it.
- 05
Output format
Tells the model exactly what shape the answer should take.
"Return only the refactored function — no explanation."
If you don't say what you want back, you'll get markdown when you wanted JSON, paragraphs when you wanted bullets, and a polite preface either way.
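The five blocks above can be sketched as a small data structure. This is a minimal illustration, not a standard API: `PromptSpec`, `build_prompt`, and `missing_blocks` are hypothetical names, and the block labels follow the email example earlier in this piece. The second helper exists so that leaving a block out is a decision rather than an accident.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """The five blocks described above. Only `task` is required (an assumption)."""
    task: str
    role: str = ""
    context: str = ""
    constraints: list[str] = field(default_factory=list)
    output_format: str = ""


def build_prompt(spec: PromptSpec) -> str:
    """Join the non-empty blocks in a fixed order, separated by blank lines."""
    parts: list[str] = []
    if spec.role:
        parts.append(spec.role)
    if spec.context:
        parts.append(f"Context: {spec.context}")
    parts.append(f"Task: {spec.task}")
    if spec.constraints:
        bullets = "\n".join(f"- {c}" for c in spec.constraints)
        parts.append(f"Constraints:\n{bullets}")
    if spec.output_format:
        parts.append(f"Output format: {spec.output_format}")
    return "\n\n".join(parts)


def missing_blocks(spec: PromptSpec) -> list[str]:
    """Name the optional blocks left empty, so omissions are deliberate."""
    optional = {
        "role": spec.role,
        "context": spec.context,
        "constraints": spec.constraints,
        "output_format": spec.output_format,
    }
    return [name for name, value in optional.items() if not value]


spec = PromptSpec(
    role="You are a senior TypeScript engineer reviewing a pull request.",
    context="The codebase uses strict ESLint and targets Node 20.",
    task="Refactor the function below to eliminate the nested conditionals.",
    constraints=["Do not change the function signature or return type."],
)
prompt = build_prompt(spec)
print(prompt)
print("Left out:", missing_blocks(spec))
```

Here the output-format block is deliberately omitted, and `missing_blocks` reports it, which is the "know which one you're leaving out" check in code form.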
03 · The research
The numbers don't lie.
Three findings spanning the foundational papers (2022) through the more recent reasoning techniques (2023). The 2024 Prompt Report [4], a systematic survey from a coalition of researchers across Maryland, OpenAI, Stanford, Microsoft, and others, catalogues 58 text-based prompting techniques, and the broad pattern recurs across that literature: how a prompt is structured measurably changes what comes back.
Chain-of-thought
GSM8K accuracy on PaLM 540B
Showing the model a few worked examples with step-by-step reasoning raised accuracy from 17.9% to 56.9% on grade-school math word problems.
Wei et al., 2022 [1]
Tree of Thoughts
Game of 24 success rate, GPT-4
Letting the model explore and prune multiple reasoning paths takes a task it solves 4% of the time with chain-of-thought prompting and pushes it to 74%.
Yao et al., 2023 [2]
Self-consistency
GSM8K — CoT vs. CoT + sampling
Sampling several chain-of-thought attempts and taking the majority answer adds roughly another 18 points on top (56.5% to 74.4% on PaLM 540B).
Wang et al., 2022 [3]
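The self-consistency idea is simple enough to sketch: sample several reasoning chains, pull the final answer out of each, and keep the most common one. The code below is a minimal sketch under stated assumptions. `sample` stands in for whatever model call you use (here a stub that replays canned completions, not a real API), and taking the last number in the text as the answer is a heuristic of ours, not the paper's exact parser.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Return the last number in a completion, or None if there is none.

    Assumption: the final number in the text is the model's answer.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None


def self_consistency(
    prompt: str, sample: Callable[[str], str], n: int = 5
) -> Optional[str]:
    """Sample n completions and return the majority answer (None if no answers)."""
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def make_stub_sampler(completions: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model call: replays canned completions in order."""
    remaining = iter(completions)

    def sample(prompt: str) -> str:
        return next(remaining)

    return sample


# Five made-up reasoning chains for "3 boxes of 4 pens, plus 2 loose pens".
canned = [
    "3 boxes of 4 is 12, plus 2 is 14. The answer is 14.",
    "4 + 4 + 4 = 12, then 12 + 2 = 14. The answer is 14.",
    "3 + 4 = 7, plus 2 is 9. The answer is 9.",
    "3 * 4 = 12 and 12 + 2 = 14. The answer is 14.",
    "4 * 3 = 12. 12 + 2 = 15. The answer is 15.",
]
result = self_consistency("How many pens?", make_stub_sampler(canned), n=5)
print(result)
```

Two of the five chains go wrong in different ways, but the three correct ones agree, so the vote returns 14. That is the whole mechanism: errors tend to scatter, correct answers tend to coincide.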
These aren't toy effects, and they aren't old news. The same patterns show up across the reasoning, coding, and knowledge benchmarks Anthropic, OpenAI, and DeepMind have continued to publish through 2024 and into 2025, and increasingly across structured-output techniques like Anthropic's XML-tag prompting (accuracy lifts of 15-20% over plain text are sometimes reported on Claude, though no benchmark is cited for that figure), DSPy-style prompt programming, and ReAct-style tool use. Structure remains the cheapest performance lever you have access to: a minute of writing, and a model that performs as if it were a tier larger than it actually is.
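For readers who haven't seen XML-tag prompting, the idea is to wrap each block of the prompt in a named tag so the model can tell instructions from material. A minimal sketch follows. The tag names are our own choice for illustration (there is no required vocabulary), and `wrap` and `xml_prompt` are hypothetical helpers, not part of any SDK.

```python
def wrap(tag: str, body: str) -> str:
    """Wrap one block in <tag>...</tag>; refuse bodies that would close it early."""
    closing = f"</{tag}>"
    if closing in body:
        raise ValueError(f"body already contains {closing}")
    return f"<{tag}>\n{body.strip()}\n{closing}"


def xml_prompt(sections: dict[str, str]) -> str:
    """Join wrapped blocks in insertion order, separated by blank lines."""
    return "\n\n".join(wrap(tag, body) for tag, body in sections.items())


prompt = xml_prompt({
    "role": "You are a senior TypeScript engineer reviewing a pull request.",
    "context": "The codebase uses strict ESLint and targets Node 20.",
    "task": "Refactor the function below to eliminate the nested conditionals.",
    "constraints": "Do not change the function signature or return type.",
})
print(prompt)
```

The body is left unescaped on purpose, since the prompt is read by a model rather than an XML parser; the only guard is against a body that contains its own closing tag.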
Sources
- [1] Wei et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arxiv.org
- [2] Yao et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arxiv.org
- [3] Wang et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arxiv.org
- [4] Schulhoff et al. (2024). The Prompt Report: A Systematic Survey of Prompt Engineering Techniques. arxiv.org
04 · A real prompt, dissected
What it looks like in the wild.
Every block from the previous section, in a single prompt you could drop into your workflow today. The callouts on the right map each block to the lines it covers.
You are a UX researcher who specialises in B2B software.

# Context
I conducted 6 customer interviews this week.
Each transcript is pasted below, separated by "---".
Participants were asked about their current prompt workflows
and pain points.

# Task
Summarise the interviews into a concise research brief.

# Constraints
- Max 400 words total
- Group findings by theme, not by participant
- Exclude anything mentioned only once (not a pattern)
- Flag any direct quotes worth keeping verbatim

# Output format
Markdown. Top-level heading per theme. Bullet points for findings.
Quotes in > blockquotes. End with a "Key tensions" section.

---
[TRANSCRIPT 1] …
Role
Line 1
Context
Lines 3–7
Task
Lines 9–10
Constraints
Lines 12–16
Output format
Lines 18–20
The task itself is the shortest part. That's usually a sign you got the rest right.
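The callout ranges can be recovered mechanically, which is a handy check when you edit a prompt and want to know the blocks still sit where you think they do. The sketch below assumes the layout used in this example: a role line first, then blocks that each start with a `# Heading` line, with the pasted material after a `---` separator. `block_ranges` is a hypothetical helper, and the line wrapping in `PROMPT` is our assumption about how the original was laid out.

```python
PROMPT = """\
You are a UX researcher who specialises in B2B software.

# Context
I conducted 6 customer interviews this week.
Each transcript is pasted below, separated by "---".
Participants were asked about their current prompt workflows
and pain points.

# Task
Summarise the interviews into a concise research brief.

# Constraints
- Max 400 words total
- Group findings by theme, not by participant
- Exclude anything mentioned only once (not a pattern)
- Flag any direct quotes worth keeping verbatim

# Output format
Markdown. Top-level heading per theme. Bullet points for findings.
Quotes in > blockquotes. End with a "Key tensions" section.

---
[TRANSCRIPT 1]
"""


def block_ranges(prompt: str) -> dict[str, tuple[int, int]]:
    """Map each block name to its (first, last) 1-indexed line numbers.

    Text before the first heading is treated as the role. Trailing blank
    lines are not counted, and scanning stops at the "---" separator.
    """
    ranges: dict[str, tuple[int, int]] = {}
    name, start, last = "Role", 1, 0  # last = last non-blank line in the block
    for number, line in enumerate(prompt.splitlines(), start=1):
        if line.strip() == "---":
            break
        if line.startswith("# "):
            if last >= start:
                ranges[name] = (start, last)
            name, start, last = line[2:].strip(), number, number
        elif line.strip():
            last = number
    if last >= start:
        ranges[name] = (start, last)
    return ranges


ranges = block_ranges(PROMPT)
for name, (first, final) in ranges.items():
    print(f"{name}: lines {first}-{final}")
```

With this wrapping, the role occupies line 1, context lines 3-7, the task lines 9-10, constraints lines 12-16, and output format lines 18-20.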
What this means for you
Design the prompt once. Run it forever.
The prompts worth designing aren't the one-offs; those are faster to type fresh than to engineer. The ones worth designing are the ones you'll run again next Tuesday, and the Tuesday after that. Two minutes of structure today saves thirty seconds on each of the next fifty runs, about twenty-five minutes in total. Plus you don't have to remember the few-shot examples in week three. Plus your teammate doesn't have to reinvent it.
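The arithmetic is worth a moment. Here is a back-of-the-envelope sketch using the figures from the paragraph above (two minutes of design, thirty seconds saved per run, fifty runs). The function names are ours, and real savings will vary with the task.

```python
import math


def net_seconds_saved(design_cost_s: int, saving_per_run_s: int, runs: int) -> int:
    """Total time saved across all runs, minus the one-off design cost."""
    return saving_per_run_s * runs - design_cost_s


def break_even_runs(design_cost_s: int, saving_per_run_s: int) -> int:
    """Smallest number of runs at which the savings cover the design cost."""
    return math.ceil(design_cost_s / saving_per_run_s)


design_cost_s = 2 * 60    # two minutes of structure
saving_per_run_s = 30     # thirty seconds saved per run
runs = 50

gross = saving_per_run_s * runs                                # 1500 s = 25 min
net = net_seconds_saved(design_cost_s, saving_per_run_s, runs)  # 1380 s = 23 min
print(f"gross: {gross // 60} min, net: {net // 60} min")
print(f"pays for itself after {break_even_runs(design_cost_s, saving_per_run_s)} runs")
```

On these numbers the prompt pays for itself by the fourth run and is 23 minutes ahead after fifty.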
That's what Pergamum is for. It's a free, open library for the prompts that earn their keep. Every prompt has its variables broken out as fillable inputs, every one is tagged for the model it was tuned on, and every one is yours to copy, fork, and remix.