Context Window Budget for FE Tasks
The more you paste, the worse the answer
You have a bug on line 47 of a 400-line component. You select all, paste the entire file into the AI chat, and ask "what's wrong?" The model gives you a generic answer — or worse, starts suggesting a refactor. You try again with just the broken function and the error message. The model gives you an exact fix in one shot. Same model, same question, different context.
Context windows are not buckets you fill to the top for best results. The signal-to-noise ratio matters. The right context is surgical, not exhaustive.
AI models have a finite context window (commonly 32k–200k tokens, with some models offering more). Every token spent on irrelevant file content is a token not available for chain-of-thought reasoning, tool output, or conversation history. More critically, retrieval quality degrades as the window fills with low-relevance content.
The "lost in the middle" effect (Liu et al., 2023) showed that language models' accuracy dropped significantly when the relevant facts were placed in the middle of a long context rather than at the start or end. Relevant information buried in a file dump is harder to retrieve than the same information presented alone.
In agentic pipelines, context budget management is a first-class architectural concern. Multi-turn conversations, tool call histories, and retrieved documents all consume the window. Reserve headroom by design:
- System prompt: 2–8k tokens (fixed)
- Retrieved context: 8–20k tokens (dynamic, RAG output)
- Conversation history: grows per turn — summarize or truncate old turns
- Chain-of-thought reasoning: 2–8k for complex tasks
- Tool outputs: variable — truncate long outputs
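The allocation above can be written down as an explicit budget check. What follows is a minimal sketch, not a prescribed implementation: the 128k window is an assumption, the `ContextBudget` type and `headroom` function are names invented for this example, and the history and tool-output figures are placeholders because the list leaves them variable.

```typescript
// Hypothetical per-role budget, in tokens. Field names are illustrative.
interface ContextBudget {
  systemPrompt: number;
  retrievedContext: number;
  conversationHistory: number;
  reasoning: number;
  toolOutputs: number;
}

// Returns the tokens left unallocated; a negative result means the
// budget over-commits the window.
function headroom(windowSize: number, budget: ContextBudget): number {
  const allocated =
    budget.systemPrompt +
    budget.retrievedContext +
    budget.conversationHistory +
    budget.reasoning +
    budget.toolOutputs;
  return windowSize - allocated;
}

const budget: ContextBudget = {
  systemPrompt: 8000, // upper end of the 2–8k range
  retrievedContext: 20000, // upper end of the 8–20k range
  conversationHistory: 16000, // placeholder: grows per turn
  reasoning: 8000, // upper end of the 2–8k range
  toolOutputs: 8000, // placeholder: variable
};

const remaining = headroom(128000, budget); // assumed 128k window
console.log(`Unallocated headroom: ${remaining} tokens`);
```

Running the same check against a smaller window makes over-commitment visible early: a negative result means something has to be summarized or truncated before the request is sent.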
For single-shot code tasks, the principle is simpler: send the minimum sufficient context. Error + stack trace + relevant function + interface types is usually enough for a correct answer.
What to include — and what to leave out
Always include:
- The exact error message and stack trace
- The function or component where the error occurs (not the whole file)
- What you expected to happen
- What you tried
Usually skip:
- Full package.json or tsconfig.json (unless the question is about config)
- Unrelated components, utilities, or styles
- Comments and documentation inside the pasted code
- Boilerplate and imports that aren't relevant to the error
// ❌ Low-signal context: 400-line file dump
// Model has to search for the bug in noise
// ✅ High-signal context: targeted excerpt
// Error: "Cannot read properties of undefined (reading 'map')"
// Stack: ProductList.tsx:47
function ProductList({ products }) {
  return (
    <ul>
      {products.map(p => <li key={p.id}>{p.name}</li>)} {/* line 47 */}
    </ul>
  );
}
// Called from: Dashboard.tsx; products comes from the useProducts() hook
// useProducts returns: { data, isLoading, error } from React Query
// Question: why is products undefined on first render?
This context is 12 lines — enough for the model to diagnose that products is undefined before React Query resolves, and suggest products ?? [] as a default.
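The suggested fix can be sketched without JSX. This is an illustration only: `productNames` is a hypothetical stand-in for the component's render logic, and the `Product` shape is assumed from the excerpt (`id` and `name`).

```typescript
// Assumed shape, inferred from the p.id and p.name accesses in the excerpt.
interface Product {
  id: string;
  name: string;
}

// Defaulting the destructured prop means a render that happens before the
// data arrives maps over [] instead of undefined.
function productNames({ products = [] }: { products?: Product[] }): string[] {
  return products.map((p) => p.name);
}

// First render: the query has not resolved, so products is undefined.
console.log(productNames({}));

// Later render: data has arrived.
console.log(productNames({ products: [{ id: "a1", name: "Keyboard" }] }));
```

A destructuring default only covers `undefined`. If the value can also be `null`, `products ?? []` at the point of use covers both.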
For agentic tasks (not single-shot debugging), context assembly should be programmatic:
- Embedding search — retrieve the N most-relevant chunks by cosine similarity to the task description
- Re-ranking — a second pass (cross-encoder or LLM-based) to reorder retrieved chunks by actual relevance
- Summarization — compress long files to summaries before including them (file structure + exported API surface, not implementation)
- Instruction position — put the most important instructions at the start or end of the context, not in the middle, to exploit positional recall
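The embedding-search step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are toy two-dimensional vectors rather than output from a real embedding model, `Chunk`, `cosineSimilarity`, and `topChunks` are names invented here, and re-ranking, summarization, and instruction placement are left out.

```typescript
// A retrievable unit of code plus its (precomputed) embedding vector.
interface Chunk {
  id: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the task embedding and keep the n best.
function topChunks(query: number[], chunks: Chunk[], n: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, n)
    .map((entry) => entry.chunk);
}

// Toy data: file names and vectors are made up for the example.
const chunks: Chunk[] = [
  { id: "ProductList.tsx", embedding: [1, 0] },
  { id: "theme.css", embedding: [0, 1] },
  { id: "useProducts.ts", embedding: [1, 1] },
];

const best = topChunks([1, 0], chunks, 2);
console.log(best.map((c) => c.id));
```

In a real pipeline the scores would come from a vector index rather than a linear scan, and the surviving chunks would then go through the re-ranking pass.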
The cost of context is not just tokens: it's latency (more input tokens mean a slower time to first token, or TTFT) and quality (signal dilution). Design for minimum sufficient context as a hard constraint.
10,000 lines for a 5-line bug
Marcus had a bug: a React component was crashing on undefined data. He copied his entire 10,000-line codebase — every component, every utility, every config file — pasted it into a single AI prompt, and asked "what's wrong?" The model gave a generic answer about null checks. It wasn't wrong, exactly, but it wasn't specific to his component either. He tried adding more context. Still generic. The bug shipped to staging.
A colleague looked over his shoulder, selected the 15 lines of the broken function, the error message, and the hook that called it — and pasted just that. The model identified the exact issue (missing default on a destructured prop) in one sentence. Same model, three minutes later.
What went wrong technically: Marcus's 10,000-line paste used most of the context window before the actual question even started. The relevant function was on line 4,847, roughly in the middle of the context. The "lost in the middle" effect (Liu et al., 2023) shows that retrieval accuracy for facts placed in the middle of long contexts drops significantly compared to facts placed at the start or end. The bug was there, but positionally disadvantaged. The model's attention was spread across thousands of unrelated lines.
Systemic impact: in a team of 10, if every developer is habitually pasting full files into AI sessions, you are burning collective context budget on noise at scale. More practically: large context pastes push the actual task content toward the middle of the window, reduce chain-of-thought headroom, and increase latency (TTFT scales with input token count). The result is systematically worse answers that take longer to arrive — a compounding tax on every AI-assisted debugging session.
Surgical context, exact answer
After the staging incident, Marcus changed his approach. For the same class of bug, he now pastes: the exact error message and stack trace, the specific function that throws, the TypeScript types it depends on, and one example of how it's called. That's usually 15–30 lines and a type definition. The answers are specific, actionable, and correct. Response time dropped from 8 seconds to under 2.
The rule he tells his team: "If you can't explain why every line you're pasting is necessary to answer the question, remove it."
The curated context for his bug report looked like this: error message (3 lines), the crashing component function (12 lines), the TypeScript interface for its props (6 lines), the one hook call that feeds it (4 lines), and the question. Total: ~180 tokens vs ~7,800 tokens for the full file dump. The model answered correctly on the first attempt. The delta wasn't model capability; it was signal clarity. Relevant information at the start of a short context tends to outperform the same information buried in the middle of a large one.
Measurement angle: instrument your AI-assisted workflows with token counts (use tiktoken or the model API's usage response field). Track average input tokens per session and answer quality (proxied by "accepted first suggestion" rate). Teams that audit context discipline typically find a 5–10× difference in average input tokens between high-signal and low-signal prompters, with no corresponding quality improvement for larger inputs. Budget context as a first-class metric alongside latency and cost.
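The two metrics named above can be computed from per-session logs. The sketch below assumes a hypothetical `SessionLog` record; the field names and the sample values are invented for illustration and are not measurements.

```typescript
// Hypothetical log record: one entry per AI-assisted session.
interface SessionLog {
  inputTokens: number; // e.g. copied from the API response's usage field
  acceptedFirstSuggestion: boolean; // did the developer keep the first answer?
}

interface ContextMetrics {
  avgInputTokens: number;
  acceptedFirstRate: number; // fraction between 0 and 1
}

// Average input size and first-suggestion acceptance rate over a set of sessions.
function summarize(sessions: SessionLog[]): ContextMetrics {
  if (sessions.length === 0) {
    return { avgInputTokens: 0, acceptedFirstRate: 0 };
  }
  const totalTokens = sessions.reduce((sum, s) => sum + s.inputTokens, 0);
  const accepted = sessions.filter((s) => s.acceptedFirstSuggestion).length;
  return {
    avgInputTokens: totalTokens / sessions.length,
    acceptedFirstRate: accepted / sessions.length,
  };
}

// Made-up sample sessions, for illustration only.
const metrics = summarize([
  { inputTokens: 180, acceptedFirstSuggestion: true },
  { inputTokens: 7800, acceptedFirstSuggestion: false },
  { inputTokens: 220, acceptedFirstSuggestion: true },
  { inputTokens: 2400, acceptedFirstSuggestion: false },
]);
console.log(metrics);
```

Grouping the same summary by developer or by week is one way to see whether context discipline is changing over time.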
Pattern at a glance
Annotated example: prompting for a React bug
❌ FULL FILE DUMP
// [10,000 lines of codebase]
// imports, configs, helpers,
// unrelated components...
// ... line 4847:
function ProductList({ products }) {
  return products.map(p => ...);
}
~7,800 tokens: relevant function buried in the middle; generic, non-specific answer
✅ FOCUSED SNIPPET
// Error: Cannot read properties of
// undefined (reading 'map')
// Stack: ProductList.tsx:47
function ProductList({ products }) {
  return products.map(p => ...);
}
// Props type: { products: Product[] }
// Caller: <ProductList products={useProducts().data} />
~180 tokens: error + function + types + caller; exact diagnosis on first attempt
Try it: overloaded vs curated context
The "Overloaded" mode pastes a full 300-line component to ask about a 5-line bug. Watch the model's answer quality. The "Curated" mode sends the same question with just the relevant excerpt. Compare the specificity of the answers.
Count the tokens in each mode. The overloaded context uses ~2,400 tokens; the curated context uses ~180 tokens — 13× less. The answer quality is equal or better because the model isn't distracted by irrelevant code.
This simulates the retrieval degradation effect: in the overloaded context, the bug is on line 47 of 300. In a real 128k window, burying the relevant code in the middle of a large context produces exactly this answer quality difference.
Counting tokens and chunking context programmatically
To count tokens before you send a prompt, use the tiktoken library (for OpenAI-family models) or the anthropic.messages.count_tokens API method. The key pitfall: whitespace and comments cost tokens too — stripping them from code snippets before pasting can reduce token count by 15–30% with no loss of meaning for the model.
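// Count tokens before sending (Node.js, tiktoken)
import { encoding_for_model } from 'tiktoken';

// Example prompt; replace with the text you are about to send
const myPromptString = "Error: Cannot read properties of undefined (reading 'map')";
const enc = encoding_for_model('gpt-4o');
const tokens = enc.encode(myPromptString).length;
enc.free(); // release the WASM-backed encoder once you are done with it
console.log(`Token count: ${tokens}`);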
Three chunking strategies for larger codebases in AI workflows:
- File-per-turn — send one file per message in a multi-turn conversation. The model builds understanding incrementally without window saturation.
- Semantic relevance chunking — embed all files, cosine-search against the task description, send only the top-N chunks. Works well for RAG-style code assistants.
- API surface only — for dependencies, send only exported function signatures and their JSDoc — not the full implementation. The model needs the contract, not the code.
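The "API surface only" strategy can be illustrated with a deliberately naive extractor. This is a sketch under heavy assumptions: it only recognizes single-line `export function` declarations, it breaks on signatures that contain `{` (for example object-typed parameters), and `apiSurface` and the sample module are invented for the example. A real tool would more likely use a proper parser.

```typescript
// Keep only exported function signatures, dropping the bodies.
// Naive: line-based, single-line declarations only.
function apiSurface(source: string): string[] {
  return source
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^export\s+(async\s+)?function\s/.test(line))
    .map((line) => line.replace(/\s*\{.*$/, ";")); // cut at the opening brace
}

// Made-up dependency module, written as joined lines for readability.
const sampleModule = [
  "export function formatPrice(cents: number): string {",
  "  return (cents / 100).toFixed(2);",
  "}",
  "function internalHelper(): void {}",
  "export async function fetchProducts(): Promise<Product[]> {",
  "  return [];",
  "}",
].join("\n");

const surface = apiSurface(sampleModule);
console.log(surface.join("\n"));
```

The non-exported helper and both function bodies are dropped, so the model sees the contract without the implementation.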
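// Tool call pattern: incremental context in an agentic loop.
// Instead of stuffing all files in upfront, expose a tool so the
// model requests only what it needs, on demand.
const tools = [{
  name: 'read_file',
  description: 'Read a file by path. Call once per file needed.',
  // JSON Schema for the arguments. The field name varies by API:
  // Anthropic uses input_schema, OpenAI uses parameters.
  input_schema: {
    type: 'object',
    properties: { path: { type: 'string' } },
    required: ['path'],
  },
}];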
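On the application side, something has to execute the `read_file` request and keep the result within budget. The handler below is a hypothetical sketch: `handleReadFile`, the injected `FileReader`, and the character cap are assumptions for illustration, and a character cap is only a rough proxy for a token cap.

```typescript
// The reader is injected so the handler can be exercised without touching disk.
type FileReader = (path: string) => string;

// Read the requested file and truncate long output before it goes back
// into the conversation as a tool result.
function handleReadFile(path: string, readFile: FileReader, maxChars = 4000): string {
  const text = readFile(path);
  if (text.length <= maxChars) return text;
  const dropped = text.length - maxChars;
  return text.slice(0, maxChars) + `\n[truncated ${dropped} characters]`;
}

// In-memory stand-in for the file system; the paths are made up.
const fakeFiles: Record<string, string> = {
  "src/ProductList.tsx": "function ProductList({ products }) { /* ... */ }",
  "src/generated/schema.ts": "x".repeat(10000),
};
const reader: FileReader = (path) => fakeFiles[path] ?? "";

const small = handleReadFile("src/ProductList.tsx", reader);
const large = handleReadFile("src/generated/schema.ts", reader);
console.log(small.length, large.length);
```

Telling the model how much was cut lets it decide whether to ask for a narrower read instead of silently reasoning over a partial file.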
At scale, context assembly should be a pipeline stage, not a manual step. The production pattern: (1) embed all source files at build time, (2) on each task, semantic-search the index for relevant chunks, (3) re-rank with a cross-encoder or second LLM pass, (4) trim to a hard token budget (e.g. 8k for code context, reserving 4k for reasoning and output). Instrument usage.input_tokens (or your provider's equivalent, such as usage.prompt_tokens in OpenAI's Chat Completions API) from every API response to track actual spend vs budget. Track "accepted first suggestion" rate as a proxy for context quality; it should improve as context discipline improves. Tooling: LangSmith, Braintrust, or custom logging to a time-series store for per-session token metrics.
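Step (4) of that pipeline, trimming to a hard token budget, might look like the sketch below. It assumes the chunks arrive already ranked and already carry a token count (for example from a tokenizer run at index time); `RankedChunk` and `trimToBudget` are names invented here, and the greedy skip-and-continue policy is one possible choice among several.

```typescript
// A retrieved chunk with a precomputed token count.
interface RankedChunk {
  id: string;
  tokens: number;
}

// Walk the ranked list in order and keep each chunk that still fits.
// Chunks that would overflow the budget are skipped, so a smaller,
// lower-ranked chunk can still use the remaining space.
function trimToBudget(ranked: RankedChunk[], maxTokens: number): RankedChunk[] {
  const kept: RankedChunk[] = [];
  let used = 0;
  for (const chunk of ranked) {
    if (used + chunk.tokens > maxTokens) continue;
    kept.push(chunk);
    used += chunk.tokens;
  }
  return kept;
}

// Made-up ranked results, best first.
const ranked: RankedChunk[] = [
  { id: "a", tokens: 5000 },
  { id: "b", tokens: 4000 },
  { id: "c", tokens: 2000 },
];

const selected = trimToBudget(ranked, 8000); // 8k code-context budget
console.log(selected.map((c) => c.id));
```

Stopping at the first chunk that does not fit is the stricter alternative. It preserves rank order exactly, at the cost of leaving some of the budget unused.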
Key takeaways
- Send the error, the relevant function, and what you expected, not the whole file. 20 lines beats 400 lines for debugging tasks.
- Context window quality degrades with noise. The signal-to-noise ratio of your context directly affects answer specificity; more is not better.
- In agentic pipelines, budget context by role: system prompt, retrieved docs, history, reasoning, tool output. Reserve headroom. Overfilling the window degrades recall on middle-positioned content.
- Before pasting, ask: "does the model need this to answer my question?" If not, remove it.
- Use embedding search + re-ranking for programmatic context assembly in pipelines. Don't stuff in all relevant files; rank and trim to the top N chunks.
- Position matters: critical instructions and key facts belong at the start or end of the context window, not buried in the middle where recall degrades (the "lost in the middle" effect).