19 Chapter 18: Context Engineering for Production Agents
20 From Prompt Engineering to Context Engineering
Early LLM applications relied on prompt engineering: carefully crafted instructions for frozen models at inference time. While effective for single-turn tasks, this paradigm proves insufficient for agents, tool use, long-horizon reasoning, and production reliability.
Modern systems increasingly rely on context engineering: the systematic design, optimization, and evolution of all inputs to an LLM (prompts, demonstrations, memory, tools, and feedback) without modifying model weights.
Why this matters for production:
Distribution shift requires runtime adaptation
Multi-step agents need accumulated domain knowledge
No retraining cost or latency
Human-interpretable updates
Compatible with long-context infrastructure
21 Static Prompt Engineering: Core Techniques
21.1 Essential Patterns
Instruction prompting: Clear task specification, constraints, formatting requirements. Reduces ambiguity, enforces structure.
Few-shot prompting: Provide demonstrations of input \(\to\) output behavior. Induces task understanding via in-context learning.
Chain-of-Thought (CoT): Encourage intermediate reasoning steps before producing answer. Improves multi-step reasoning and compositional tasks.
Self-consistency: Sample multiple reasoning paths, aggregate answers. Reduces variance in reasoning-heavy tasks.
Decomposition: Break tasks into subtasks (plan \(\to\) execute \(\to\) verify). Improves controllability and debuggability.
Tool-aware prompting: Explicitly instruct when/how to call tools, APIs, retrievers. Enables grounding, action, environment interaction.
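The patterns above compose mechanically. A minimal sketch of assembling a few-shot, chain-of-thought prompt; the instruction, demonstration, and question are illustrative placeholders, and the resulting string would be sent to whatever model API the system uses:

```python
# Assemble a few-shot, chain-of-thought prompt from structured parts.
# All task text and examples below are illustrative placeholders.

def build_prompt(instruction, demos, question):
    """Combine instruction prompting, few-shot demos, and a CoT cue."""
    parts = [instruction, ""]
    for d in demos:
        # Each demo shows input -> reasoning -> output (few-shot + CoT).
        parts.append(f"Q: {d['q']}\nReasoning: {d['cot']}\nA: {d['a']}\n")
    parts.append(f"Q: {question}\nReasoning:")  # cue step-by-step reasoning
    return "\n".join(parts)

prompt = build_prompt(
    instruction="Answer with a number only. Think step by step first.",
    demos=[{"q": "2 apples + 3 apples?", "cot": "2 + 3 = 5", "a": "5"}],
    question="4 pens + 6 pens?",
)
```

Keeping the prompt a structured artifact (instruction, demos, query) rather than one hand-edited string is what later makes closed-loop optimization tractable.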
21.2 Fundamental Limitations
- No learning from failures
- Brittle to distribution shift
- Cannot accumulate domain heuristics
- Single-string bottleneck
- Difficult to evaluate rigorously
These limitations motivate closed-loop optimization.
22 Closed-Loop Prompt Optimization
22.1 Core Reframing
Instead of asking: “What is the best prompt?”
We ask: “How should the prompt change given what just failed?”
Treat prompts as learnable artifacts, refined using feedback from execution rather than manual editing.
22.2 GEPA: Reflective Prompt Optimization
GEPA (Genetic-Pareto prompt evolution) formalizes closed-loop optimization using natural language feedback instead of scalar rewards.
Algorithm:
Execute system on tasks
Collect execution traces (reasoning, tool calls, errors)
Use language-level reflection to diagnose failures
Propose prompt updates
Maintain Pareto frontier of candidate prompts
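The loop can be sketched with stubbed components. In a real system, `evaluate` would execute the agent and `reflect` would be an LLM diagnosing traces; here both are toy stand-ins, and the task "requirements" are hypothetical:

```python
# Sketch of a GEPA-style reflective loop with a Pareto frontier.
# Tasks, scoring, and "reflection" are toy stand-ins for LLM
# execution traces and language-level failure diagnosis.

TASKS = ["cite sources", "use JSON", "be concise"]  # hypothetical requirements

def evaluate(prompt):
    """Per-task score vector: 1 if the prompt addresses the requirement."""
    return tuple(1 if t in prompt else 0 for t in TASKS)

def reflect(prompt, scores):
    """Toy reflection: propose an edit targeting the first failed task."""
    for task, s in zip(TASKS, scores):
        if s == 0:
            return prompt + f" Always {task}."
    return prompt

def dominated(a, b):
    """True if score vector a is dominated by b."""
    return all(y >= x for x, y in zip(a, b)) and any(y > x for x, y in zip(a, b))

seed = "You are a helpful assistant."
frontier = {seed: evaluate(seed)}
for _ in range(4):  # optimization rounds
    parent = max(frontier, key=lambda p: sum(frontier[p]))
    child = reflect(parent, frontier[parent])
    frontier[child] = evaluate(child)
    # Keep only non-dominated candidates (Pareto pruning).
    frontier = {p: s for p, s in frontier.items()
                if not any(dominated(s, other) for other in frontier.values())}

best = max(frontier, key=lambda p: sum(frontier[p]))
```

The Pareto set matters because different candidates can excel on different tasks; pruning only dominated candidates preserves that diversity instead of collapsing to one "best" prompt.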
Key insight: Natural language feedback achieves high sample efficiency, outperforming RL in many settings.
Empirical gains:
Orders of magnitude fewer rollouts than RL
Language-level credit assignment
Modular compatibility with agents and RAG
22.3 Production Frameworks
DSPy:
Declarative LLM programs
Optimizers: GEPA, MIPROv2
Explicit module separation (retrievers, generators, critics)
Regression testing over prompt versions
Production-ready features:
Trace-based evaluation
Offline optimization with frozen deployment artifacts
Reproducible and auditable
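The regression-testing discipline can be approximated without the framework. A sketch with hypothetical version names and a stubbed scorer standing in for replaying logged traces against frozen deployment artifacts:

```python
# Regression-test prompt versions against a frozen eval set.
# Version names, questions, and the scoring rule are illustrative.

EVAL_SET = [("refund policy?", "30 days"), ("support email?", "help@example.com")]

def score(version, question, expected):
    """Stub for trace-based evaluation; real systems replay logged traces."""
    answers = {  # frozen answers each deployed artifact produced
        "v1": {"refund policy?": "30 days", "support email?": "unknown"},
        "v2": {"refund policy?": "30 days", "support email?": "help@example.com"},
    }
    return 1.0 if answers[version].get(question) == expected else 0.0

def accuracy(version):
    return sum(score(version, q, a) for q, a in EVAL_SET) / len(EVAL_SET)

baseline, candidate = accuracy("v1"), accuracy("v2")
assert candidate >= baseline, "regression: candidate underperforms deployed version"
```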
22.4 Structural Limitation: Context Collapse
Despite strengths, prompt-only optimization suffers from:
Brevity bias: Iterative rewriting converges toward short, generic prompts
Context collapse: Rich contexts (18k tokens) compress to minimal instructions (100 tokens), losing critical domain detail
Performance degradation: Accuracy can drop below the no-context baseline
This is not a bug; it is a structural failure mode of monolithic prompt rewriting.
23 Agentic Context Engineering (ACE)
23.1 Core Reframing
Contexts are not concise prompts. They are evolving playbooks.
Instead of rewriting a single prompt, ACE:
Accumulates reusable strategies
Preserves failure modes
Organizes domain knowledge explicitly
Updates context incrementally during execution
23.2 Agentic Architecture
ACE decomposes context adaptation into three roles:
Generator: Executes tasks in environment
Reflector: Extracts lessons from success/failure traces
Curator: Integrates structured updates into context
This separation:
Avoids overload on single model
Improves signal quality
Mirrors human learning loops
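The three roles can be wired as plain functions around a shared context. Everything here (the task, the failure signal, the lesson text) is a toy stand-in for LLM-backed components:

```python
# Minimal Generator / Reflector / Curator loop over a shared context.
# The task, failure detection, and lesson text are toy stand-ins.

def generator(context, task):
    """Execute the task; succeed only if the context carries the needed rule."""
    ok = any("retry" in bullet for bullet in context)
    return {"task": task, "success": ok}

def reflector(trace):
    """Extract a lesson from a failure trace (stub for LLM reflection)."""
    if not trace["success"]:
        return "On transient API errors, retry with backoff."
    return None

def curator(context, lesson):
    """Integrate the lesson as a new bullet; dedupe deterministically."""
    if lesson and lesson not in context:
        context.append(lesson)
    return context

context = ["Prefer official APIs over scraping."]
for _ in range(2):  # two episodes: fail, then succeed using the new rule
    trace = generator(context, "fetch order status")
    context = curator(context, reflector(trace))
```

The separation is visible even in the stub: the generator never judges itself, and the curator never invents content, so each signal stays clean.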
23.3 Delta-Based Context Updates
A central ACE innovation is incremental context evolution.
Instead of: “Rewrite the whole prompt”
ACE performs:
Localized bullet-level updates
Explicit tracking of helpful vs harmful rules
Deterministic merging (non-LLM)
Grow-and-refine vs rewrite-and-collapse
Benefits:
Scalability to long contexts
Resistance to collapse
Lower latency and cost
Interpretability
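A sketch of the delta mechanism: bullets carry helpful/harmful counters, updates are localized, and merging is deterministic code rather than an LLM rewrite. The IDs, bullet texts, and pruning rule are illustrative choices:

```python
# Bullet-level context store with deterministic delta merging.
# IDs, counters, and the pruning rule are illustrative choices.

context = {}  # id -> {"text": ..., "helpful": n, "harmful": n}

def apply_delta(delta):
    """Merge a localized update; no whole-context rewrite, no LLM call."""
    entry = context.setdefault(
        delta["id"], {"text": delta.get("text", ""), "helpful": 0, "harmful": 0})
    entry["helpful"] += delta.get("helpful", 0)
    entry["harmful"] += delta.get("harmful", 0)

def refine():
    """Grow-and-refine: drop only bullets that proved net harmful."""
    for k in [k for k, v in context.items() if v["harmful"] > v["helpful"]]:
        del context[k]

apply_delta({"id": "b1", "text": "Paginate API results past 100 items.", "helpful": 1})
apply_delta({"id": "b2", "text": "Guess missing fields when unsure.", "harmful": 2})
apply_delta({"id": "b1", "helpful": 1})  # reinforcement, not duplication
refine()
```

Because merges touch single bullets, the context grows monotonically except where evidence says a rule hurts, which is precisely what resists collapse.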
Empirical results:
10–17% improvement over GEPA on agent tasks
Matches GPT-4 agents using smaller open models
Rich contexts (18k tokens) outperform compressed versions
23.4 Why Long Contexts Work
ACE demonstrates that LLMs benefit from rich, saturated contexts and can self-select relevance at inference time. Modern KV cache reuse means longer contexts do not imply proportional serving cost.
Key principle: Do not compress knowledge prematurely; let the model filter.
24 Evaluation Design for Agents
24.1 Why Traditional LLM Eval Fails
Problems with single-turn accuracy:
Ignores trajectories
Textual plausibility \(\neq\) correctness
No notion of environment interaction
Agents must be evaluated on behavior, not prose.
24.2 Environment-Based Evaluation: AppWorld
AppWorld evaluates agents in executable environments with realistic APIs.
Key metrics:
Task Goal Completion (TGC): End-to-end success
Scenario Goal Completion (SGC): Satisfaction of all constraints
Critical properties:
Environment-based ground truth
Tool correctness over reasoning fluency
Multi-step evaluation
Lesson: Evaluation design drives architecture.
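Both metrics reduce to aggregations over per-task checks. A sketch assuming a hypothetical results structure (scenario \(\to\) task \(\to\) list of boolean check outcomes):

```python
# Compute TGC / SGC from per-task check results, AppWorld-style.
# The scenarios, tasks, and check booleans below are illustrative.

results = {
    "trip_planning": {"book_flight": [True, True], "book_hotel": [True, False]},
    "payments":      {"split_bill": [True, True]},
}

# A task counts as completed only if every programmatic check passes.
task_pass = {(s, t): all(checks)
             for s, tasks in results.items() for t, checks in tasks.items()}

tgc = sum(task_pass.values()) / len(task_pass)  # per-task completion rate
# A scenario counts only if all of its tasks are completed.
sgc = sum(all(task_pass[(s, t)] for t in results[s]) for s in results) / len(results)
```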
24.3 Building AppWorld-Style Eval In-House
You do not need AppWorld to apply its principles.
24.3.1 Wrap Actions in Verifiable Systems
Every agent action should yield:
Structured outputs
Success/failure signals
Invariant checks
Examples:
API status + schema validation
Database assertions
Unit tests for generated code
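One way to enforce this is a wrapper that turns any tool call into a structured, checked result. The schema format, invariant, and `create_order` tool are illustrative:

```python
# Wrap an agent action so every call yields structured output,
# a success signal, and an invariant check. Schema is illustrative.

def validate(payload, schema):
    """Minimal schema check: required keys with expected types."""
    return all(isinstance(payload.get(k), t) for k, t in schema.items())

def verified_action(fn, schema, invariant):
    def wrapper(*args, **kwargs):
        try:
            out = fn(*args, **kwargs)
        except Exception as e:  # tool errors become explicit failure signals
            return {"ok": False, "error": str(e), "output": None}
        ok = validate(out, schema) and invariant(out)
        return {"ok": ok, "error": None if ok else "check failed", "output": out}
    return wrapper

def create_order(item, qty):
    """Hypothetical tool; stands in for a real API client call."""
    return {"order_id": "o-123", "qty": qty}

checked = verified_action(create_order,
                          schema={"order_id": str, "qty": int},
                          invariant=lambda o: o["qty"] > 0)
result = checked("widget", 2)
```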
24.3.2 Define Task-Level Success Criteria
Replace “good answer” with:
Invariants
Constraints
Completion conditions
This enables binary, auditable evaluation.
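A sketch of encoding success criteria as explicit boolean checks; the scheduling task and state fields are hypothetical:

```python
# Express task success as explicit, binary checks instead of "good answer".
# The meeting-scheduling task and state fields are illustrative.

def task_success(state):
    invariants = [state["calendar_conflicts"] == 0]            # must always hold
    constraints = [state["duration_min"] <= 60]                # task-specific limits
    completion = [state["invite_sent"], state["room_booked"]]  # done conditions
    return all(invariants + constraints + completion)

state = {"calendar_conflicts": 0, "duration_min": 45,
         "invite_sent": True, "room_booked": True}
```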
24.3.3 Log Full Trajectories
Store:
Tool calls
Intermediate states
Environment feedback
Final outcomes
These traces power debugging, optimization, and context adaptation.
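A minimal trajectory logger writing JSON lines; the field names and the in-memory sink (standing in for an append-only file or log store) are illustrative:

```python
# Log full trajectories as JSON lines so traces can drive debugging,
# evaluation, and context adaptation. Field names are illustrative.
import io
import json
import time

def log_step(sink, step_type, payload):
    sink.write(json.dumps({"ts": time.time(), "type": step_type, **payload}) + "\n")

trace = io.StringIO()  # stands in for a file or log store
log_step(trace, "tool_call", {"tool": "search_orders", "args": {"q": "late"}})
log_step(trace, "env_feedback", {"status": 200, "rows": 3})
log_step(trace, "outcome", {"success": True, "steps": 2})

steps = [json.loads(line) for line in trace.getvalue().splitlines()]
```

One record per step, machine-parseable, keeps the same trace usable by the Reflector, the evaluator, and a human debugger.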
24.3.4 Separate Execution, Evaluation, and Adaptation
Avoid monolithic agents that act, judge, and rewrite context simultaneously.
Use agentic separation: Generator / Reflector / Curator.
24.3.5 Track Production-Relevant Metrics
Production KPIs:
Task success rate
Recovery from failure
Steps to completion
Cost per successful task
Regression across releases
Avoid: Token-level or stylistic metrics.
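These KPIs fall out of the logged outcomes directly. A sketch over illustrative episode records:

```python
# Compute production KPIs from logged episode outcomes.
# The episode records and cost figures are illustrative.

episodes = [
    {"success": True,  "steps": 4, "cost": 0.03, "recovered_from_error": False},
    {"success": True,  "steps": 7, "cost": 0.05, "recovered_from_error": True},
    {"success": False, "steps": 9, "cost": 0.06, "recovered_from_error": False},
]

wins = [e for e in episodes if e["success"]]
kpis = {
    "task_success_rate": len(wins) / len(episodes),
    "recovery_rate": sum(e["recovered_from_error"] for e in episodes) / len(episodes),
    "avg_steps_to_completion": sum(e["steps"] for e in wins) / len(wins),
    # Total spend divided by successes: failed episodes still cost money.
    "cost_per_successful_task": sum(e["cost"] for e in episodes) / len(wins),
}
```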
25 Comparison: Context Engineering vs Alternatives
| Method | Adaptation | Cost | Interpretable | Sample Eff. | Persistent |
|---|---|---|---|---|---|
| Static prompt | None | None | High | N/A | No |
| GEPA | Closed-loop | Low | High | High | No |
| ACE | Online | Low | High | Very High | Yes |
| RL (PPO/DPO) | Offline | Very High | Low | Low | Yes |
| Fine-tuning | Offline | High | Low | Medium | Yes |
Key insight: Context engineering achieves adaptation without weight updates, making it fast, interpretable, and sample-efficient.
26 Design Principles for Production
26.1 Do Not Compress Knowledge Prematurely
Rich, itemized contexts outperform concise instructions. Let the model filter relevance.
26.2 Preserve Domain Heuristics Explicitly
Use structured formats:
Bulleted strategies
Failure mode catalog
Tool usage constraints
26.3 Separate Execution, Reflection, and Curation
Avoid monolithic loops. Use modular architecture with clear responsibilities.
26.4 Treat Context as State, Not Text
Contexts evolve over time; version control, regression testing, and rollback are essential.
26.5 Engineer Feedback Channels as Carefully as Prompts
Quality of reflection depends on quality of traces. Instrument environment interactions.
27 Interview Summary
Context Engineering: One-Paragraph Synthesis
Context engineering extends prompt engineering to production agents. Static techniques (few-shot, CoT, self-consistency) establish baselines but fail under distribution shift. Closed-loop optimization (GEPA) treats prompts as learnable artifacts refined by execution feedback, achieving sample-efficient improvement via natural language reflection. Agentic Context Engineering (ACE) generalizes this to structured, persistent contexts that evolve incrementally via a Generator-Reflector-Curator architecture, avoiding context collapse while accumulating domain knowledge. Proper evaluation requires environment-grounded metrics (AppWorld-style task completion) rather than textual quality, implemented via verifiable actions, trajectory logging, and production-relevant KPIs. This approach enables runtime adaptation without retraining: fast, interpretable, and cost-efficient.
Key Interview Points
Evolution: Static prompts \(\to\) closed-loop optimization \(\to\) agentic context engineering
GEPA: Uses language feedback instead of scalar rewards for prompt optimization
ACE: Incremental, structured updates prevent context collapse
Production framework: DSPy provides declarative programs with optimizers
Evaluation: Environment-based (AppWorld) over textual plausibility
Trade-off: Context engineering is faster and more interpretable than fine-tuning, but less persistent than weight updates
28 Key References
Primary Sources
GEPA: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457. Closed-loop prompt optimization using natural language feedback.
ACE: Zhang et al. (2025). “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models.” Generator-Reflector-Curator architecture with delta-based updates.
AppWorld: Trivedi et al. (2024). “AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents.” ACL 2024. Environment-based evaluation with TGC/SGC metrics.
DSPy Framework: Khattab et al. (2024). “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines.” Declarative LLM programs with optimizers (GEPA, MIPROv2).