19 Chapter 18: Context Engineering for Production Agents

20 From Prompt Engineering to Context Engineering

Early LLM applications relied on prompt engineering: carefully crafted instructions for frozen models at inference time. While effective for single-turn tasks, this paradigm proves insufficient for agents, tool use, long-horizon reasoning, and production reliability.

Modern systems increasingly rely on context engineering: the systematic design, optimization, and evolution of all inputs to an LLM (prompts, demonstrations, memory, tools, and feedback) without modifying model weights.

Why this matters for production:

  • Distribution shift requires runtime adaptation

  • Multi-step agents need accumulated domain knowledge

  • No retraining cost or latency

  • Human-interpretable updates

  • Compatible with long-context infrastructure

21 Static Prompt Engineering: Core Techniques

21.1 Essential Patterns

Instruction prompting: Clear task specification, constraints, formatting requirements. Reduces ambiguity, enforces structure.

Few-shot prompting: Provide demonstrations of input \(\to\) output behavior. Induces task understanding via in-context learning.

Chain-of-Thought (CoT): Encourage intermediate reasoning steps before producing answer. Improves multi-step reasoning and compositional tasks.

Self-consistency: Sample multiple reasoning paths, aggregate answers. Reduces variance in reasoning-heavy tasks.

Decomposition: Break tasks into subtasks (plan \(\to\) execute \(\to\) verify). Improves controllability and debuggability.

Tool-aware prompting: Explicitly instruct when/how to call tools, APIs, retrievers. Enables grounding, action, environment interaction.
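These patterns compose into a single context. A minimal sketch of assembling such a prompt (the instruction text, demo format, and chain-of-thought cue are illustrative choices, not a fixed API):

```python
def build_prompt(instruction, demos, question, use_cot=True):
    """Assemble a static prompt: instruction + few-shot demos + CoT cue.

    `instruction`, `demos`, and the CoT cue are illustrative placeholders.
    """
    parts = [instruction.strip()]
    for inp, out in demos:  # few-shot: input -> output demonstrations
        parts.append(f"Input: {inp}\nOutput: {out}")
    cue = "Let's think step by step." if use_cot else ""
    parts.append(f"Input: {question}\n{cue}\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt(
    instruction="Answer with a single integer.",
    demos=[("2 + 2", "4"), ("10 - 3", "7")],
    question="6 * 7",
)
```

The same helper covers instruction prompting (demos empty, `use_cot=False`), few-shot, and CoT; decomposition and tool-aware prompting add further sections to the instruction.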

21.2 Fundamental Limitations

  • No learning from failures

  • Brittle to distribution shift

  • Cannot accumulate domain heuristics

  • Single-string bottleneck

  • Difficult to evaluate rigorously

These limitations motivate closed-loop optimization.

22 Closed-Loop Prompt Optimization

22.1 Core Reframing

Instead of asking: “What is the best prompt?”

We ask: “How should the prompt change given what just failed?”

Treat prompts as learnable artifacts, refined using feedback from execution rather than manual editing.

22.2 GEPA: Reflective Prompt Optimization

GEPA (Genetic-Pareto) formalizes closed-loop prompt optimization using natural language feedback instead of scalar rewards.

Algorithm:

  1. Execute system on tasks

  2. Collect execution traces (reasoning, tool calls, errors)

  3. Use language-level reflection to diagnose failures

  4. Propose prompt updates

  5. Maintain Pareto frontier of candidate prompts

Key insight: Natural language feedback achieves high sample efficiency, outperforming RL in many settings.

Empirical gains:

  • Orders of magnitude fewer rollouts than RL

  • Language-level credit assignment

  • Modular compatibility with agents and RAG

22.3 Production Frameworks

DSPy:

  • Declarative LLM programs

  • Optimizers: GEPA, MIPROv2

  • Explicit module separation (retrievers, generators, critics)

  • Regression testing over prompt versions

Production-ready features:

  • Trace-based evaluation

  • Offline optimization with frozen deployment artifacts

  • Reproducible and auditable

22.4 Structural Limitation: Context Collapse

Despite strengths, prompt-only optimization suffers from:

Brevity bias: Iterative rewriting converges toward short, generic prompts

Context collapse: Rich contexts (18k tokens) compress to minimal instructions (100 tokens), losing critical domain detail

Performance degradation: Accuracy can drop below baseline with no context

This is not a bug; it is a structural failure mode of monolithic prompt rewriting.

23 Agentic Context Engineering (ACE)

23.1 Core Reframing

Contexts are not concise prompts. They are evolving playbooks.

Instead of rewriting a single prompt, ACE:

  • Accumulates reusable strategies

  • Preserves failure modes

  • Organizes domain knowledge explicitly

  • Updates context incrementally during execution

23.2 Agentic Architecture

ACE decomposes context adaptation into three roles:

Generator: Executes tasks in environment

Reflector: Extracts lessons from success/failure traces

Curator: Integrates structured updates into context

This separation:

  • Avoids overload on single model

  • Improves signal quality

  • Mirrors human learning loops

23.3 Delta-Based Context Updates

A central ACE innovation is incremental context evolution.

Instead of: “Rewrite the whole prompt”

ACE performs:

  • Localized bullet-level updates

  • Explicit tracking of helpful vs harmful rules

  • Deterministic merging (non-LLM)

  • Grow-and-refine vs rewrite-and-collapse
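A minimal sketch of what bullet-level deltas and deterministic merging could look like; the `Bullet` fields and the keep/drop rule are illustrative assumptions, not the exact ACE bookkeeping:

```python
from dataclasses import dataclass

@dataclass
class Bullet:
    text: str            # one reusable strategy or failure mode
    helpful: int = 0     # times this rule contributed to success
    harmful: int = 0     # times it was implicated in failure

def merge_delta(context, delta):
    """Deterministically merge new bullets into the context (no LLM call).

    Duplicate text increments counters instead of rewriting the whole
    prompt; nothing is summarized, so detail is never collapsed away.
    """
    by_text = {b.text: b for b in context}
    for d in delta:
        if d.text in by_text:
            by_text[d.text].helpful += d.helpful
            by_text[d.text].harmful += d.harmful
        else:
            by_text[d.text] = d
    # grow-and-refine: drop only rules that are consistently harmful
    return [b for b in by_text.values() if b.helpful >= b.harmful]
```

Because the merge is deterministic, it costs no model calls, scales to long contexts, and leaves an auditable trail of which rules were added or retired.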

Benefits:

  • Scalability to long contexts

  • Resistance to collapse

  • Lower latency and cost

  • Interpretability

Empirical results:

  • 10–17% improvement over GEPA on agent tasks

  • Matches GPT-4 agents using smaller open models

  • Rich contexts (18k tokens) outperform compressed versions

Note

Why Long Contexts Work

ACE demonstrates that LLMs benefit from rich, saturated contexts and can self-select relevance at inference time. Modern KV cache reuse means longer contexts do not imply proportional serving cost.

Key principle: Do not compress knowledge prematurely; let the model filter.

24 Evaluation Design for Agents

24.1 Why Traditional LLM Eval Fails

Problems with single-turn accuracy:

  • Ignores trajectories

  • Textual plausibility \(\neq\) correctness

  • No notion of environment interaction

Agents must be evaluated on behavior, not prose.

24.2 Environment-Based Evaluation: AppWorld

AppWorld evaluates agents in executable environments with realistic APIs.

Key metrics:

  • Task Goal Completion (TGC): End-to-end success

  • Scenario Goal Completion (SGC): Satisfaction of all constraints

Critical properties:

  • Environment-based ground truth

  • Tool correctness over reasoning fluency

  • Multi-step evaluation
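Under the definitions above, both metrics reduce to simple aggregates over per-task results. A sketch, assuming each result records its scenario id, a completion flag, and a constraint check:

```python
def tgc(results):
    """Task Goal Completion: fraction of tasks completed end-to-end."""
    return sum(r["completed"] for r in results) / len(results)

def sgc(results):
    """Scenario Goal Completion: fraction of scenarios whose tasks
    ALL satisfied their constraints."""
    scenarios = {}
    for r in results:
        scenarios.setdefault(r["scenario"], []).append(r["constraints_ok"])
    return sum(all(checks) for checks in scenarios.values()) / len(scenarios)
```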

Lesson: Evaluation design drives architecture.

24.3 Building AppWorld-Style Eval In-House

You do not need AppWorld to apply its principles.

24.3.1 1. Wrap Actions in Verifiable Systems

Every agent action should yield:

  • Structured outputs

  • Success/failure signals

  • Invariant checks

Examples:

  • API status + schema validation

  • Database assertions

  • Unit tests for generated code
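One way to impose this discipline is to wrap every action so it returns a structured record rather than bare text; the result shape and invariant hooks below are illustrative:

```python
def run_action(action, *args, invariants=()):
    """Execute an action and return a structured record with an explicit
    success signal and named invariant checks."""
    try:
        output = action(*args)
    except Exception as e:
        return {"ok": False, "output": None, "error": str(e), "violations": []}
    violations = [name for name, check in invariants if not check(output)]
    return {"ok": not violations, "output": output, "error": None,
            "violations": violations}

# usage: a transfer must leave a non-negative balance
result = run_action(
    lambda: {"balance": 42},
    invariants=[("non_negative_balance", lambda o: o["balance"] >= 0)],
)
```

The same wrapper pattern covers API schema validation (an invariant over the response body) and database assertions (an invariant querying post-state).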

24.3.2 2. Define Task-Level Success Criteria

Replace “good answer” with:

  • Invariants

  • Constraints

  • Completion conditions

This enables binary, auditable evaluation.

24.3.3 3. Log Full Trajectories

Store:

  • Tool calls

  • Intermediate states

  • Environment feedback

  • Final outcomes

These traces power debugging, optimization, and context adaptation.
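A minimal trajectory logger along these lines (the record schema is an assumption, not a standard):

```python
import json
import time

class TrajectoryLog:
    """Append-only record of one agent episode: tool calls, intermediate
    states, environment feedback, and the final outcome."""

    def __init__(self, task_id):
        self.record = {"task_id": task_id, "steps": [], "outcome": None}

    def log_step(self, tool, args, state, feedback):
        self.record["steps"].append({
            "t": time.time(), "tool": tool, "args": args,
            "state": state, "feedback": feedback,
        })

    def finish(self, outcome):
        self.record["outcome"] = outcome
        return json.dumps(self.record)  # serialized for storage and replay
```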

24.3.4 4. Separate Execution, Evaluation, and Adaptation

Avoid monolithic agents that act, judge, and rewrite context simultaneously.

Use agentic separation: Generator / Reflector / Curator.

24.3.5 5. Track Production-Relevant Metrics

Production KPIs:

  • Task success rate

  • Recovery from failure

  • Steps to completion

  • Cost per successful task

  • Regression across releases

Avoid: Token-level or stylistic metrics.
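These KPIs fall out of the logged trajectories. A sketch, assuming each episode records a success flag, whether a failure occurred mid-run, step count, and cost:

```python
def production_kpis(episodes):
    """Aggregate production KPIs from logged episodes.

    The episode dict shape (success / had_failure / steps / cost) is an
    assumed logging convention, not a standard schema.
    """
    n = len(episodes)
    successes = [e for e in episodes if e["success"]]
    failures_seen = [e for e in episodes if e["had_failure"]]
    return {
        "task_success_rate": len(successes) / n,
        # recovery: of episodes that hit a failure, how many still succeeded
        "recovery_rate": (sum(e["success"] for e in failures_seen)
                          / len(failures_seen)) if failures_seen else None,
        "avg_steps_to_completion": (sum(e["steps"] for e in successes)
                                    / len(successes)) if successes else None,
        # total spend amortized over successful tasks only
        "cost_per_successful_task": (sum(e["cost"] for e in episodes)
                                     / len(successes)) if successes else None,
    }
```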

25 Comparison: Context Engineering vs Alternatives

Method          Adaptation    Cost        Interpretable   Sample Eff.   Persistent
Static prompt   None          None        High            N/A           No
GEPA            Closed-loop   Low         High            High          No
ACE             Online        Low         High            Very High     Yes
RL (PPO/DPO)    Offline       Very High   Low             Low           Yes
Fine-tuning     Offline       High        Low             Medium        Yes

Key insight: Context engineering achieves adaptation without weight updates; it is fast, interpretable, and sample-efficient.

26 Design Principles for Production

26.1 Do Not Compress Knowledge Prematurely

Rich, itemized contexts outperform concise instructions. Let the model filter relevance.

26.2 Preserve Domain Heuristics Explicitly

Use structured formats:

  • Bulleted strategies

  • Failure mode catalog

  • Tool usage constraints

26.3 Separate Execution, Reflection, and Curation

Avoid monolithic loops. Use modular architecture with clear responsibilities.

26.4 Treat Context as State, Not Text

Contexts evolve over time; version control, regression testing, and rollback are essential.

26.5 Engineer Feedback Channels as Carefully as Prompts

Quality of reflection depends on quality of traces. Instrument environment interactions.

27 Interview Summary

Note

Context Engineering: One-Paragraph Synthesis

Context engineering extends prompt engineering to production agents. Static techniques (few-shot, CoT, self-consistency) establish baselines but fail under distribution shift. Closed-loop optimization (GEPA) treats prompts as learnable artifacts refined with execution feedback, achieving sample-efficient improvement via natural language reflection. Agentic Context Engineering (ACE) generalizes this to structured, persistent contexts that evolve incrementally via a Generator-Reflector-Curator architecture, avoiding context collapse while accumulating domain knowledge. Proper evaluation requires environment-grounded metrics (AppWorld-style task completion) rather than textual quality, implemented via verifiable actions, trajectory logging, and production-relevant KPIs. This approach enables runtime adaptation without retraining: fast, interpretable, and cost-efficient.

Note

Key Interview Points

  • Evolution: Static prompts \(\to\) closed-loop optimization \(\to\) agentic context engineering

  • GEPA: Uses language feedback instead of scalar rewards for prompt optimization

  • ACE: Incremental, structured updates prevent context collapse

  • Production framework: DSPy provides declarative programs with optimizers

  • Evaluation: Environment-based (AppWorld) over textual plausibility

  • Trade-off: Context engineering is faster and more interpretable than fine-tuning, but less persistent than weight updates

28 Key References

Note

Primary Sources

  • GEPA: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457 – Closed-loop prompt optimization using natural language feedback

  • ACE: Zhang et al. (2025). “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models.” – Generator-Reflector-Curator architecture with delta-based updates

  • AppWorld: Trivedi et al. (2024). “AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents.” – Environment-based evaluation with TGC/SGC metrics

  • DSPy Framework: Khattab et al. – Declarative LLM programs with optimizers (GEPA, MIPROv2)