19 Chapter 18: Context Engineering for Production Agents

20 From Prompt Engineering to Context Engineering

Early LLM applications relied on prompt engineering: carefully crafted instructions for frozen models at inference time. While effective for single-turn tasks, this paradigm proves insufficient for agents, tool use, long-horizon reasoning, and production reliability.

Modern systems increasingly rely on context engineering: the systematic design, optimization, and evolution of all inputs to an LLM (prompts, demonstrations, memory, tools, and feedback) without modifying model weights.

Why this matters for production:

  • Distribution shift requires runtime adaptation

  • Multi-step agents need accumulated domain knowledge

  • No retraining cost or latency

  • Human-interpretable updates

  • Compatible with long-context infrastructure

21 Static Prompt Engineering: Core Techniques

21.1 Essential Patterns

Instruction prompting: Clear task specification, constraints, formatting requirements. Reduces ambiguity, enforces structure.

Few-shot prompting: Provide demonstrations of input \(\to\) output behavior. Induces task understanding via in-context learning.

Chain-of-Thought (CoT): Encourage intermediate reasoning steps before producing answer. Improves multi-step reasoning and compositional tasks.

Self-consistency: Sample multiple reasoning paths, aggregate answers. Reduces variance in reasoning-heavy tasks.

Decomposition: Break tasks into subtasks (plan \(\to\) execute \(\to\) verify). Improves controllability and debuggability.

Tool-aware prompting: Explicitly instruct when/how to call tools, APIs, retrievers. Enables grounding, action, environment interaction.
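These patterns compose into a single context. A minimal sketch of assembling such a prompt (the instruction text, demo format, and chain-of-thought cue are illustrative choices, not a fixed API):

```python
def build_prompt(instruction, demos, question, use_cot=True):
    """Assemble a static prompt: instruction + few-shot demos + CoT cue.

    `instruction`, `demos`, and the CoT cue are illustrative placeholders.
    """
    parts = [instruction.strip()]
    for inp, out in demos:  # few-shot: input -> output demonstrations
        parts.append(f"Input: {inp}\nOutput: {out}")
    cue = "Let's think step by step." if use_cot else ""
    parts.append(f"Input: {question}\n{cue}\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt(
    instruction="Answer with a single integer.",
    demos=[("2 + 2", "4"), ("10 - 3", "7")],
    question="6 * 7",
)
```

The same helper covers instruction prompting (demos empty, `use_cot=False`), few-shot, and CoT; decomposition and tool-aware prompting add further sections to the instruction.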

21.2 Fundamental Limitations

  • No learning from failures

  • Brittle to distribution shift

  • Cannot accumulate domain heuristics

  • Single-string bottleneck

  • Difficult to evaluate rigorously

These limitations motivate closed-loop optimization.

22 Closed-Loop Prompt Optimization

22.1 Core Reframing

Instead of asking: “What is the best prompt?”

We ask: “How should the prompt change given what just failed?”

Treat prompts as learnable artifacts, refined using feedback from execution rather than manual editing.

22.2 GEPA: Reflective Prompt Optimization

GEPA (Genetic-Pareto) formalizes closed-loop prompt optimization using natural language feedback instead of scalar rewards.

Algorithm:

  1. Execute system on tasks

  2. Collect execution traces (reasoning, tool calls, errors)

  3. Use language-level reflection to diagnose failures

  4. Propose prompt updates

  5. Maintain Pareto frontier of candidate prompts

Key insight: Natural language feedback achieves high sample efficiency, outperforming RL in many settings.

Empirical gains:

  • Orders of magnitude fewer rollouts than RL

  • Language-level credit assignment

  • Modular compatibility with agents and RAG

22.3 Production Frameworks

DSPy:

  • Declarative LLM programs

  • Optimizers: GEPA, MIPROv2

  • Explicit module separation (retrievers, generators, critics)

  • Regression testing over prompt versions

Production-ready features:

  • Trace-based evaluation

  • Offline optimization with frozen deployment artifacts

  • Reproducible and auditable

22.4 Structural Limitation: Context Collapse

Despite strengths, prompt-only optimization suffers from:

Brevity bias: Iterative rewriting converges toward short, generic prompts

Context collapse: Rich contexts (18k tokens) compress to minimal instructions (100 tokens), losing critical domain detail

Performance degradation: Accuracy can drop below baseline with no context

This is not a bug; it is a structural failure mode of monolithic prompt rewriting.

23 Agentic Context Engineering (ACE)

23.1 Core Reframing

Contexts are not concise prompts. They are evolving playbooks.

Instead of rewriting a single prompt, ACE:

  • Accumulates reusable strategies

  • Preserves failure modes

  • Organizes domain knowledge explicitly

  • Updates context incrementally during execution

23.2 Agentic Architecture

ACE decomposes context adaptation into three roles:

Generator: Executes tasks in environment

Reflector: Extracts lessons from success/failure traces

Curator: Integrates structured updates into context

This separation:

  • Avoids overload on single model

  • Improves signal quality

  • Mirrors human learning loops

23.3 Delta-Based Context Updates

A central ACE innovation is incremental context evolution.

Instead of: “Rewrite the whole prompt”

ACE performs:

  • Localized bullet-level updates

  • Explicit tracking of helpful vs harmful rules

  • Deterministic merging (non-LLM)

  • Grow-and-refine vs rewrite-and-collapse
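A minimal sketch of what bullet-level deltas and deterministic merging could look like; the `Bullet` fields and the keep/drop rule are illustrative assumptions, not the exact ACE bookkeeping:

```python
from dataclasses import dataclass

@dataclass
class Bullet:
    text: str            # one reusable strategy or failure mode
    helpful: int = 0     # times this rule contributed to success
    harmful: int = 0     # times it was implicated in failure

def merge_delta(context, delta):
    """Deterministically merge new bullets into the context (no LLM call).

    Duplicate text increments counters instead of rewriting the whole
    prompt; nothing is summarized, so detail is never collapsed away.
    """
    by_text = {b.text: b for b in context}
    for d in delta:
        if d.text in by_text:
            by_text[d.text].helpful += d.helpful
            by_text[d.text].harmful += d.harmful
        else:
            by_text[d.text] = d
    # grow-and-refine: drop only rules that are consistently harmful
    return [b for b in by_text.values() if b.helpful >= b.harmful]
```

Because the merge is deterministic, it costs no model calls, scales to long contexts, and leaves an auditable trail of which rules were added or retired.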

Benefits:

  • Scalability to long contexts

  • Resistance to collapse

  • Lower latency and cost

  • Interpretability

Empirical results:

  • 10–17% improvement over GEPA on agent tasks

  • Matches GPT-4 agents using smaller open models

  • Rich contexts (18k tokens) outperform compressed versions

Note

Why Long Contexts Work

ACE demonstrates that LLMs benefit from rich, saturated contexts and can self-select relevance at inference time. Modern KV cache reuse means longer contexts do not imply proportional serving cost.

Key principle: Do not compress knowledge prematurely; let the model filter.

24 Evaluation Design for Agents

24.1 Why Traditional LLM Eval Fails

Problems with single-turn accuracy:

  • Ignores trajectories

  • Textual plausibility \(\neq\) correctness

  • No notion of environment interaction

Agents must be evaluated on behavior, not prose.

24.2 Environment-Based Evaluation: AppWorld

AppWorld evaluates agents in executable environments with realistic APIs.

Key metrics:

  • Task Goal Completion (TGC): End-to-end success

  • Scenario Goal Completion (SGC): Satisfaction of all constraints

Critical properties:

  • Environment-based ground truth

  • Tool correctness over reasoning fluency

  • Multi-step evaluation
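Under the definitions above, both metrics reduce to simple aggregates over per-task results. A sketch, assuming each result records its scenario id, a completion flag, and a constraint check:

```python
def tgc(results):
    """Task Goal Completion: fraction of tasks completed end-to-end."""
    return sum(r["completed"] for r in results) / len(results)

def sgc(results):
    """Scenario Goal Completion: fraction of scenarios whose tasks
    ALL satisfied their constraints."""
    scenarios = {}
    for r in results:
        scenarios.setdefault(r["scenario"], []).append(r["constraints_ok"])
    return sum(all(checks) for checks in scenarios.values()) / len(scenarios)
```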

Lesson: Evaluation design drives architecture.

24.3 Building AppWorld-Style Eval In-House

You do not need AppWorld to apply its principles.

24.3.1 1. Wrap Actions in Verifiable Systems

Every agent action should yield:

  • Structured outputs

  • Success/failure signals

  • Invariant checks

Examples:

  • API status + schema validation

  • Database assertions

  • Unit tests for generated code
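One way to impose this discipline is to wrap every action so it returns a structured record rather than bare text; the result shape and invariant hooks below are illustrative:

```python
def run_action(action, *args, invariants=()):
    """Execute an action and return a structured record with an explicit
    success signal and named invariant checks."""
    try:
        output = action(*args)
    except Exception as e:
        return {"ok": False, "output": None, "error": str(e), "violations": []}
    violations = [name for name, check in invariants if not check(output)]
    return {"ok": not violations, "output": output, "error": None,
            "violations": violations}

# usage: a transfer must leave a non-negative balance
result = run_action(
    lambda: {"balance": 42},
    invariants=[("non_negative_balance", lambda o: o["balance"] >= 0)],
)
```

The same wrapper pattern covers API schema validation (an invariant over the response body) and database assertions (an invariant querying post-state).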

24.3.2 2. Define Task-Level Success Criteria

Replace “good answer” with:

  • Invariants

  • Constraints

  • Completion conditions

This enables binary, auditable evaluation.

24.3.3 3. Log Full Trajectories

Store:

  • Tool calls

  • Intermediate states

  • Environment feedback

  • Final outcomes

These traces power debugging, optimization, and context adaptation.
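A minimal trajectory logger along these lines (the record schema is an assumption, not a standard):

```python
import json
import time

class TrajectoryLog:
    """Append-only record of one agent episode: tool calls, intermediate
    states, environment feedback, and the final outcome."""

    def __init__(self, task_id):
        self.record = {"task_id": task_id, "steps": [], "outcome": None}

    def log_step(self, tool, args, state, feedback):
        self.record["steps"].append({
            "t": time.time(), "tool": tool, "args": args,
            "state": state, "feedback": feedback,
        })

    def finish(self, outcome):
        self.record["outcome"] = outcome
        return json.dumps(self.record)  # serialized for storage and replay
```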

24.3.4 4. Separate Execution, Evaluation, and Adaptation

Avoid monolithic agents that act, judge, and rewrite context simultaneously.

Use agentic separation: Generator / Reflector / Curator.

24.3.5 5. Track Production-Relevant Metrics

Production KPIs:

  • Task success rate

  • Recovery from failure

  • Steps to completion

  • Cost per successful task

  • Regression across releases

Avoid: Token-level or stylistic metrics.
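These KPIs fall out of the logged trajectories. A sketch, assuming each episode records a success flag, whether a failure occurred mid-run, step count, and cost:

```python
def production_kpis(episodes):
    """Aggregate production KPIs from logged episodes.

    The episode dict shape (success / had_failure / steps / cost) is an
    assumed logging convention, not a standard schema.
    """
    n = len(episodes)
    successes = [e for e in episodes if e["success"]]
    failures_seen = [e for e in episodes if e["had_failure"]]
    return {
        "task_success_rate": len(successes) / n,
        # recovery: of episodes that hit a failure, how many still succeeded
        "recovery_rate": (sum(e["success"] for e in failures_seen)
                          / len(failures_seen)) if failures_seen else None,
        "avg_steps_to_completion": (sum(e["steps"] for e in successes)
                                    / len(successes)) if successes else None,
        # total spend amortized over successful tasks only
        "cost_per_successful_task": (sum(e["cost"] for e in episodes)
                                     / len(successes)) if successes else None,
    }
```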

25 Comparison: Context Engineering vs Alternatives

Method          Adaptation    Cost        Interpretable   Sample Eff.   Persistent
Static prompt   None          None        High            N/A           No
GEPA            Closed-loop   Low         High            High          No
ACE             Online        Low         High            Very High     Yes
RL (PPO/DPO)    Offline       Very High   Low             Low           Yes
Fine-tuning     Offline       High        Low             Medium        Yes

Key insight: Context engineering achieves adaptation without weight updates; it is fast, interpretable, and sample-efficient.

26 Design Principles for Production

26.1 Do Not Compress Knowledge Prematurely

Rich, itemized contexts outperform concise instructions. Let the model filter relevance.

26.2 Preserve Domain Heuristics Explicitly

Use structured formats:

  • Bulleted strategies

  • Failure mode catalog

  • Tool usage constraints

26.3 Separate Execution, Reflection, and Curation

Avoid monolithic loops. Use modular architecture with clear responsibilities.

26.4 Treat Context as State, Not Text

Contexts evolve over time; version control, regression testing, and rollback are essential.

26.5 Engineer Feedback Channels as Carefully as Prompts

Quality of reflection depends on quality of traces. Instrument environment interactions.

27 Interview Summary

Note

Context Engineering: One-Paragraph Synthesis

Context engineering extends prompt engineering to production agents. Static techniques (few-shot, CoT, self-consistency) establish baselines but fail under distribution shift. Closed-loop optimization (GEPA) treats prompts as learnable artifacts refined with execution feedback, achieving sample-efficient improvement via natural language reflection. Agentic Context Engineering (ACE) generalizes this to structured, persistent contexts that evolve incrementally via a Generator-Reflector-Curator architecture, avoiding context collapse while accumulating domain knowledge. Proper evaluation requires environment-grounded metrics (AppWorld-style task completion) rather than textual quality, implemented via verifiable actions, trajectory logging, and production-relevant KPIs. This approach enables runtime adaptation without retraining: fast, interpretable, and cost-efficient.

Note

Key Interview Points

  • Evolution: Static prompts \(\to\) closed-loop optimization \(\to\) agentic context engineering

  • GEPA: Uses language feedback instead of scalar rewards for prompt optimization

  • ACE: Incremental, structured updates prevent context collapse

  • Production framework: DSPy provides declarative programs with optimizers

  • Evaluation: Environment-based (AppWorld) over textual plausibility

  • Trade-off: Context engineering is faster and more interpretable than fine-tuning, but less persistent than weight updates

28 Key References

Note

Primary Sources

  • GEPA: Agrawal et al. (2025). “GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.” arXiv:2507.19457 – Closed-loop prompt optimization using natural language feedback

  • ACE: Zhang et al. (2025). “Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models.” – Generator-Reflector-Curator architecture with delta-based updates

  • AppWorld: Trivedi et al. (2024). “AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents.” – Environment-based evaluation with TGC/SGC metrics

  • DSPy Framework: Khattab et al. – Declarative LLM programs with optimizers (GEPA, MIPROv2)