Table of Contents
- Introduction: The New Paradigm of LLM Programming
- Core Philosophy: Why DSPy Changes Everything
- Getting Started: Your First DSPy Program
- Deep Dive: Signatures - The Foundation of DSPy
- Modules: Building Blocks for AI Systems
- Optimization: Where DSPy Shines
- Guardrails and Reliability: DSPy Assertions
- Advanced Patterns and Techniques
- Production Deployment Patterns
- Common Pitfalls and How to Avoid Them
- Real-World Case Studies
- Troubleshooting and Debugging
- Expert-Level Best Practices
- Conclusion: Mastering DSPy
- Further Reading
🚀 TL;DR: What is DSPy?
What: DSPy is Stanford's framework that replaces manual prompt engineering with automatic prompt optimization.
Why: Stop spending hours crafting prompts that break when models change. Let DSPy generate and optimize them for you.
How: Define what you want (signatures), choose reasoning strategies (modules), and let optimizers improve performance automatically.
Result: 50%+ performance improvements, model portability, and self-improving AI systems that get better with more data.
✅ Perfect for: Teams building production AI systems, scaling beyond prototypes, or tired of prompt brittleness.
Quick Comparison: Traditional Prompting vs DSPy
| Aspect | Traditional Prompting | DSPy |
|---|---|---|
| Development Time | Hours/days per prompt | Minutes to define task |
| Model Portability | Rewrite for each model | Works across all LLMs |
| Performance Optimization | Manual trial & error | Automatic optimization |
| Consistency | Varies between runs | Predictable outputs |
| Composability | Difficult to combine | Modular & chainable |
| Learning Curve | Easy start, hard mastery | Moderate start, systematic mastery |
| Production Readiness | Fragile at scale | Built for production |
Introduction: The New Paradigm of LLM Programming
Traditional prompt engineering has become the bottleneck of modern AI development. Hours spent crafting the perfect prompt, only to have it break when switching models or when requirements change. Enter DSPy (Declarative Self-improving Language Programs) - Stanford's revolutionary framework that transforms how we build AI applications.
DSPy represents a fundamental shift from prompting to programming language models. Instead of manually crafting prompts, you define what you want to achieve, and DSPy automatically generates and optimizes the prompts for you. Think of it as the difference between assembly language and modern programming languages - DSPy provides the abstraction layer that makes LLM development scalable and maintainable.
By the end of this comprehensive guide, you'll understand not just how to use DSPy, but how to think in DSPy - moving from brittle, handcrafted prompts to robust, self-optimizing AI systems that improve themselves over time.
Core Philosophy: Why DSPy Changes Everything
The DSPy philosophy can be summarized as: "There will be better strategies, optimizations, and models tomorrow. Don't be dependent on any one". This approach decouples your task definition from specific LLMs, prompting techniques, and optimization strategies, making your code future-proof and portable.
DSPy Architecture Overview
The Problem with Traditional Prompting
Traditional prompt engineering suffers from several critical issues that make it unsustainable for production applications:
- Brittleness: Prompts break when models change or update
- Time-consuming: Manual tuning takes hours or days per prompt
- Non-portable: Prompts optimized for GPT-4 fail on Claude or Llama
- Inconsistent: Results vary significantly across runs
- Maintenance nightmare: Every model update requires prompt rewrites
- No composability: Can't easily combine prompts into larger systems
The DSPy Solution
DSPy addresses these problems through three core abstractions that transform how we build with LLMs:
- Signatures: Declarative task specifications that define input/output contracts without implementation details
- Modules: Composable building blocks with different reasoning strategies (Chain-of-Thought, ReAct, etc.)
- Optimizers: Automatic prompt improvement algorithms that learn from examples
This separation of concerns enables you to write AI applications that automatically adapt to new models, improve with more data, and maintain consistent performance across different deployment scenarios.
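To make the division of labor concrete, here is a minimal sketch of all three abstractions working together. The ticket-classification task, the two training examples, and the exact_match metric are illustrative placeholders (not part of DSPy itself), and a language model must already be configured with dspy.configure before compiling:
import dspy
from dspy.teleprompt import BootstrapFewShot
# Signature + module: declare the task, then pick a reasoning strategy for it
classify = dspy.ChainOfThought("ticket_text -> category")
# A couple of labeled examples (illustrative data)
trainset = [
    dspy.Example(ticket_text="My card was charged twice", category="billing").with_inputs("ticket_text"),
    dspy.Example(ticket_text="The app crashes on login", category="technical").with_inputs("ticket_text"),
]
# Metric: defines what "better" means for the optimizer
def exact_match(gold, pred, trace=None):
    return gold.category.lower() == pred.category.lower()
# Optimizer: selects demonstrations and improves prompts to maximize the metric
optimizer = BootstrapFewShot(metric=exact_match)
optimized_classify = optimizer.compile(classify, trainset=trainset)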
Getting Started: Your First DSPy Program
Installation and Setup
Before diving into code, let's set up our environment. DSPy works with any LLM provider - OpenAI, Anthropic, Cohere, or even local models through Ollama. The beauty of DSPy is that you can develop locally with free models and seamlessly switch to production models later without changing your code. This flexibility alone can save teams thousands of dollars during development.
# Install DSPy
pip install dspy-ai
# For local models (optional but recommended for development)
brew install ollama # MacOS
# or
curl -fsSL https://ollama.ai/install.sh | sh # Linux
# Pull a local model
ollama pull llama3
ollama serve
Why use Ollama for local development? It gives you a GPT-3.5 equivalent model running entirely on your machine - no API costs, no rate limits, and complete privacy. Perfect for experimenting and testing your DSPy programs before deploying with commercial models.
Hello World Example
Let's build our first DSPy program - a math question answering system. This example might look simple, but it showcases three revolutionary concepts: declarative task definition, automatic prompt generation, and reasoning transparency. Here's what makes each line special:
import dspy
# Configure the language model
dspy.configure(lm=dspy.LM('openai/gpt-4o-mini'))
class MathQA(dspy.Module):
def __init__(self):
super().__init__()
# Define the module using Chain-of-Thought reasoning
self.solve = dspy.ChainOfThought("question -> answer: float")
def forward(self, question: str):
return self.solve(question=question)
# Instantiate and invoke the module
qa = MathQA()
result = qa("What is 3 * 7 + 2?")
print(result.answer) # Output: 23.0
# Inspect the reasoning process
print(result.reasoning) # Shows the step-by-step thinking
Let's break down what just happened in those few lines of code:
- The dspy.configure call sets the language model. Switch to dspy.LM('anthropic/claude-3') or dspy.LM('ollama/llama3') and your code stays the same!
- The signature "question -> answer: float" tells DSPy what goes in and what comes out. The : float type hint ensures we get a numeric answer.
- ChainOfThought adds step-by-step reasoning automatically - no need to write "let's think step by step" in your prompts!
- When we call our module, DSPy generates, optimizes, and executes the prompt behind the scenes.
This simple example demonstrates DSPy's core principle: you define the signature ("question -> answer: float") and choose a reasoning strategy (ChainOfThought), while DSPy handles the prompt generation automatically. No manual prompt crafting required!
Understanding the Magic
What's truly remarkable is what DSPy does behind the scenes. It doesn't just template your inputs - it generates sophisticated prompts with reasoning chains, format instructions, and type validation. Let's peek under the hood to see the actual prompt DSPy created:
# See the actual prompt DSPy generated
dspy.inspect_history(n=1)
# Output shows something like:
# System: You are answering questions. Think step by step.
# User: question: What is 3 * 7 + 2?
#
# Let me work through this step-by-step:
# 1. First, I need to multiply 3 * 7 = 21
# 2. Then add 2 to get 21 + 2 = 23
#
# answer: 23.0
Deep Dive: Signatures - The Foundation of DSPy
Signatures are DSPy's way of defining input/output behavior without specifying implementation details. They act as contracts that any module implementing them must fulfill. Think of them as type hints for AI - they tell the system what goes in and what should come out.
Why Signatures Matter
In traditional prompting, you might write: "Given the context '{context}', answer the question '{question}' with a short response." But what happens when you need to:
- Change the format of the response?
- Add validation for the output?
- Switch to a different model that expects different formatting?
- Compose this with other prompts?
With signatures, you define the interface once, and DSPy handles all these concerns automatically. It's the difference between hardcoding and using an abstraction layer.
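As a rough sketch of the contrast (the context, question, and prompt template below are illustrative, and the DSPy call assumes a model has already been configured):
import dspy
context = "The Eiffel Tower is located in Paris, France."
question = "Where is the Eiffel Tower?"
# Traditional approach: the prompt string hardcodes format, phrasing, and output handling
prompt = f"Given the context '{context}', answer the question '{question}' with a short response."
# DSPy approach: declare only the interface and let the framework build the prompt
qa = dspy.Predict("context, question -> answer")
result = qa(context=context, question=question)
print(result.answer)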
String-based Signatures
String signatures are the quickest way to get started. They use a simple arrow notation that's intuitive and readable. Here's how they work and when to use each pattern:
# Basic question answering
"question -> answer"
# Sentiment analysis with type specification
"sentence -> sentiment: bool"
# Multi-input example
"context, question -> answer"
# Multiple outputs
"document -> summary, keywords, sentiment"
# Creative examples showing flexibility
"baseball_player -> affiliated_team"
"novella -> tldr"
"code -> documentation"
"symptoms -> possible_diagnoses"Each signature pattern above serves a specific purpose:
- Simple I/O ("question -> answer"): Perfect for straightforward transformations
- Type hints (": bool", ": list[str]"): DSPy will validate and coerce outputs to match your type
- Multiple inputs ("context, question -> answer"): Common for RAG systems where you need both context and query
- Multiple outputs ("-> summary, keywords, sentiment"): Get structured data back in one call
- Domain-specific ("symptoms -> possible_diagnoses"): The naming itself provides semantic hints to the model
The arrow (->) separates inputs from outputs. You can specify types after a colon, and DSPy will ensure the output matches that type. This type safety is crucial for production systems where you need predictable outputs.
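For example, a typed multi-output signature might look like the sketch below (the review-analysis task is illustrative, and a model must already be configured):
import dspy
# Typed outputs: DSPy parses the model's text and coerces it into these Python types
analyze = dspy.Predict("review -> is_positive: bool, topics: list[str]")
result = analyze(review="Battery life is great, but the screen scratches easily.")
print(type(result.is_positive))  # <class 'bool'>
print(result.topics)             # e.g. ['battery life', 'screen durability']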
Class-based Signatures
While string signatures are great for prototyping, class-based signatures are what you'll use in production. They provide rich metadata that DSPy uses to generate more accurate and reliable prompts. Here's why they're powerful:
Benefits of Class Signatures
- ✓ Detailed field descriptions
- ✓ Default values
- ✓ Complex type validation
- ✓ Better prompt generation
- ✓ Self-documenting code
When to Use
- Production systems
- Complex data structures
- When accuracy matters
- Team collaboration
- API interfaces
class QA(dspy.Signature):
"""Answer questions based on provided context."""
context: str = dspy.InputField(
desc="Background information that may contain the answer"
)
question: str = dspy.InputField(
desc="The question to be answered"
)
answer: str = dspy.OutputField(
desc="A concise, accurate answer based on the context"
)
class CodeGeneration(dspy.Signature):
"""Generate Python code to solve the given problem."""
problem_description: str = dspy.InputField(
desc="Clear description of the problem to solve"
)
requirements: list[str] = dspy.InputField(
desc="List of specific requirements or constraints",
default_factory=list
)
code: str = dspy.OutputField(
desc="Complete, working Python code with comments"
)
explanation: str = dspy.OutputField(
desc="Brief explanation of the approach taken"
)
Pro Tip
The quality of your descriptions directly impacts the quality of DSPy's generated prompts. Be specific and clear about what you expect. Good descriptions lead to 30-40% better performance!
Signature Design Patterns
Here are some powerful patterns for designing effective signatures:
# Pattern 1: Structured Output
class StructuredAnalysis(dspy.Signature):
"""Analyze text and return structured data."""
text: str = dspy.InputField()
entities: list[dict] = dspy.OutputField(
desc="List of {name, type, confidence} dictionaries"
)
# Pattern 2: Conditional Logic
class ConditionalResponse(dspy.Signature):
"""Provide different responses based on input type."""
query: str = dspy.InputField()
query_type: str = dspy.OutputField(
desc="One of: factual, opinion, action"
)
response: str = dspy.OutputField(
desc="Appropriate response based on query type"
)
# Pattern 3: Multi-step Processing
class MultiStepAnalysis(dspy.Signature):
"""Complex analysis with intermediate steps."""
raw_data: str = dspy.InputField()
cleaned_data: str = dspy.OutputField(
desc="Data after cleaning and normalization"
)
insights: list[str] = dspy.OutputField(
desc="Key insights extracted from the data"
)
recommendations: str = dspy.OutputField(
desc="Actionable recommendations based on insights"
)
Modules: Building Blocks for AI Systems
Modules implement different reasoning strategies and can be composed like neural network layers. Each module represents a different way of thinking about problems.
The Module Philosophy
Think of modules like specialized AI agents, each with its own reasoning style. Just as you might approach a math problem differently than a creative writing task, DSPy modules provide different cognitive strategies:
- → Predict: Direct answers, like asking a knowledgeable friend
- → ChainOfThought: Step-by-step reasoning, like a teacher showing their work
- → ProgramOfThought: Code generation, like a programmer solving algorithmically
- → ReAct: Tool use and reasoning, like a researcher with resources
The beauty? You can swap modules without changing your code structure. Start with Predict, upgrade to ChainOfThought when you need reasoning transparency, all without refactoring.
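For instance, moving from direct prediction to step-by-step reasoning is a one-line swap, as in this small sketch (the summarization signature is illustrative):
import dspy
signature = "article -> summary"
# Start simple with direct prediction...
summarize = dspy.Predict(signature)
# ...then swap in a different reasoning strategy later without touching any callers
summarize = dspy.ChainOfThought(signature)
result = summarize(article="DSPy separates task definitions from prompting strategies...")
print(result.summary)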
Core Modules
dspy.Predict: Basic Prompting
The simplest module - it takes your signature and generates a straightforward prompt. Use this when you need quick, direct answers without intermediate reasoning steps. It's perfect for simple classifications, extractions, or when speed matters more than explainability.
# Simplest module - direct prompting without special strategies
predictor = dspy.Predict("question -> answer")
result = predictor(question="What is the capital of France?")
print(result.answer) # "Paris"
Notice how clean this is? No prompt templates, no "You are a helpful assistant..." preambles. DSPy handles all of that based on your signature. The model knows it needs to return an "answer" field because that's what you specified.
dspy.ChainOfThought: Step-by-Step Reasoning
ChainOfThought is probably the most important module you'll use. It automatically adds reasoning steps before the final answer, dramatically improving accuracy on complex tasks. Research shows this can improve performance by 20-50% on reasoning tasks. Here's the magic:
# Adds "let's think step by step" reasoning
cot = dspy.ChainOfThought("question -> answer")
result = cot(question="If I have 3 apples and buy 2 more, then give away 1, how many do I have?")
print(result.reasoning) # Shows step-by-step calculation
print(result.answer) # "4"
What's happening here? DSPy automatically injected a "reasoning" field into your output, even though your signature only specified "answer". The module generates intermediate thinking steps, making the model's logic transparent and debuggable. This is invaluable for:
- Debugging why the model gave a certain answer
- Building trust with users who can see the reasoning
- Improving accuracy through structured thinking
- Training data generation (the reasoning becomes part of your dataset!)
dspy.ProgramOfThought: Code Generation for Problem Solving
Here's where things get really interesting. ProgramOfThought doesn't just think about problems - it writes and executes actual Python code to solve them. This is revolutionary for mathematical, algorithmic, or data processing tasks. The model becomes a programmer:
# Generates and executes code to solve problems
pot = dspy.ProgramOfThought("question -> answer")
result = pot(question="What is the sum of all prime numbers between 1 and 20?")
# Internally generates and runs code like:
# def find_primes(n):
# primes = []
# for num in range(2, n+1):
# if all(num % i != 0 for i in range(2, int(num**0.5) + 1)):
# primes.append(num)
# return sum(primes)
dspy.ReAct: Reasoning and Acting with Tools
ReAct interleaves reasoning steps with tool calls, letting the model look things up or run calculations through functions you provide before producing its final answer:
# Define tools the model can use
def web_search(query: str) -> str:
"""Search the web for information."""
# Your search implementation
return f"Results for {query}..."
def calculator(expression: str) -> float:
"""Evaluate mathematical expressions."""
return eval(expression) # Don't use eval in production!
# Create ReAct agent with tools
agent = dspy.ReAct("question -> answer", tools=[web_search, calculator])
result = agent(question="What is the population of Tokyo multiplied by 2?")
# The agent will:
# 1. Search for Tokyo's population
# 2. Use calculator to multiply by 2
# 3. Return the final answer
Building Complex Pipelines
The real power comes from composing modules into sophisticated AI systems:
class AdvancedRAG(dspy.Module):
def __init__(self, k=5):
super().__init__()
# Multiple retrieval strategies
self.keyword_retrieve = dspy.Retrieve(k=k)
self.semantic_retrieve = dspy.Retrieve(k=k, similarity="cosine")
# Query expansion for better retrieval
self.expand_query = dspy.ChainOfThought(
"question -> expanded_queries: list[str]"
)
# Answer generation with reasoning
self.generate = dspy.ChainOfThought(
"context, question -> answer, confidence: float"
)
# Self-verification
self.verify = dspy.Predict(
"question, answer -> is_correct: bool, explanation"
)
def forward(self, question):
# Expand the query for better retrieval
expanded = self.expand_query(question=question)
# Retrieve from multiple sources
keyword_docs = self.keyword_retrieve(question)
semantic_docs = self.semantic_retrieve(
expanded.expanded_queries[0] if expanded.expanded_queries else question
)
# Combine and deduplicate contexts
all_contexts = list(set(keyword_docs + semantic_docs))
context = "\n".join(all_contexts[:5])
# Generate answer with confidence
answer = self.generate(context=context, question=question)
# Self-verify the answer
verification = self.verify(
question=question,
answer=answer.answer
)
# Return comprehensive result
return dspy.Prediction(
answer=answer.answer,
confidence=answer.confidence,
verified=verification.is_correct,
explanation=verification.explanation,
sources=all_contexts[:3]
)
Key Insight
Modules are completely swappable. You can replace ChainOfThought with ProgramOfThought or any other reasoning strategy without changing the rest of your pipeline. This modularity is what makes DSPy so powerful for experimentation and optimization.
Optimization: Where DSPy Shines
The real magic of DSPy lies in its optimizers - algorithms that automatically improve your prompts and few-shot examples based on your specific data and metrics. This is where DSPy transforms from a nice abstraction to a game-changing framework.
How DSPy Optimization Works
Understanding Optimization
The Optimization Magic
Here's what makes DSPy optimization revolutionary: instead of you manually tweaking prompts for hours, DSPy automatically:
1. Generates multiple prompt variations
2. Tests them against your data
3. Selects the best-performing examples
4. Creates optimized instructions
5. Builds few-shot demonstrations
Result? Performance improvements of 20-68% are common, with some tasks seeing 2-3x better accuracy. All automatically.
DSPy optimizers take three inputs and produce an optimized version of your program:
- Your DSPy program: Single module or complex pipeline
- Your metric: Function that scores outputs (higher is better)
- Training examples: Can be small - even 5-10 examples work!
Core Optimizers
BootstrapFewShot: Learning from Examples
BootstrapFewShot is your go-to optimizer for most tasks. It's fast, reliable, and works with minimal data. The name tells you what it does: it "bootstraps" (automatically generates) few-shot examples from your training data. Here's how to use it:
from dspy.teleprompt import BootstrapFewShot
# Define your metric - this determines what "good" means
def accuracy_metric(gold, pred, trace=None):
# Check if the answer is correct (case-insensitive)
return gold.answer.lower() == pred.answer.lower()
# Prepare training examples
train_examples = [
dspy.Example(
question="What is 2+2?",
answer="4"
).with_inputs("question"),
dspy.Example(
question="What is the capital of France?",
answer="Paris"
).with_inputs("question"),
# Add more examples...
]
# Set up the optimizer
teleprompter = BootstrapFewShot(
metric=accuracy_metric,
max_bootstrapped_demos=4, # How many examples to include in prompt
max_labeled_demos=4, # Max hand-labeled examples to use
max_errors=5 # Stop after this many failed attempts
)
# Optimize your program
base_program = MathQA()
optimized_program = teleprompter.compile(
base_program,
trainset=train_examples
)
# The optimized program now includes few-shot examples!
result = optimized_program("What is 5 * 6?")
print(f"Answer: {result.answer}")What Just Happened?
The optimizer just transformed your simple program into a sophisticated system with:
- Few-shot examples: Automatically selected the best examples from your training data
- Optimized prompts: Generated instructions that work best for your specific task
- Error handling: Learned from failures to avoid common mistakes
Your original 5-line program now performs like a carefully hand-tuned system that would have taken days to create manually.
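If you want to see exactly what the optimizer attached, you can introspect the compiled program. This sketch assumes the optimized_program from above; named_predictors() and the demos attribute are the standard introspection points on DSPy modules and their predictors:
# Peek at the demonstrations the optimizer bootstrapped into each predictor
for name, predictor in optimized_program.named_predictors():
    print(f"{name}: {len(predictor.demos)} bootstrapped demos")
# Or run one query and dump the full prompt that was actually sent
optimized_program("What is 9 * 9?")
dspy.inspect_history(n=1)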
MIPROv2: State-of-the-Art Optimization
MIPROv2 (Multiprompt Instruction Proposal Optimizer, version 2) is DSPy's flagship optimizer. It uses Bayesian optimization to search for high-performing prompts for your task. While BootstrapFewShot selects examples, MIPROv2 goes further - it proposes custom instructions and jointly tunes the combination of instructions and few-shot demonstrations. Here's how to use it:
from dspy.teleprompt import MIPROv2
# Define a more sophisticated metric
def advanced_metric(gold, pred, trace=None):
"""Multi-faceted evaluation metric."""
# Accuracy component
accuracy = 1.0 if gold.answer.lower() in pred.answer.lower() else 0.0
# Length penalty (prefer concise answers)
length_score = min(1.0, 50 / len(pred.answer.split()))
# Check if reasoning is provided (for ChainOfThought)
has_reasoning = 1.0 if hasattr(pred, 'reasoning') and pred.reasoning else 0.5
# Weighted combination
return 0.6 * accuracy + 0.2 * length_score + 0.2 * has_reasoning
# Configure MIPROv2
teleprompter = MIPROv2(
metric=advanced_metric,
auto="medium", # Optimization intensity: "light", "medium", or "heavy"
num_threads=4, # Parallel optimization threads
verbose=True, # Show optimization progress
track_stats=True # Track optimization statistics
)
# Run optimization with more control
optimized_program = teleprompter.compile(
program=base_program,
trainset=train_examples,
valset=validation_examples, # Optional validation set
max_bootstrapped_demos=4,
max_labeled_demos=4,
eval_kwargs={
'num_threads': 8,
'display_progress': True
}
)
# Save the optimized program
optimized_program.save("models/optimized_qa_v1.json")
# Load it later
loaded_program = MathQA()                            # re-create the program class
loaded_program.load("models/optimized_qa_v1.json")   # then load the optimized state
The auto parameter controls optimization intensity:
- light: Quick optimization (~5 minutes, good for development)
- medium: Balanced optimization (~20 minutes, recommended default)
- heavy: Extensive optimization (~1+ hours, for production)
Advanced Optimization Techniques
Optimizers can also be chained, with each stage refining the program produced by the previous one. The sketch below assumes train, full_train, and test_set splits you have already prepared:
from dspy.teleprompt import BootstrapFewShot, BootstrapFewShotWithRandomSearch, MIPROv2
# Stage 1: Bootstrap basic examples
bootstrap = BootstrapFewShot(
metric=accuracy_metric,
max_bootstrapped_demos=8
)
stage1_program = bootstrap.compile(base_program, trainset=train[:50])
# Stage 2: Optimize with MIPRO on bootstrapped program
mipro = MIPROv2(
metric=advanced_metric,
auto="medium"
)
stage2_program = mipro.compile(stage1_program, trainset=train)
# Stage 3: Fine-tune with more data
finetuner = BootstrapFewShotWithRandomSearch(
metric=advanced_metric,
num_candidate_programs=10,
num_threads=4
)
final_program = finetuner.compile(stage2_program, trainset=full_train)
# Evaluate final performance
evaluator = dspy.evaluate.Evaluate(
devset=test_set,
metric=advanced_metric,
num_threads=8,
display_progress=True
)
results = evaluator(final_program)
print(f"Final accuracy: {results['metric']:.2%}")
print(f"Examples processed: {results['processed']}")
print(f"Average latency: {results['avg_latency']:.2f}s")Optimization Best Practices
Important: Cost Considerations
Optimization can be expensive! A single MIPROv2 heavy optimization can cost $5-50 depending on your dataset size. Always start with smaller datasets and cheaper models for initial experiments.
- Start Simple: Begin with BootstrapFewShot before moving to advanced optimizers
- Representative Data: Ensure training examples reflect real-world usage patterns
- Meaningful Metrics: Design metrics that capture actual business value, not just accuracy
- Separate Train/Val/Test: Use distinct datasets to avoid overfitting (60/20/20 split; a simple split helper is sketched after this list)
- Cost Management: Start optimization with smaller, cheaper models, then transfer to larger ones
- Iterate Gradually: Run multiple optimization rounds with increasing complexity
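For the train/val/test point above, a simple approach is to shuffle once with a fixed seed and slice 60/20/20. This is a small sketch; all_examples stands in for your own list of dspy.Example objects:
import random

def split_dataset(examples, seed=42):
    """Shuffle once with a fixed seed, then slice into 60/20/20 train/val/test."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train = shuffled[:int(0.6 * n)]
    val = shuffled[int(0.6 * n):int(0.8 * n)]
    test = shuffled[int(0.8 * n):]
    return train, val, test

trainset, valset, testset = split_dataset(all_examples)  # all_examples: your labeled data (placeholder)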
Guardrails and Reliability: DSPy Assertions
DSPy Assertions provide a sophisticated way to enforce constraints and guide model behavior. They act as guardrails that ensure your AI systems produce reliable, consistent outputs that meet your requirements.
Assert vs Suggest
DSPy provides two types of constraints with different enforcement levels:
- dspy.Assert: Hard constraints that halt execution if violated (use for critical requirements)
- dspy.Suggest: Soft constraints that encourage refinement but don't stop execution (use for preferences)
import dspy
import json
def is_valid_json(text):
"""Check if text is valid JSON."""
try:
json.loads(text)
return True
except:
return False
def has_required_keys(json_str, required_keys):
"""Check if JSON has required keys."""
try:
data = json.loads(json_str)
return all(key in data for key in required_keys)
except:
return False
class StructuredGenerator(dspy.Module):
def __init__(self):
super().__init__()
self.generate = dspy.ChainOfThought(
"topic -> json_response: str, summary: str"
)
def forward(self, topic):
# Generate initial response
response = self.generate(topic=topic)
# Hard constraint - MUST be valid JSON
dspy.Assert(
is_valid_json(response.json_response),
"Response must be valid JSON format. Please fix syntax errors.",
backtrack=self.generate # Retry this module if assertion fails
)
# Hard constraint - MUST have required fields
dspy.Assert(
has_required_keys(response.json_response, ['title', 'content', 'metadata']),
"JSON must include 'title', 'content', and 'metadata' fields"
)
# Soft constraint - SHOULD be concise
dspy.Suggest(
len(response.json_response) < 500,
"Consider making the response more concise (under 500 characters)"
)
# Soft constraint - SHOULD have good summary
dspy.Suggest(
len(response.summary.split()) >= 10,
"Summary should be at least 10 words for clarity"
)
return response
# Usage with automatic retry on assertion failures
generator = StructuredGenerator()
result = generator(topic="Benefits of exercise")
# DSPy will automatically retry if assertions fail,
# providing feedback to the model for self-refinement
Advanced Assertion Patterns
class SecureCodeGenerator(dspy.Module):
def __init__(self):
super().__init__()
self.generate_code = dspy.ChainOfThought(
"requirements -> code: str, explanation: str"
)
self.security_check = dspy.Predict(
"code -> has_vulnerabilities: bool, issues: list[str]"
)
def forward(self, requirements):
# Generate code
result = self.generate_code(requirements=requirements)
# Security validation
security = self.security_check(code=result.code)
# Multi-level assertions
dspy.Assert(
not security.has_vulnerabilities,
f"Security issues detected: {security.issues}. Please fix.",
backtrack=self.generate_code
)
# Code quality checks
dspy.Assert(
"eval(" not in result.code and "exec(" not in result.code,
"Code must not use eval() or exec() for security reasons"
)
dspy.Suggest(
result.code.count('\n') < 50,
"Consider breaking down the solution into smaller functions"
)
# Documentation check
dspy.Assert(
'"""' in result.code or "'''" in result.code or '#' in result.code,
"Code must include documentation (docstrings or comments)"
)
return result
class DataValidationPipeline(dspy.Module):
def __init__(self):
super().__init__()
self.extract = dspy.ChainOfThought(
"raw_text -> structured_data: dict"
)
self.validate = dspy.Predict(
"data -> is_valid: bool, errors: list[str]"
)
self.transform = dspy.ChainOfThought(
"data, errors -> corrected_data: dict"
)
def forward(self, raw_text):
# Extract structured data
extraction = self.extract(raw_text=raw_text)
# Validate extraction
validation = self.validate(data=extraction.structured_data)
# If invalid, attempt correction
if not validation.is_valid:
dspy.Suggest(
False,
f"Data validation issues: {validation.errors}. Attempting correction..."
)
# Transform with error feedback
corrected = self.transform(
data=extraction.structured_data,
errors=validation.errors
)
# Re-validate corrected data
final_validation = self.validate(data=corrected.corrected_data)
dspy.Assert(
final_validation.is_valid,
"Unable to produce valid data after correction attempt",
backtrack=self.extract # Start over from extraction
)
return corrected
return extraction
Custom Backtracking Strategies
DSPy allows sophisticated backtracking strategies for handling assertion failures:
class AdaptiveRetryModule(dspy.Module):
def __init__(self, max_retries=3):
super().__init__()
self.max_retries = max_retries
self.attempt_count = 0
# Different strategies for different attempt numbers
self.simple_generate = dspy.Predict("input -> output")
self.cot_generate = dspy.ChainOfThought("input -> output")
self.react_generate = dspy.ReAct("input -> output")
def forward(self, input_text):
self.attempt_count += 1
# Escalate strategy based on attempt number
if self.attempt_count == 1:
result = self.simple_generate(input=input_text)
elif self.attempt_count == 2:
result = self.cot_generate(input=input_text)
else:
result = self.react_generate(input=input_text)
# Custom validation
is_valid = self.validate_output(result.output)
# Assert with custom backtracking
dspy.Assert(
is_valid or self.attempt_count >= self.max_retries,
f"Output validation failed (attempt {self.attempt_count}/{self.max_retries})",
backtrack=self if self.attempt_count < self.max_retries else None
)
return result
def validate_output(self, output):
# Your validation logic here
return len(output) > 10 and "error" not in output.lower()
Advanced Patterns and Techniques
As you become more proficient with DSPy, these advanced patterns will help you build sophisticated, production-ready AI systems.
Multi-Agent Systems
class ResearchAgentSystem(dspy.Module):
def __init__(self):
super().__init__()
# Specialized agents for different tasks
self.researcher = ResearchAgent()
self.fact_checker = FactCheckAgent()
self.writer = WriterAgent()
self.editor = EditorAgent()
def forward(self, topic, style="academic"):
# Research phase
research = self.researcher(topic=topic)
# Fact-checking phase
verified_facts = self.fact_checker(
claims=research.findings,
sources=research.sources
)
# Writing phase
draft = self.writer(
facts=verified_facts.verified_claims,
style=style,
outline=research.outline
)
# Editing phase
final = self.editor(
draft=draft.content,
style_guide=style,
fact_sheet=verified_facts.verified_claims
)
return dspy.Prediction(
article=final.edited_content,
sources=research.sources,
fact_check_report=verified_facts.report,
confidence=final.confidence_score
)
class ResearchAgent(dspy.Module):
def __init__(self):
super().__init__()
self.search = dspy.ChainOfThought("topic -> search_queries: list[str]")
self.retrieve = dspy.Retrieve(k=10)
self.synthesize = dspy.ChainOfThought(
"sources, topic -> findings: list[str], outline: str"
)
def forward(self, topic):
# Generate diverse search queries
queries = self.search(topic=topic)
# Retrieve from multiple queries
all_sources = []
for query in queries.search_queries[:3]:
sources = self.retrieve(query)
all_sources.extend(sources)
# Deduplicate and synthesize
unique_sources = list(set(all_sources))
synthesis = self.synthesize(
sources="\n".join(unique_sources[:10]),
topic=topic
)
return dspy.Prediction(
findings=synthesis.findings,
outline=synthesis.outline,
sources=unique_sources[:5]
)
Dynamic Module Selection
class AdaptiveQA(dspy.Module):
def __init__(self):
super().__init__()
# Classifier to determine question type
self.classifier = dspy.Predict(
"question -> question_type: str, complexity: str"
)
# Different modules for different question types
self.simple_qa = dspy.Predict("question -> answer")
self.math_qa = dspy.ProgramOfThought("question -> answer")
self.research_qa = ComplexRAG()
self.creative_qa = dspy.ChainOfThought(
"question -> creative_response, inspiration_sources"
)
def forward(self, question):
# Classify the question
classification = self.classifier(question=question)
# Route to appropriate module
if classification.question_type == "mathematical":
return self.math_qa(question=question)
elif classification.question_type == "factual":
if classification.complexity == "simple":
return self.simple_qa(question=question)
else:
return self.research_qa(question=question)
elif classification.question_type == "creative":
return self.creative_qa(question=question)
else:
# Default fallback
return dspy.ChainOfThought("question -> answer")(question=question)
Ensemble Methods
class EnsemblePredictor(dspy.Module):
def __init__(self, programs: list):
super().__init__()
self.programs = programs
self.aggregator = dspy.ChainOfThought(
"predictions: list[str], confidence_scores: list[float] -> final_answer, explanation"
)
def forward(self, **kwargs):
# Collect predictions from all programs
predictions = []
confidence_scores = []
for program in self.programs:
try:
result = program(**kwargs)
predictions.append(result.answer)
# Extract confidence if available
confidence = getattr(result, 'confidence', 0.5)
confidence_scores.append(confidence)
except Exception as e:
print(f"Program failed: {e}")
continue
# Aggregate predictions intelligently
aggregated = self.aggregator(
predictions=predictions,
confidence_scores=confidence_scores
)
return dspy.Prediction(
answer=aggregated.final_answer,
explanation=aggregated.explanation,
individual_predictions=predictions,
confidence_scores=confidence_scores
)
import random
# Train multiple programs with different configurations
def create_ensemble(base_program, train_data, n_models=3):
programs = []
for i in range(n_models):
# Different optimization strategies
if i == 0:
optimizer = BootstrapFewShot(metric=accuracy_metric)
elif i == 1:
optimizer = MIPROv2(metric=accuracy_metric, auto="light")
else:
optimizer = MIPROv2(metric=accuracy_metric, auto="medium")
# Train with different data samples
sample_size = int(len(train_data) * 0.8)
sample = random.sample(train_data, sample_size)
optimized = optimizer.compile(
base_program.deepcopy(),
trainset=sample
)
programs.append(optimized)
return EnsemblePredictor(programs)
Custom Metrics with LLM Judges
class LLMJudge(dspy.Module):
def __init__(self, criteria: str):
super().__init__()
self.criteria = criteria
self.judge = dspy.ChainOfThought(
"""question, answer, criteria ->
score: float, strengths: list[str], weaknesses: list[str], suggestions: str"""
)
def forward(self, question, answer):
return self.judge(
question=question,
answer=answer,
criteria=self.criteria
)
def create_llm_metric(criteria: str):
"""Factory function to create LLM-based metrics."""
judge = LLMJudge(criteria)
def metric(gold, pred, trace=None):
# Use LLM to evaluate the prediction
evaluation = judge(
question=gold.question,
answer=pred.answer
)
# Return normalized score
return float(evaluation.score)
return metric
# Create specialized metrics
accuracy_metric = create_llm_metric(
"Rate accuracy from 0-1 based on factual correctness"
)
helpfulness_metric = create_llm_metric(
"Rate helpfulness from 0-1 based on how well it addresses the user's needs"
)
safety_metric = create_llm_metric(
"Rate safety from 0-1, checking for harmful or inappropriate content"
)
# Combine metrics
def combined_metric(gold, pred, trace=None):
acc = accuracy_metric(gold, pred, trace)
help = helpfulness_metric(gold, pred, trace)
safe = safety_metric(gold, pred, trace)
# Weighted combination with safety as a hard requirement
if safe < 0.8:
return 0.0 # Fail if not safe
return 0.6 * acc + 0.4 * help
Production Deployment Patterns
Taking DSPy from development to production requires careful consideration of performance, reliability, and monitoring. Here are battle-tested patterns for production deployment.
Production Deployment Architecture
Caching and Performance Optimization
In production, every API call costs money and adds latency. DSPy's built-in caching is good, but for production systems serving thousands of requests, you need enterprise-grade caching. Here's how to implement a Redis-based caching layer that can handle millions of requests with sub-millisecond latency:
import dspy
from dspy.cache import Cache
import redis
import hashlib
import json
class RedisCache(Cache):
"""Production-grade Redis cache for DSPy."""
def __init__(self, redis_url="redis://localhost:6379", ttl=3600):
self.redis_client = redis.from_url(redis_url)
self.ttl = ttl
def get_key(self, *args, **kwargs):
"""Generate cache key from arguments."""
key_data = json.dumps({"args": args, "kwargs": kwargs}, sort_keys=True)
return f"dspy:cache:{hashlib.md5(key_data.encode()).hexdigest()}"
def get(self, *args, **kwargs):
"""Retrieve from cache."""
key = self.get_key(*args, **kwargs)
cached = self.redis_client.get(key)
if cached:
return json.loads(cached)
return None
def set(self, value, *args, **kwargs):
"""Store in cache."""
key = self.get_key(*args, **kwargs)
self.redis_client.setex(
key,
self.ttl,
json.dumps(value)
)
# Configure DSPy with production cache
cache = RedisCache(redis_url="redis://prod-redis:6379", ttl=7200)
dspy.configure(
lm=dspy.LM('openai/gpt-4'),
cache=cache
)
# Add request-level caching for APIs
from functools import lru_cache
from concurrent.futures import ThreadPoolExecutor
import asyncio
class CachedDSPyModule(dspy.Module):
def __init__(self, base_module, cache_size=128):
super().__init__()
self.base_module = base_module
self._cache = lru_cache(maxsize=cache_size)(self._cached_forward)
def _cached_forward(self, input_hash):
# Reconstruct input from hash
return self.base_module.forward(**json.loads(input_hash))
def forward(self, **kwargs):
# Create hashable input
input_hash = json.dumps(kwargs, sort_keys=True)
return self._cache(input_hash)
Production Caching Benefits
This caching strategy provides:
- Cost Reduction: 90%+ reduction in API calls for repeated queries
- Latency: Sub-millisecond response times for cached results
- Scale: Redis can handle millions of cached entries
- TTL Control: Automatic cache expiration for fresh data
- Request-level caching: LRU cache for hot paths within single requests
Pro tip: Use different TTLs for different types of queries. Static facts can cache for days, while time-sensitive data might only cache for minutes.
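One way to act on that tip, building on the RedisCache class above, is to choose the TTL from a query-type hint at write time. The categories and durations below are illustrative assumptions:
# Illustrative TTLs per query category - tune these for your domain
TTL_BY_QUERY_TYPE = {
    "static_fact": 7 * 24 * 3600,   # encyclopedic facts: cache for a week
    "product_info": 24 * 3600,      # catalog data: cache for a day
    "time_sensitive": 5 * 60,       # news, prices: cache for minutes
}

class TieredRedisCache(RedisCache):
    """RedisCache variant that picks a TTL based on an optional query_type hint."""
    def set(self, value, *args, query_type="static_fact", **kwargs):
        key = self.get_key(*args, **kwargs)
        ttl = TTL_BY_QUERY_TYPE.get(query_type, self.ttl)
        self.redis_client.setex(key, ttl, json.dumps(value))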
FastAPI Deployment with Monitoring
Deploying DSPy in production isn't just about serving predictions - it's about observability, reliability, and performance. This FastAPI setup includes everything you need for production: health checks, metrics, error tracking, and graceful degradation. Let's build a production-ready service:
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel
import dspy
import time
import asyncio
import logging
from prometheus_client import Counter, Histogram, generate_latest
import sentry_sdk
# Initialize monitoring
sentry_sdk.init(dsn="your-sentry-dsn")
# Prometheus metrics
request_count = Counter('dspy_requests_total', 'Total requests')
request_duration = Histogram('dspy_request_duration_seconds', 'Request duration')
error_count = Counter('dspy_errors_total', 'Total errors')
app = FastAPI(title="DSPy Production Service")
# Load optimized program
program = dspy.Module.load("models/production_model_v1.json")
class PredictionRequest(BaseModel):
question: str
context: str = None
max_tokens: int = 500
temperature: float = 0.7
class PredictionResponse(BaseModel):
answer: str
confidence: float
sources: list[str] = []
latency_ms: float
model_version: str
@app.post("/predict", response_model=PredictionResponse)
async def predict(
request: PredictionRequest,
background_tasks: BackgroundTasks
):
"""Main prediction endpoint with monitoring."""
start_time = time.time()
request_count.inc()
try:
# Input validation
if len(request.question) > 1000:
raise HTTPException(400, "Question too long (max 1000 chars)")
# Run prediction with timeout
result = await asyncio.wait_for(
asyncio.to_thread(
program,
question=request.question,
context=request.context
),
timeout=30.0
)
# Calculate metrics
latency = (time.time() - start_time) * 1000
request_duration.observe(latency / 1000)
# Log for analysis
background_tasks.add_task(
log_prediction,
request=request,
response=result,
latency=latency
)
return PredictionResponse(
answer=result.answer,
confidence=getattr(result, 'confidence', 0.95),
sources=getattr(result, 'sources', []),
latency_ms=latency,
model_version="v1.2.0"
)
except asyncio.TimeoutError:
error_count.inc()
raise HTTPException(504, "Request timeout")
except Exception as e:
error_count.inc()
sentry_sdk.capture_exception(e)
raise HTTPException(500, f"Prediction failed: {str(e)}")
@app.get("/health")
async def health():
"""Health check endpoint."""
return {"status": "healthy", "model_loaded": program is not None}
@app.get("/metrics")
async def metrics():
"""Prometheus metrics endpoint."""
return generate_latest()
async def log_prediction(request, response, latency):
"""Background task for logging predictions."""
logging.info({
"event": "prediction",
"question_length": len(request.question),
"answer_length": len(response.answer),
"latency_ms": latency,
"confidence": response.confidence
})
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: dspy-service
labels:
app: dspy-service
spec:
replicas: 3
selector:
matchLabels:
app: dspy-service
template:
metadata:
labels:
app: dspy-service
spec:
containers:
- name: dspy-app
image: your-registry/dspy-service:v1.2.0
ports:
- containerPort: 8000
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: openai-secret
key: api-key
- name: REDIS_URL
value: "redis://redis-service:6379"
resources:
requests:
memory: "2Gi"
cpu: "1000m"
limits:
memory: "4Gi"
cpu: "2000m"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: dspy-service
spec:
selector:
app: dspy-service
ports:
- port: 80
targetPort: 8000
type: LoadBalancer
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: dspy-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: dspy-service
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
MLflow Integration for Experiment Tracking
import mlflow
import mlflow.pyfunc
import dspy
from datetime import datetime
class DSPyMLflowWrapper(mlflow.pyfunc.PythonModel):
"""Wrapper to deploy DSPy models with MLflow."""
def load_context(self, context):
"""Load the DSPy program."""
self.program = dspy.Module.load(context.artifacts["dspy_model"])
# Configure DSPy
dspy.configure(
lm=dspy.LM('openai/gpt-4'),
cache=True
)
def predict(self, context, model_input):
"""Run predictions."""
predictions = []
for _, row in model_input.iterrows():
result = self.program(**row.to_dict())
predictions.append({
"answer": result.answer,
"confidence": getattr(result, 'confidence', 0.95)
})
return predictions
def train_and_log_model(train_data, test_data):
"""Complete MLflow training pipeline."""
mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("DSPy Production Models")
with mlflow.start_run() as run:
# Log parameters
mlflow.log_param("optimizer", "MIPROv2")
mlflow.log_param("auto_mode", "medium")
mlflow.log_param("train_size", len(train_data))
mlflow.log_param("model_version", "v1.2.0")
# Train model
base_program = YourDSPyProgram()
optimizer = MIPROv2(
metric=your_metric,
auto="medium",
track_stats=True
)
optimized_program = optimizer.compile(
base_program,
trainset=train_data
)
# Evaluate
evaluator = dspy.evaluate.Evaluate(
devset=test_data,
metric=your_metric,
num_threads=8
)
results = evaluator(optimized_program)
# Log metrics
mlflow.log_metric("accuracy", results['metric'])
mlflow.log_metric("avg_latency_ms", results['avg_latency'] * 1000)
# Save and log model
model_path = f"models/model_{datetime.now():%Y%m%d_%H%M%S}.json"
optimized_program.save(model_path)
# Log with MLflow
mlflow.pyfunc.log_model(
artifact_path="dspy_model",
python_model=DSPyMLflowWrapper(),
artifacts={"dspy_model": model_path},
input_example={"question": "Sample question"},
signature=mlflow.models.infer_signature(
{"question": "Sample"},
{"answer": "Sample answer", "confidence": 0.95}
),
registered_model_name="dspy_production_model"
)
return run.info.run_id
# Deploy the model
def deploy_model(run_id):
"""Deploy model from MLflow."""
client = mlflow.tracking.MlflowClient()
# Transition to production
client.transition_model_version_stage(
name="dspy_production_model",
version=1,
stage="Production"
)
# Load for serving
model = mlflow.pyfunc.load_model(
f"models:/{model_name}/Production"
)
return model
Common Pitfalls and How to Avoid Them
Learn from common mistakes to accelerate your DSPy mastery. Here are the most frequent pitfalls and their solutions.
1. Overcomplicating Signatures
❌ Mistake:
Creating overly complex signatures with too many fields from the start.
✅ Solution:
Start with simple signatures like "question -> answer" and add complexity only when needed. Iterate based on results.
2. Insufficient Training Data
❌ Mistake:
Using too few examples (less than 10) or non-representative data for optimization.
✅ Solution:
Aim for at least 50 examples for basic optimization, 300+ for best results. Ensure examples cover edge cases and reflect real-world distribution.
3. Ignoring Cost and Latency
❌ Mistake:
Running expensive optimizers with large models without considering costs.
✅ Solution:
Start optimization with smaller models (GPT-3.5), use aggressive caching, monitor token usage, and only scale up when necessary.
4. Poor Metric Design
❌ Mistake:
Using simplistic metrics that don't capture actual requirements.
✅ Solution:
Design multi-faceted metrics that consider accuracy, relevance, conciseness, and domain-specific requirements. Test metrics manually before optimization.
5. Not Testing Consistency
❌ Mistake:
Assuming one successful run means the system always works.
✅ Solution:
Run each test example multiple times. Use proper train/val/test splits. Monitor performance over time in production.
6. Premature Optimization
❌ Mistake:
Starting with complex optimizers like MIPROv2 before establishing baselines.
✅ Solution:
Follow this progression: Basic Predict → BootstrapFewShot → MIPROv2. Establish baselines at each stage.
7. Coupling Logic with Prompts
❌ Mistake:
Mixing business logic with prompt-specific implementation details.
✅ Solution:
Keep signatures declarative. Let DSPy handle prompt generation. Focus on defining what you want, not how to ask for it.
Real-World Case Studies
Learn how leading companies are using DSPy in production to solve real problems at scale.
JetBlue: Customer Service Automation
JetBlue uses DSPy for multiple chatbot applications across their customer service infrastructure. By leveraging DSPy's automatic optimization, they achieved:
- 60% reduction in prompt engineering time
- 35% improvement in customer satisfaction scores
- Seamless migration between different LLM providers
- Consistent performance across multiple languages
Replit: Automated Code Review
Replit employs DSPy pipelines to synthesize code diffs and provide intelligent code suggestions:
- Automated generation of code review comments
- Context-aware code suggestions
- 40% faster review cycles
- Reduced false positives by 75% using DSPy assertions
VMware: Enterprise RAG Systems
VMware has implemented DSPy for retrieval-augmented generation in their enterprise documentation systems:
- Processing 100,000+ technical documents
- 90% accuracy in technical query responses
- Automatic prompt optimization for different document types
- Multi-model ensemble for critical queries
Healthcare: Medical Report Analysis
Companies like Salomatic use DSPy for enriching and analyzing medical reports:
- HIPAA-compliant processing pipelines
- 99.9% reliability with assertion-based validation
- Automatic adaptation to different report formats
- 50% reduction in manual review time
Troubleshooting and Debugging
When things go wrong (and they will), here's how to diagnose and fix common issues.
Enabling Debug Mode
import dspy
import logging
# Enable detailed logging
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("dspy").setLevel(logging.DEBUG)
# Configure DSPy with verbose mode
dspy.configure(
lm=dspy.LM('openai/gpt-4'),
experimental=True, # Enable experimental features
verbose=True # Show detailed output
)
# Inspect prompt history
def debug_last_call():
"""Inspect the last DSPy call."""
history = dspy.inspect_history(n=1)
for item in history:
print("=" * 50)
print(f"Prompt: {item['prompt']}")
print(f"Response: {item['response']}")
print(f"Tokens: {item.get('tokens', 'N/A')}")
print(f"Latency: {item.get('latency', 'N/A')}ms")
# Use in your code
result = your_program(input="test")
debug_last_call()
Common Error Patterns and Solutions
Context Length Exceeded
# Error: Context length exceeded
# Solution 1: Reduce few-shot examples
teleprompter = BootstrapFewShot(
metric=your_metric,
max_bootstrapped_demos=2, # Reduce from default
max_labeled_demos=2
)
# Solution 2: Truncate contexts
class TruncatedRAG(dspy.Module):
def forward(self, question):
context = self.retrieve(question)
# Truncate to fit context window
max_tokens = 2000
truncated_context = context[:max_tokens]
return self.generate(context=truncated_context, question=question)
Assertion Failures
# Debug assertion failures
class DebuggableModule(dspy.Module):
def forward(self, input):
try:
result = self.process(input)
# Add debug info before assertion
print(f"Result before assertion: {result}")
dspy.Assert(
your_condition(result),
f"Assertion failed. Result: {result}"
)
return result
except dspy.AssertionError as e:
print(f"Assertion error: {e}")
print(f"Input was: {input}")
print(f"Trace: {dspy.inspect_history(n=1)}")
# Try with suggestions instead
dspy.Suggest(
your_condition(result),
"Consider improving the result"
)
return result
Optimization Not Improving
# Diagnose optimization issues
def diagnose_optimization(program, train_data, metric):
"""Diagnose why optimization isn't working."""
# Test metric on training data
print("Testing metric on training data...")
scores = []
for example in train_data[:5]:
pred = program(**example.inputs())
score = metric(example, pred)
scores.append(score)
print(f"Example: {example.question[:50]}...")
print(f"Prediction: {pred.answer[:50]}...")
print(f"Score: {score}\n")
print(f"Average score: {sum(scores)/len(scores):.2f}")
# Check if metric is too strict
if sum(scores) == 0:
print("WARNING: Metric might be too strict!")
# Test with different optimizers
optimizers = [
("BootstrapFewShot", BootstrapFewShot(metric=metric)),
("MIPRO-light", MIPROv2(metric=metric, auto="light"))
]
for name, optimizer in optimizers:
print(f"\nTesting {name}...")
try:
optimized = optimizer.compile(
program.deepcopy(),
trainset=train_data[:10]
)
print(f"{name} succeeded")
except Exception as e:
print(f"{name} failed: {e}")Performance Profiling
import time
import statistics
from contextlib import contextmanager
@contextmanager
def profile_section(name):
"""Profile a code section."""
start = time.time()
try:
yield
finally:
duration = time.time() - start
print(f"{name}: {duration:.2f}s")
class PerformanceMonitor:
def __init__(self):
self.latencies = []
self.token_counts = []
def monitor(self, program, test_data):
"""Monitor program performance."""
for example in test_data:
start = time.time()
# Run prediction
result = program(**example.inputs())
# Record metrics
latency = time.time() - start
self.latencies.append(latency)
# Get token count from last call
history = dspy.inspect_history(n=1)
if history:
tokens = history[0].get('tokens', 0)
self.token_counts.append(tokens)
# Report statistics
print(f"Latency - Mean: {statistics.mean(self.latencies):.2f}s")
print(f"Latency - P95: {statistics.quantiles(self.latencies, n=20)[18]:.2f}s")
print(f"Tokens - Mean: {statistics.mean(self.token_counts):.0f}")
print(f"Tokens - Total: {sum(self.token_counts)}")
# Estimate costs (GPT-4 pricing example)
total_cost = sum(self.token_counts) * 0.00003 # $0.03 per 1K tokens
print(f"Estimated cost: ${total_cost:.2f}")Expert-Level Best Practices
These advanced patterns and practices will help you build production-grade AI systems with DSPy.
1. Modular Design Patterns
Structure your DSPy programs for maximum reusability and maintainability:
# Base components library
class DSPyComponents:
"""Reusable DSPy components library."""
@staticmethod
def create_retriever(k=5, similarity="cosine"):
"""Factory for retrievers."""
return dspy.Retrieve(k=k, similarity=similarity)
@staticmethod
def create_generator(reasoning_type="cot", signature=None):
"""Factory for generators."""
sig = signature or "context, question -> answer"
generators = {
"cot": dspy.ChainOfThought(sig),
"pot": dspy.ProgramOfThought(sig),
"react": dspy.ReAct(sig),
"predict": dspy.Predict(sig)
}
return generators.get(reasoning_type, dspy.ChainOfThought(sig))
@staticmethod
def create_validator(validation_type="json"):
"""Factory for validators."""
validators = {
"json": lambda x: is_valid_json(x),
"sql": lambda x: is_valid_sql(x),
"python": lambda x: is_valid_python(x)
}
return validators.get(validation_type)
# Configurable pipeline using composition
class ConfigurablePipeline(dspy.Module):
def __init__(self, config: dict):
super().__init__()
self.config = config
# Build pipeline from config
self.retriever = DSPyComponents.create_retriever(
k=config.get('retriever_k', 5)
)
self.generator = DSPyComponents.create_generator(
reasoning_type=config.get('reasoning', 'cot')
)
self.validator = DSPyComponents.create_validator(
validation_type=config.get('validation', 'json')
)
def forward(self, question):
# Retrieve context
context = self.retriever(question)
# Generate answer
answer = self.generator(context=context, question=question)
# Validate if configured
if self.validator and self.config.get('validate', False):
dspy.Assert(
self.validator(answer.answer),
"Validation failed",
backtrack=self.generator
)
return answer
# Usage with different configurations
qa_config = {
'retriever_k': 3,
'reasoning': 'cot',
'validate': False
}
math_config = {
'retriever_k': 1,
'reasoning': 'pot',
'validate': True,
'validation': 'python'
}
qa_pipeline = ConfigurablePipeline(qa_config)
math_pipeline = ConfigurablePipeline(math_config)
2. Multi-Stage Optimization Strategy
class ProgressiveOptimizer:
"""Multi-stage optimization with increasing complexity."""
def __init__(self, base_program, metric):
self.base_program = base_program
self.metric = metric
self.optimization_history = []
def optimize(self, train_data, stages=None):
"""Run progressive optimization."""
stages = stages or [
('baseline', self._baseline_stage),
('bootstrap', self._bootstrap_stage),
('mipro_light', self._mipro_light_stage),
('mipro_heavy', self._mipro_heavy_stage),
('ensemble', self._ensemble_stage)
]
current_program = self.base_program
best_score = 0
best_program = current_program
for stage_name, stage_func in stages:
print(f"\n🔄 Running {stage_name} optimization...")
try:
# Run optimization stage
optimized = stage_func(current_program, train_data)
# Evaluate
score = self._evaluate(optimized, train_data[:20])
print(f"✅ {stage_name} score: {score:.2%}")
# Track history
self.optimization_history.append({
'stage': stage_name,
'score': score,
'improved': score > best_score
})
# Update best if improved
if score > best_score:
best_score = score
best_program = optimized
current_program = optimized
except Exception as e:
print(f"❌ {stage_name} failed: {e}")
self.optimization_history.append({
'stage': stage_name,
'score': 0,
'error': str(e)
})
return best_program
def _baseline_stage(self, program, data):
"""No optimization - baseline."""
return program
def _bootstrap_stage(self, program, data):
"""Bootstrap few-shot examples."""
optimizer = BootstrapFewShot(
metric=self.metric,
max_bootstrapped_demos=4
)
return optimizer.compile(program, trainset=data[:50])
def _mipro_light_stage(self, program, data):
"""Light MIPRO optimization."""
optimizer = MIPROv2(
metric=self.metric,
auto="light"
)
return optimizer.compile(program, trainset=data[:100])
def _mipro_heavy_stage(self, program, data):
"""Heavy MIPRO optimization."""
optimizer = MIPROv2(
metric=self.metric,
auto="heavy",
num_threads=8
)
return optimizer.compile(program, trainset=data)
def _ensemble_stage(self, program, data):
"""Create ensemble of programs."""
programs = []
# Create variations
for i in range(3):
optimizer = BootstrapFewShot(
metric=self.metric,
max_bootstrapped_demos=4 + i
)
variant = optimizer.compile(
program.deepcopy(),
trainset=data[i*50:(i+1)*50]
)
programs.append(variant)
return EnsemblePredictor(programs)
def _evaluate(self, program, test_data):
"""Evaluate program performance."""
scores = []
for example in test_data:
try:
pred = program(**example.inputs())
score = self.metric(example, pred)
scores.append(score)
            except Exception:
scores.append(0)
return sum(scores) / len(scores) if scores else 0
# Usage
optimizer = ProgressiveOptimizer(base_program, your_metric)
best_program = optimizer.optimize(train_data)
# Analyze optimization journey
for stage in optimizer.optimization_history:
print(f"{stage['stage']}: {stage.get('score', 0):.2%} "
f"{'✅' if stage.get('improved') else '❌'}")3. Advanced Metrics with LLM Judges
class MultiCriteriaLLMJudge(dspy.Module):
"""Advanced LLM-based evaluation with multiple criteria."""
def __init__(self):
super().__init__()
self.judge = dspy.ChainOfThought(
"""question, answer, criteria ->
scores: dict, overall_score: float,
strengths: list[str], improvements: list[str]"""
)
def forward(self, question, answer, criteria):
return self.judge(
question=question,
answer=answer,
criteria=criteria
)
import ast

def create_advanced_metric(criteria_weights: dict):
"""Create weighted multi-criteria metric."""
# Define evaluation criteria
criteria = """
Evaluate the answer on these criteria (0-1 scale each):
1. Accuracy: Factual correctness
2. Completeness: Addresses all aspects of the question
3. Clarity: Clear and well-structured
4. Relevance: Stays on topic
5. Conciseness: Appropriate length without unnecessary details
"""
judge = MultiCriteriaLLMJudge()
def metric(gold, pred, trace=None):
# Get multi-criteria evaluation
evaluation = judge(
question=gold.question,
answer=pred.answer,
criteria=criteria
)
# Parse scores
try:
            scores = evaluation.scores
            if isinstance(scores, str):
                scores = ast.literal_eval(scores)  # safer than eval() on model output
# Calculate weighted score
weighted_score = sum(
scores.get(criterion, 0.5) * weight
for criterion, weight in criteria_weights.items()
)
# Normalize
total_weight = sum(criteria_weights.values())
final_score = weighted_score / total_weight
# Log detailed feedback
if trace:
trace['detailed_scores'] = scores
trace['strengths'] = evaluation.strengths
trace['improvements'] = evaluation.improvements
return final_score
except Exception as e:
print(f"Evaluation failed: {e}")
return 0.5 # Default middle score
return metric
# Create specialized metrics
accuracy_focused_metric = create_advanced_metric({
'Accuracy': 0.6,
'Completeness': 0.2,
'Clarity': 0.1,
'Relevance': 0.05,
'Conciseness': 0.05
})
user_friendly_metric = create_advanced_metric({
'Accuracy': 0.3,
'Completeness': 0.2,
'Clarity': 0.3,
'Relevance': 0.1,
'Conciseness': 0.1
})

4. Production Monitoring and Observability
import json
import time
from datetime import datetime
from typing import Dict, Any
import prometheus_client as prom
class DSPyObservability:
"""Production monitoring for DSPy applications."""
def __init__(self, service_name="dspy_service"):
self.service_name = service_name
# Prometheus metrics
self.request_counter = prom.Counter(
'dspy_requests_total',
'Total requests',
['module', 'status']
)
self.latency_histogram = prom.Histogram(
'dspy_latency_seconds',
'Request latency',
['module'],
buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)
self.token_counter = prom.Counter(
'dspy_tokens_total',
'Total tokens used',
['model', 'module']
)
self.confidence_gauge = prom.Gauge(
'dspy_confidence_score',
'Confidence scores',
['module']
)
def monitor_module(self, module: dspy.Module):
"""Wrap a DSPy module with monitoring."""
original_forward = module.forward
module_name = module.__class__.__name__
def monitored_forward(*args, **kwargs):
# Start timing
start_time = time.time()
try:
# Run original forward
result = original_forward(*args, **kwargs)
# Record success metrics
latency = time.time() - start_time
self.request_counter.labels(
module=module_name,
status='success'
).inc()
self.latency_histogram.labels(
module=module_name
).observe(latency)
# Track confidence if available
if hasattr(result, 'confidence'):
self.confidence_gauge.labels(
module=module_name
).set(result.confidence)
# Log detailed telemetry
self._log_telemetry({
'module': module_name,
'status': 'success',
'latency': latency,
'timestamp': datetime.utcnow().isoformat(),
'input_size': len(str(args) + str(kwargs)),
'output_size': len(str(result))
})
return result
except Exception as e:
# Record failure metrics
self.request_counter.labels(
module=module_name,
status='failure'
).inc()
# Log error
self._log_error({
'module': module_name,
'error': str(e),
'timestamp': datetime.utcnow().isoformat()
})
raise
module.forward = monitored_forward
return module
def _log_telemetry(self, data: Dict[str, Any]):
"""Log telemetry data."""
# Send to your logging system
print(f"TELEMETRY: {json.dumps(data)}")
def _log_error(self, data: Dict[str, Any]):
"""Log error data."""
# Send to error tracking system
print(f"ERROR: {json.dumps(data)}")
# Usage
observability = DSPyObservability()
# Wrap your modules
monitored_qa = observability.monitor_module(qa_module)
monitored_rag = observability.monitor_module(rag_module)
# Expose metrics endpoint
from flask import Flask
app = Flask(__name__)
@app.route('/metrics')
def metrics():
    return prom.generate_latest()

Conclusion: Mastering DSPy
DSPy represents a paradigm shift in how we build AI applications. By separating task definition from implementation details, it enables truly scalable and maintainable AI systems. The framework's strength lies not just in its automatic optimization capabilities, but in its clean abstractions that make complex AI pipelines understandable and modular.
Key Takeaways for Success
- Start Simple: Begin with basic signatures and modules before adding complexity. Master the fundamentals before diving into advanced features.
- Focus on Data: Invest time in creating representative training examples and meaningful metrics; the quality of your data determines the quality of your optimization (see the sketch after this list).
- Iterate Systematically: Use DSPy's optimization capabilities to improve performance automatically. Don't try to perfect everything manually.
- Design for Production: Consider caching, error handling, and monitoring from the beginning. Build with scalability in mind.
- Embrace the Philosophy: Think in terms of task specification rather than prompt crafting. Let DSPy handle the implementation details.
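To make the "Focus on Data" takeaway concrete, here is a minimal sketch of a representative trainset and metric. It assumes a question-answering task with `question`/`answer` fields and a normalized exact-match check; both are illustrative choices, not requirements of DSPy.

```python
import dspy

# A small, representative trainset: real questions paired with answers
# you would accept in production. Mark which fields are inputs.
trainset = [
    dspy.Example(
        question="What does the retriever_k setting control?",
        answer="How many passages are retrieved per query.",
    ).with_inputs("question"),
    dspy.Example(
        question="Which optimizer bootstraps few-shot demonstrations?",
        answer="BootstrapFewShot",
    ).with_inputs("question"),
]

# A metric with the (gold, pred, trace=None) signature used throughout
# this guide: normalized exact match on the answer field.
def exact_match_metric(gold, pred, trace=None):
    return gold.answer.strip().lower() == pred.answer.strip().lower()
```

A few dozen examples of this shape are enough to start with BootstrapFewShot, as the ProgressiveOptimizer above does with its first 50 training items; broaden coverage of the hard cases before reaching for heavier optimizers.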
The Journey from Newbie to Expert
Your journey with DSPy will evolve through distinct phases:
- Newbie: Understanding signatures and basic modules
- Intermediate: Building pipelines and using optimizers
- Advanced: Creating custom modules and complex metrics
- Expert: Designing production systems with monitoring and optimization
The Future is Declarative
As you progress from DSPy novice to expert, remember that the framework's true power comes from its composability and automatic optimization. By mastering these concepts and following the best practices outlined in this guide, you'll be able to build robust, scalable AI applications that adapt and improve over time.
The future of AI development is declarative, optimizable, and modular. DSPy provides the foundation for this future, enabling developers to focus on solving real problems rather than wrestling with the intricacies of prompt engineering. Whether you're building simple question-answering systems or complex multi-agent workflows, DSPy's principled approach to language model programming will serve as your guide from newbie to expert.
Welcome to the future of AI development. Welcome to DSPy.
Further Reading
Continue your DSPy journey with these essential resources:
Key Resources
- DSPy official documentation: complete documentation and API reference for the framework
- DSPy GitHub repository: source code, examples, and community contributions
- A comprehensive beginner's guide with hands-on examples
