Building an Event-Driven AI Pipeline with SNS/SQS

Tags: AWS · SNS · SQS · Lambda · Architecture · AI Pipeline · Serverless
2025-12-01

Introduction

Event in, event out. That was the entire philosophy behind the first production version of a multi-stage AI pipeline I built on AWS. The system ingested domain events (user feedback, support transcripts, survey responses, system alerts) and transformed them into actionable insights, tailored recommendations, and ready-to-use deliverables like documents, action plans, and structured reports. Every piece of generated content passed through LLM-as-judge quality gates before a human ever saw it.

If you want the full picture of how events were batched by priority, tenant, and user, and how the data flywheel continuously improved output quality, check out the companion post, Intelligent Batching & the Data Flywheel. This post focuses on the first production implementation: a fully event-driven pipeline built entirely on AWS SNS and SQS, where each processing stage was its own Lambda handler consuming from its own queue and publishing completion events to the next topic in the chain.

No shared state between stages. No orchestrator. Just events flowing through queues. Every handler was a pure function of its input message.

The v1 Architecture

I built the v1 pipeline on a single architectural principle: each stage is an independently deployable Lambda handler that reads from an SQS queue and publishes a completion event to an SNS topic. The SNS topic fans out to the next stage's queue. No stage knows about any other stage. No stage cares where its input came from or where its output goes.

This made the system remarkably easy to reason about in isolation. Each handler had one job. I could test it with a sample SQS message, deploy it independently, and monitor its queue depth to understand load. The full pipeline emerged from the composition of these simple, independent units.
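
To make the pattern concrete, here's a minimal sketch of what a stage handler boiled down to. The topic ARN, payload shape, and the body of process are placeholders, and it assumes raw message delivery is enabled on the SNS-to-SQS subscription so the queue delivers the bare payload.

```python
# Minimal sketch of the per-stage pattern. Names and payload fields are
# illustrative; assumes RawMessageDelivery on the SNS -> SQS subscription.
import json
import os

import boto3

sns = boto3.client("sns")
NEXT_TOPIC_ARN = os.environ["NEXT_TOPIC_ARN"]  # the downstream stage's topic


def handler(event, context):
    """Consume messages from this stage's SQS queue and publish completions."""
    for record in event["Records"]:  # SQS batch delivered to the Lambda
        message = json.loads(record["body"])
        result = process(message)  # this stage's single responsibility
        sns.publish(
            TopicArn=NEXT_TOPIC_ARN,
            Message=json.dumps({"batchId": message["batchId"], "result": result}),
        )


def process(message: dict) -> dict:
    # Placeholder for the stage's business logic (e.g. an LLM call).
    return {"status": "COMPLETED"}
```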

A few things to note about this diagram. The solution throttle at the top was a rate limit: a maximum of 100 solutions generated per 24-hour window per tenant. Batches that arrived after hitting the cap were held in a PENDING state until the window reset. This was critical for managing LLM API costs during event spikes.
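
The check itself was small; the hard part was the bookkeeping around it. A minimal sketch of the window logic, assuming the caller loads the tenant's recent generation timestamps from the batch store and holds over-cap batches in PENDING:

```python
# Sketch of the per-tenant throttle window. How timestamps are stored and how
# PENDING batches are parked is left to the caller.
from datetime import datetime, timedelta, timezone

SOLUTION_CAP_PER_24H = 100


def may_generate_solution(recent_timestamps: list[datetime],
                          now: datetime | None = None) -> bool:
    """True if the tenant is still under the 100-solutions-per-24h cap."""
    now = now or datetime.now(timezone.utc)
    window_start = now - timedelta(hours=24)
    in_window = [t for t in recent_timestamps if t >= window_start]
    return len(in_window) < SOLUTION_CAP_PER_24H
```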

The priority routing split HIGH/MEDIUM events into the main pipeline for full generation. LOW priority events went through a trend analysis that could promote them to HIGH if it detected a significant pattern change.
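
The routing decision was equally small. A sketch, with the destination names as placeholders; the trend analysis that could promote LOW events ran as its own downstream stage:

```python
# Sketch of the priority split. Destination names are illustrative.
def route_event(event: dict) -> str:
    """Return the topic an incoming event should be published to."""
    if event.get("priority") in ("HIGH", "MEDIUM"):
        return "main-pipeline-topic"   # full generation path
    return "trend-analysis-topic"      # LOW: may be promoted to HIGH later
```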

Pipeline Stages

The diagram above shows the flow, but it helps to see each stage laid out with its specific responsibility. I found the table below useful as a quick reference when debugging or onboarding new team members. Here's the full breakdown:

What I loved about this design: adding a new stage was trivial. Create a handler, subscribe its queue to the upstream topic, publish to a new topic. No existing code changed. The pipeline grew by accretion, not modification.
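
In practice the wiring lived in infrastructure-as-code, but the whole operation amounts to one SNS subscription (plus a queue policy allowing the topic to send). A sketch with hypothetical ARNs:

```python
# Sketch of wiring a new stage in. In the real system this was declared in
# infrastructure-as-code rather than run as a script.
import boto3

sns = boto3.client("sns")


def subscribe_stage_queue(upstream_topic_arn: str, stage_queue_arn: str) -> str:
    """Fan the upstream topic out to the new stage's SQS queue."""
    # The queue's access policy must also allow this topic to SendMessage.
    response = sns.subscribe(
        TopicArn=upstream_topic_arn,
        Protocol="sqs",
        Endpoint=stage_queue_arn,
        Attributes={"RawMessageDelivery": "true"},  # deliver the bare payload
    )
    return response["SubscriptionArn"]
```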

LLM-as-Judge Quality Gates

I put every AI generation stage in the pipeline through an LLM evaluation gate before it could publish its completion event. This applied to the InsightHandler, RecommendationHandler, SolutionHandler, NoticeHandler, and EditorialHandler. The generation model produced the content, and then a separate model call scored that content against five criteria, each on a 0 to 1 scale. All five had to score at or above 0.7 for the stage to pass.

This wasn't a self-evaluation. The generation prompt and the evaluation prompt were completely separate. The evaluator had no knowledge of the generation instructions. It only saw the output and the evaluation rubric.

I designed this separation specifically to prevent the kind of self-reinforcing bias you get when a model grades its own work against its own instructions.
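
Here's roughly what the gate check looked like. Only the "No System Leakage" and "User-Actionable" names come from the real rubric; the other criterion names below are illustrative, and the separate evaluator call that produces the scores is assumed.

```python
# Sketch of the quality gate. Three of the five criterion names are
# illustrative; the scores dict is assumed to come from a separate evaluator
# model that never sees the generation prompt.
PASS_THRESHOLD = 0.7
CRITERIA = [
    "accuracy",            # illustrative
    "relevance",           # illustrative
    "clarity",             # illustrative
    "no_system_leakage",
    "user_actionable",
]


def passes_quality_gate(scores: dict[str, float]) -> bool:
    """All five criteria must score at or above 0.7 for the stage to pass."""
    return all(scores.get(criterion, 0.0) >= PASS_THRESHOLD for criterion in CRITERIA)
```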

Evaluation Criteria

I designed the five criteria around what mattered most to end users and stakeholders. These were people who needed to trust the output enough to act on it:

The "No System Leakage" and "User-Actionable" criteria deserve a callout. Early in development, I found that LLMs had a tendency to leak implementation details into user-facing content. Think phrases like "based on your vectorized event embeddings" or "consider upgrading your batch processing configuration."

These criteria caught those leaks before they reached production. They were born from real failures, not theoretical concerns.

When a stage failed its quality gate, the batch was marked as FAILED and the AI debugger (covered in the next section) kicked in to provide diagnostic information.

I intentionally chose not to auto-retry on quality failures. If the model produced low-quality output, retrying with the same prompt and context rarely improved things. Better to surface the failure, diagnose it, and fix the root cause.

AI Debugger

I built every handler in the pipeline to inherit from a StageHandler base class that provided unified error handling across all stages. When any stage threw an exception, whether from an LLM API timeout, a malformed event, a quality gate failure, or any other runtime error, the AI debugger ran automatically as part of the error handling flow.

The debugger extracted the source code referenced in the stack trace (code reflection), bundled it with the error message and batch context, and sent it to a fast, cheap LLM call with a low temperature (0.3). The result was structured diagnostic output.
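
The code reflection piece is easy to sketch with the standard library; the call_llm helper below is hypothetical and stands in for the fast, cheap model call.

```python
# Sketch of the debugger flow. call_llm is a hypothetical helper for the
# fast, cheap LLM call; everything else uses the standard library.
import linecache
import traceback


def collect_context(exc: Exception, lines_around: int = 10) -> str:
    """Pull the source surrounding each frame in the exception's traceback."""
    snippets = []
    for frame in traceback.extract_tb(exc.__traceback__):
        start = max(frame.lineno - lines_around, 1)
        source = "".join(
            linecache.getline(frame.filename, n)
            for n in range(start, frame.lineno + lines_around + 1)
        )
        snippets.append(f"# {frame.filename}:{frame.lineno}\n{source}")
    return "\n".join(snippets)


def diagnose(exc: Exception, batch_context: dict) -> dict:
    prompt = (
        "Diagnose this pipeline failure.\n"
        f"Error: {exc!r}\n"
        f"Batch: {batch_context}\n"
        f"Source:\n{collect_context(exc)}"
    )
    # Low temperature keeps the diagnostic output stable and to the point.
    return call_llm(prompt, temperature=0.3)
```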

Debugger Outputs

I structured the output into four fields, each serving a distinct purpose:

  • summary: A one-to-two sentence plain language description of what went wrong. This was what showed up in dashboards and Slack notifications. Designed for a quick glance: is this urgent or can it wait?
  • rootCause: A detailed technical analysis of why the error occurred, including which code path was involved and what conditions triggered the failure. This was for the engineer who'd actually fix it.
  • suggestedFix: A step-by-step remediation plan written for a human developer. Not always perfect, but it consistently pointed you in the right direction and saved the first 15 minutes of any debugging session.
  • claudePrompt: A pre-formatted prompt you could copy and paste directly into Claude Code (or any AI coding assistant) to start working on the fix. It included the relevant file paths, the error context, and the suggested approach. This was the most popular feature on the team, and it turned a failed batch into a two-minute fix cycle.
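
As a record, the output was just four string fields. A sketch using the field names above; everything else is illustrative:

```python
# Structured diagnostic record. Field names follow the four outputs above.
from dataclasses import dataclass


@dataclass
class DebuggerOutput:
    summary: str        # short plain-language description for dashboards/Slack
    rootCause: str      # detailed technical analysis for the fixing engineer
    suggestedFix: str   # step-by-step remediation plan
    claudePrompt: str   # ready-to-paste prompt for an AI coding assistant
```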

Design Constraints

I designed the debugger with one overriding constraint: it must never fail itself. If the debugging LLM call timed out or returned garbage, the system fell back to storing the raw error and stack trace. If code reflection failed to extract source files, it sent the stack trace alone.

Every step had a fallback, and every fallback produced something useful. The worst case was always "you get the same information you would've had without the debugger." Never "the debugger crashed and now you have nothing."
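
The fallback chain looked roughly like this, reusing the hypothetical diagnose and collect_context sketches from above:

```python
# Sketch of the "never fail itself" fallbacks. diagnose/collect_context are
# the hypothetical helpers from the earlier sketch.
def safe_diagnose(exc: Exception, batch_context: dict) -> dict:
    try:
        return diagnose(exc, batch_context)  # full LLM diagnosis
    except Exception:
        pass
    try:
        # LLM unavailable or returned garbage: keep the raw error + source.
        return {"summary": repr(exc), "trace": collect_context(exc)}
    except Exception:
        # Worst case: the same raw error you'd have had without the debugger.
        return {"summary": repr(exc)}
```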

I stored all diagnostic output directly in the batch progress record. I could query any failed batch and immediately see what happened, why, and how to fix it. This turned the batch progress table into a living incident log that was genuinely useful, not just a graveyard of stack traces.
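
Persisting the diagnostics was a single write to the progress record. A sketch assuming DynamoDB and a hypothetical batch-progress table; the real schema and store may have differed:

```python
# Sketch of attaching diagnostics to the batch progress record.
# Table name and attribute names are illustrative.
import boto3

table = boto3.resource("dynamodb").Table("batch-progress")


def record_failure(batch_id: str, stage: str, diagnostics: dict) -> None:
    table.update_item(
        Key={"batchId": batch_id},
        UpdateExpression="SET #s = :failed, failedStage = :stage, diagnostics = :d",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={
            ":failed": "FAILED",
            ":stage": stage,
            ":d": diagnostics,
        },
    )
```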

What Worked

The v1 architecture served me well for months in production. Several aspects of the design proved to be genuinely strong decisions:

  • Rapid iteration: Each stage was independently deployable. I could update the InsightHandler's prompt, deploy it in isolation, and watch the results flow through without touching any other stage. During the first few months, I was iterating on prompts daily, and the ability to deploy a single handler without redeploying the entire pipeline was essential. No coordination, no deployment windows, no risk of breaking unrelated stages.
  • Natural backpressure: SQS gave me built-in buffering during traffic spikes. If a burst of events came in, say a tenant triggering a bulk import that generated dozens of events in minutes, the queues absorbed the spike. The handlers processed at their own pace. I never had to build a custom rate limiter or circuit breaker for inter-stage communication. SQS just handled it.
  • Simple mental model: Event in, event out. Every handler was a pure function of its input message. New team members could understand what any individual stage did within minutes of reading its code. The complexity lived in the AI prompts and business logic, not in the infrastructure plumbing. That was a deliberate choice. I wanted engineers spending their time on the hard problems, like making LLM output reliable and useful, rather than on distributed systems coordination.

Where v1 Broke Down

As the system matured and traffic grew, the cracks in the v1 architecture became impossible to ignore. Each problem was manageable in isolation, but together they painted a clear picture: I'd outgrown the pure SNS/SQS model.

No Process Isolation

This was the killer. During HIGH priority spikes, all SQS consumers competed for the same Lambda concurrency pool. A burst of high-priority events would starve the InsightHandler and SolutionHandler of compute, even though those stages were processing completely unrelated batches.

Different batches were effectively time-sliced across shared workers with no isolation guarantees. A batch that should've taken 90 seconds could take 8 minutes because it kept losing its Lambda slots to other batches.

Unmeasurable

Answering "how long does a batch take end-to-end?" required tracing across 12 separate queue/handler pairs in CloudWatch. I had to manually correlate log entries by batch ID. I built custom tooling to do this, but it was fragile and always lagged reality.

Answering "where did this batch fail?" was even worse. I had to check every handler's logs to find the one that dropped it. There was no single pane of glass for a batch's lifecycle.

Sequential Bottleneck

Notice generation processed one at a time through its queue. If a batch produced six notices (common for complex events), they queued up and processed sequentially even though they were completely independent of each other.

The same was true for solutions. I was leaving obvious parallelism on the table because the queue-per-stage model forced sequential processing within each stage.

Ad-hoc Retries

Each handler implemented its own retry and backoff logic for LLM API calls. Some used exponential backoff. Some used fixed delays. Some had three retries, some had five. There was no consistency.

When I wanted to change the retry strategy globally, I had to update every handler individually. The retry logic was also interleaved with the business logic, making both harder to test and reason about.
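
To make the inconsistency concrete, here's the kind of retry wrapper each handler carried its own slightly different copy of; attempt counts, delays, and which exceptions counted as retryable all varied from handler to handler.

```python
# Sketch of the ad-hoc retry logic each handler reimplemented for LLM calls.
import time
from functools import wraps


def retry(max_attempts: int = 3, base_delay: float = 1.0):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        return wrapper
    return decorator
```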

The Core Issue

It all came down to one thing: SNS/SQS gives you decoupled stages but no process isolation per unit of work. Every batch shares the same compute, the same queues, and the same concurrency limits as every other batch.

When your pipeline processes a dozen independent batches and each batch flows through a dozen stages, you need isolation at the batch level, not just at the stage level. That realization is what drove the migration to Step Functions, which I cover in detail in From SNS/SQS to Step Functions.

Conclusion

The v1 SNS/SQS architecture was the right starting point. It let me build, ship, and iterate on a complex multi-stage AI pipeline without getting bogged down in orchestration complexity. The event-driven model forced clean boundaries between stages. Those boundaries survived intact through the eventual migration to Step Functions.

If I were building a similar system from scratch today, I'd still start with this approach. The rapid iteration speed and simplicity of the pure event-driven model are invaluable when you're still figuring out what your pipeline stages should even be. The LLM-as-Judge quality gates and AI debugger both proved their worth from day one. They carried forward unchanged into v2.

The lesson isn't that SNS/SQS was wrong. It's that every architecture has a scaling boundary, and recognizing when you've crossed it matters more than picking the perfect architecture on day one.

Happy coding!