Introduction
Last post in the series. In the first post, I covered the intelligent batching system and data flywheel that formed the conceptual core of the platform: how dual-level batching, priority cascading, context validation, and interactive data collection turned the pipeline into a system that actively improved itself over time. In the second post, I walked through the v1 SNS/SQS architecture: the event-driven pipeline that got us to production, the LLM-as-judge quality gates, the AI debugger, and the operational pain points that eventually forced me to rethink the orchestration layer.
This post is about the migration. What moved, what stayed, and why. I'll cover the key insight that made Step Functions the right tool, the hybrid boundary between SNS/SQS and Step Functions, the multi-track architecture that v2 unlocked, and the four patterns that made it all work.
The Key Insight
The v1 pain points all stemmed from a single root cause: I was using per-event infrastructure to manage per-batch workflows. Every batch flowed through the same shared queues, the same shared worker pools, and the same shared log groups as every other batch. There was no concept of process isolation. A batch wasn't a first-class entity in the infrastructure; it was just a series of messages that happened to share a correlation ID.
The insight that changed everything was simple: once a batch is ready, the remaining work is a single unit of work with a defined lifecycle. The stages are sequential. There's one parallelizable fan-out (notice generation), but the overall flow follows a predictable path: context validation, insights, recommendations, solutions, notices, editorial review, broadcast. Each stage depends on the output of the previous one. Each stage can succeed or fail. The whole thing has a start time and an end time.
That's exactly what state machines are for.
AWS Step Functions gave me what SNS/SQS couldn't: a single execution per batch, with built-in state management, visual observability, declarative retry policies, and native parallelism. The handlers stayed the same. The orchestration changed completely.
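To make that shape concrete, here's a rough sketch of a per-batch state machine in Amazon States Language, written as a Python dict you could register with boto3. The state names, Lambda ARNs, and role ARN are illustrative placeholders rather than the actual definitions; the point is the sequential lifecycle, with the fan-out and quality gates covered later in the patterns section.

```python
import json
import boto3


def lambda_arn(name):
    # Placeholder account/region; the real ARNs come from the deployment stack.
    return f"arn:aws:lambda:us-east-1:123456789012:function:{name}"


def stage(name, next_state=None):
    # One sequential pipeline stage as a Lambda Task state.
    s = {"Type": "Task", "Resource": lambda_arn(name)}
    s.update({"Next": next_state} if next_state else {"End": True})
    return s


definition = {
    "Comment": "Per-batch processing pipeline (illustrative)",
    "StartAt": "ContextValidation",
    "States": {
        "ContextValidation": stage("context-validation", "Insights"),
        "Insights": stage("insights", "Recommendations"),
        "Recommendations": stage("recommendations", "Solutions"),
        "Solutions": stage("solutions", "Notices"),
        "Notices": stage("notices", "EditorialReview"),  # a Map state in practice (Pattern 2)
        "EditorialReview": stage("editorial-review", "Broadcast"),
        "Broadcast": stage("broadcast"),
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="batch-pipeline",  # illustrative name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/batch-pipeline-role",  # placeholder
)
```

One execution of this machine is one batch, which is what makes the isolation and per-batch timing later in this post possible.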
The Hybrid Boundary
Not everything moved. That was deliberate, and getting the boundary right was one of the most important architectural calls in the migration.
The ingestion path stayed on SNS/SQS. Events arrive one-at-a-time from a variety of sources: webhooks, API calls, questionnaire responses, external integrations. Each event needs to be durably received, written to the data lake, vectorized, and assigned to the right batches. This is per-event, high-throughput, stateless, fire-and-forget work. SNS/SQS is perfect for it.
Everything downstream of BATCH_READY moved to Step Functions. Once the batch collector determines that a batch has hit its threshold and is ready for processing, the work shifts from per-event to per-batch. It's stateful, sequential, observable, and needs to be isolated. Different requirements call for different tools.
The boundary sits at BATCH_READY. Everything above it is event-driven infrastructure handling individual signals at high throughput. Everything below it is workflow orchestration handling isolated batches through a defined lifecycle. The BatchCollectorHandler is the bridge: it consumes from SQS and starts a Step Functions execution when a batch is ready.
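Here's roughly what that bridge looks like as an SQS-triggered Lambda. The message fields, environment variable, and readiness check are illustrative assumptions, not the actual BatchCollectorHandler code; the structure is the point: consume per-event messages, and when a batch crosses the threshold, hand it to Step Functions as a single execution.

```python
import json
import os
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical env var pointing at the per-batch state machine.
STATE_MACHINE_ARN = os.environ["BATCH_STATE_MACHINE_ARN"]


def handler(event, context):
    """SQS-triggered Lambda: the bridge between event-driven ingestion
    and the per-batch Step Functions workflow."""
    for record in event["Records"]:
        message = json.loads(record["body"])

        # Only batches that have hit their threshold cross the boundary.
        if message.get("status") != "BATCH_READY":
            continue

        # One execution per batch. Using the batch ID as the execution name
        # makes accidental duplicate starts fail fast (names must be unique).
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            name=f"batch-{message['batchId']}",
            input=json.dumps(message),
        )
```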
What Changed
The migration wasn't a rewrite. The Lambda handlers that did the actual work (calling the LLM, evaluating responses, writing to the database) stayed largely the same. What I changed was how I invoked, retried, sequenced, and observed those handlers. Here's how v1 and v2 compare:
| Concern | v1 (SNS/SQS) | v2 (Step Functions) |
|---|---|---|
| Process isolation | None. All batches share worker pools. | One execution per batch |
| Observability | Trace across 12 queues in CloudWatch | Single execution graph, visual in console |
| Duration tracking | Impossible to measure per-batch | Built-in start/end per batch and per stage |
| Error location | Grep logs across handlers | Failed state visible at a glance |
| Retries | Custom per-handler | Declarative retry policies on each state |
| Parallelism | Sequential queue chain | Map state for fan-out (notice generation) |
| Throughput under load | Degraded. Batches time-slice. | Isolated. Batch A doesn't slow batch B. |
Every pain point from v1 turned into a solved problem in v2, not through custom code, but through choosing the right orchestration primitive.
Multi-Track Architecture
In v1, the pipeline had a single track. Every batch went through the same stages, producing output targeted at the org admin. That was fine initially, but the platform's value proposition required reaching different audiences with different content. An admin needs org-wide analysis and strategic recommendations. A team lead needs group-level insights and actionable solutions. An end user needs personalized feedback and relevant notifications.
In v2, I introduced tracks: parallel state machines that targeted different audiences from the same batch. When a batch was ready, the Track Router inspected the batch metadata and launched the appropriate track executions. Each track got its own Step Function definition. Independent deployments. Independent failure domains. If the Individual track failed, the Admin and Team tracks kept running.
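A minimal sketch of that routing logic, assuming a hypothetical `tracks` field in the batch metadata and illustrative state machine ARNs (the real router reads its configuration from the deployment, not a hard-coded dict):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Illustrative mapping; in practice these ARNs come from config or env vars.
TRACK_STATE_MACHINES = {
    "admin": "arn:aws:states:us-east-1:123456789012:stateMachine:admin-track",
    "team": "arn:aws:states:us-east-1:123456789012:stateMachine:team-track",
    "individual": "arn:aws:states:us-east-1:123456789012:stateMachine:individual-track",
}


def route_batch(batch):
    """Inspect batch metadata and launch one execution per applicable track."""
    for track in batch.get("tracks", ["admin"]):  # hypothetical metadata field
        sfn.start_execution(
            stateMachineArn=TRACK_STATE_MACHINES[track],
            name=f"{track}-{batch['batchId']}",
            input=json.dumps({"batchId": batch["batchId"], "track": track}),
        )
```

Because each track is its own state machine, a failed execution in one track has no way to touch the others.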
The tracks weren't identical pipelines with different prompts. They had genuinely different stage compositions. The Admin track included context validation (the full org-wide analysis). The Team track skipped context validation but included role-specific solution types. The Individual track was the leanest: personalized insights, recommendations, and notices without the heavier analytical stages.
| Track | Audience | Scope | Key Differences |
|---|---|---|---|
| Admin | Org admin | Org-wide | Full pipeline with context validation, all deliverable types |
| Team | Team leads | Team-scoped | Role-specific prompts, team-level deliverables |
| Individual | End users | Personal | Personalized insights and notifications |
This design meant I could iterate on each track independently. When I wanted to add a new stage to the Team track, I deployed that Step Function definition without touching the Admin or Individual tracks. When the Individual track needed a different retry strategy because its prompts were simpler, I tuned that in isolation. The tracks were loosely coupled by design, united only by the batch that triggered them.
Step Function Patterns
Four patterns emerged during the migration that I think are broadly applicable to anyone building AI pipelines with Step Functions.
Pattern 1: Retry Loops with LLM Eval
Let the state machine handle retries so your handlers stay single-purpose and testable.
Every AI generation stage in the pipeline followed the same shape: generate content, evaluate it with an LLM judge, and either pass the result forward or retry. In v1, this retry logic was implemented ad-hoc in each handler. Some handlers retried three times. Some retried five. Some had exponential backoff. Some didn't. The logic was scattered, inconsistent, and invisible to anyone who wasn't reading the handler code.
In v2, the retry loop became a declarative state machine pattern. The Generate state invoked the Lambda handler. The Eval state invoked the LLM judge. A Choice state checked the score. If it met the threshold (0.7 or higher), the execution advanced to the next stage. If not, a retry counter was checked. If retries remained, execution looped back to Generate. If retries were exhausted, the execution failed with a structured error.
The beauty of this pattern is that there's zero retry code in the handlers. The handler does one thing: generate content and return a result. The state machine handles the retry logic, the counter management, and the failure escalation. This made the handlers simpler, more testable, and completely reusable across tracks.
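Sketched in Amazon States Language as a Python dict, the loop looks roughly like this. The JSON paths (`$.evaluation.score`, `$.retryCount`), the three-attempt budget, and the ARNs are illustrative assumptions; only the 0.7 threshold and the overall shape come from the pipeline described above. The execution input is assumed to seed `retryCount` at 0.

```python
# Generate -> Evaluate -> score check -> advance, retry, or fail.
# Illustrative ARNs and JSON paths; the real definitions differ per track.
retry_loop = {
    "StartAt": "Generate",
    "States": {
        "Generate": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:generate-insights",
            "ResultPath": "$.generated",
            "Next": "Evaluate",
        },
        "Evaluate": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:llm-judge",
            "ResultPath": "$.evaluation",
            "Next": "CheckScore",
        },
        "CheckScore": {
            "Type": "Choice",
            "Choices": [{
                "Variable": "$.evaluation.score",
                "NumericGreaterThanEquals": 0.7,
                "Next": "StagePassed",
            }],
            "Default": "CheckRetries",
        },
        "CheckRetries": {
            "Type": "Choice",
            "Choices": [{
                "Variable": "$.retryCount",
                "NumericLessThan": 3,  # illustrative retry budget
                "Next": "IncrementRetry",
            }],
            "Default": "StageFailed",
        },
        "IncrementRetry": {
            # Bump the counter with an intrinsic function, then loop back.
            "Type": "Pass",
            "Parameters": {
                "batchId.$": "$.batchId",
                "retryCount.$": "States.MathAdd($.retryCount, 1)",
            },
            "Next": "Generate",
        },
        "StagePassed": {"Type": "Succeed"},  # "Next" to the following stage in the real pipeline
        "StageFailed": {
            "Type": "Fail",
            "Error": "QualityGateExhausted",
            "Cause": "Eval score stayed below threshold after all retries",
        },
    },
}
```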
Pattern 2: Map State Fan-Out
Use Map state to parallelize notice generation and turn the biggest bottleneck into the fastest stage.
This was the single biggest throughput win of the migration. In v1, notice generation was sequential. When a batch produced twenty solutions, each solution generated one or more notices, and those notices were processed one at a time through the SQS queue. A batch with forty notice targets could take minutes just in the notice stage.
In v2, I used the Step Functions Map state. The solution stage output an array of notice targets. The Map state iterated over that array, launching a parallel execution for each target. Each parallel branch ran the full generate-eval-retry loop independently. MaxConcurrency was configurable per track. The Admin track could fan out wider than the Individual track, which typically had fewer targets anyway.
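A sketch of that Map state, with the same caveat: the field names (`$.noticeTargets`, `$.notices`), the concurrency value, and the ARNs are stand-ins rather than the real definition.

```python
# Fragment of the States section: the notice fan-out as a Map state.
# Each branch catches its own errors so one failed notice doesn't fail the Map;
# the collected results (successes and failures) flow on to Editorial.
notice_fanout = {
    "NoticeGeneration": {
        "Type": "Map",
        "ItemsPath": "$.noticeTargets",  # array emitted by the solution stage
        "MaxConcurrency": 10,            # tuned per track; Admin fans out wider
        "ResultPath": "$.notices",
        "Iterator": {
            "StartAt": "GenerateNotice",
            "States": {
                "GenerateNotice": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:generate-notice",
                    "Catch": [{
                        "ErrorEquals": ["States.ALL"],
                        "ResultPath": "$.error",
                        "Next": "NoticeFailed",
                    }],
                    "Next": "EvaluateNotice",
                },
                "EvaluateNotice": {
                    # In the real pipeline this branch runs the full
                    # generate-eval-retry loop from Pattern 1.
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:llm-judge",
                    "End": True,
                },
                "NoticeFailed": {
                    # Return a structured failure instead of failing the branch.
                    "Type": "Pass",
                    "Result": {"status": "FAILED"},
                    "End": True,
                },
            },
        },
        "Next": "EditorialReview",
    },
}
```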
The results were dramatic. Notice generation went from being the throughput bottleneck to one of the fastest stages. And because each parallel branch had its own retry loop, a single failed notice didn't block the others. The Map state collected all results (successes and failures) and passed them to the Editorial stage, which could make informed decisions about what to include in the final broadcast.
Pattern 3: Error Handling Evolution
Move the AI debugger into a Catch state so error analysis shows up right in the execution graph.
One of the things I was most proud of in v1 was the AI debugger: when a pipeline stage failed, the error handler would perform code reflection from the stack trace, send the context to an LLM, and produce a structured analysis with a root cause, suggested fix, and a prompt you could paste into Claude to continue debugging. In v1, this lived in the StageHandler base class as a method called inline during error handling. It worked, but I ended up storing the output separately, disconnected from the pipeline flow.
In v2, the AI debugger became a Catch state in the Step Function. Every stage had a Catch configuration that routed errors to a sequence of error-handling states: record the failure, invoke the AI debugger, store the analysis, and mark the batch as FAILED. Same concept, same handler code, but now integrated into the state machine and visible in the execution graph.
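Sketched in the same style, here's a stage with a Catch configuration routing into that error-handling sequence. Handler names, ARNs, and the specific retryable errors are illustrative; the flow (record the failure, run the debugger, store the analysis, mark the batch FAILED) mirrors the description above.

```python
# Fragment of the States section: one stage plus its shared error-handling tail.
stage_with_catch = {
    "GenerateInsights": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:generate-insights",
        "Retry": [{
            # Transient infrastructure errors get declarative retries first.
            "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
            "IntervalSeconds": 5,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }],
        "Catch": [{
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",  # error type, cause, and stack trace land here
            "Next": "RecordFailure",
        }],
        "Next": "GenerateRecommendations",  # defined elsewhere in the real machine
    },
    "RecordFailure": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:record-failure",
        "Next": "AIDebugger",
    },
    "AIDebugger": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ai-debugger",
        "ResultPath": "$.debugAnalysis",  # root cause, suggested fix, continuation prompt
        "Next": "MarkBatchFailed",
    },
    "MarkBatchFailed": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:mark-batch-failed",
        "Next": "BatchFailed",
    },
    "BatchFailed": {
        "Type": "Fail",
        "Error": "BatchProcessingFailed",
        "Cause": "See $.debugAnalysis in the execution output",
    },
}
```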
The difference in operational experience was night and day. In v1, when a batch failed, I'd see the error in the handler's CloudWatch logs, then go hunting for the AI debugger output in the batch progress record. In v2, I opened the execution, saw the red state, and the debugger output was right there in the state's output payload. Error, analysis, context. All in one place.
Pattern 4: Process Isolation = Measurability
One execution per batch gives you exact timing, visual debugging, and zero cross-batch interference.
Less of a pattern, more of a consequence. But it was the most impactful change in day-to-day operations. In v1, I couldn't answer the question "how long does it take to process a batch?" The batch wasn't a unit of execution. It was a series of messages flowing through shared infrastructure, interleaved with messages from every other batch. Measuring anything per-batch required correlating timestamps across twelve log groups and hoping none of them had gaps.
In v2, every execution gave me exact duration, per-stage timing, visual error location, full input/output at each state, and zero cross-batch interference. The Step Functions console showed me the execution graph with timing on every transition. I could see exactly how long each stage took, compare batch durations side by side, and click into any execution to understand why one was slower than another.
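If you want those numbers outside the console, the execution record and history expose them directly. A minimal boto3 sketch (the entered/exited pairing assumes each state name appears once per execution, which the retry loops above can violate, so treat it as a starting point):

```python
import boto3

sfn = boto3.client("stepfunctions")


def batch_duration(execution_arn):
    """Exact per-batch duration from the execution record (finished executions only)."""
    ex = sfn.describe_execution(executionArn=execution_arn)
    return (ex["stopDate"] - ex["startDate"]).total_seconds()


def stage_timings(execution_arn):
    """Per-stage timing by pairing StateEntered/StateExited events by state name."""
    events = sfn.get_execution_history(
        executionArn=execution_arn, maxResults=1000
    )["events"]
    entered, timings = {}, {}
    for e in events:
        if e["type"].endswith("StateEntered"):
            entered[e["stateEnteredEventDetails"]["name"]] = e["timestamp"]
        elif e["type"].endswith("StateExited"):
            name = e["stateExitedEventDetails"]["name"]
            if name in entered:
                timings[name] = (e["timestamp"] - entered[name]).total_seconds()
    return timings
```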
Debugging went from "grep CloudWatch across 12 log groups and correlate by batch ID" to "open the execution, see the red state, click it." That single change probably saved me more time than any other part of the migration.
Performance Impact
The performance story is straightforward. In v1, throughput degraded under load. When multiple HIGH priority batches arrived during a spike (say, a wave of events after a configuration change), they all competed for the same Lambda concurrency across the same SQS queues. The pipeline didn't fail, but it slowed down. Batches that normally processed in minutes could take hours during a spike.
In v2, spikes meant more concurrent Step Functions executions, not slower individual batches. Batch A ran in its own execution. Batch B ran in its own execution. They shared nothing. If ten HIGH priority batches arrived simultaneously, ten executions launched simultaneously, and each one completed in roughly the same time as if it were the only batch in the system.
This was a fundamental shift. The system went from "more load equals slower processing" to "more load equals more parallel processing at consistent speed."
During HIGH priority spikes, v1 processing times were measured in hours; in v2, they were measured in minutes. The individual batch duration stayed roughly constant regardless of system load. That's the kind of performance characteristic you want in a system that processes time-sensitive events.
Why This Worked
Looking back at the migration, I distilled six lessons that I think apply broadly to anyone building event-driven AI pipelines.
- Start with SNS/SQS. For rapid iteration and early-stage development, the simple mental model of publish-subscribe with queues is hard to beat. It gets you to production fast and forces good decoupling habits.
- Recognize the boundary. Per-event work and per-batch work are fundamentally different problems. The ingestion path is high-throughput, stateless, and fire-and-forget. The processing path is stateful, sequential, and needs observability. Use different tools for different problems.
- Migrate when the shape stabilizes. I didn't start with Step Functions because the pipeline stages were still changing weekly in v1. Once the stages settled and the lifecycle became predictable, I migrated in a few days: same handlers, better orchestration.
- Use tracks for audience targeting. Independent Step Function definitions per audience give you independent failure domains, independent deployment cycles, and the freedom to evolve each track at its own pace.
- LLM-as-judge at every stage. Quality gates with declarative retry loops are a natural fit for state machines. The handlers stay simple. The orchestration handles the complexity.
- Make errors useful. The AI debugger pattern (structured error analysis with root cause, suggested fix, and continuation prompt) becomes even more powerful when it's integrated into the state machine and visible in the execution graph.
None of this was revolutionary technology. I composed standard AWS services (SNS, SQS, Step Functions, Lambda) thoughtfully to match the shape of the problem. v1 was the right architecture for the stage of the product. v2 was the right architecture for the stage that followed. The migration wasn't about replacing a bad system with a good one. It was about recognizing when the requirements had outgrown the tooling and picking the right next step.
Happy coding!
