The Pipeline Is Dead: How I Replaced Months of AWS Engineering with Natural Language

AI
Architecture
Data Pipeline
SWEny
LLM
Agentic Workflows
Open Source
2026-03-30


Introduction

I spent months building a production AI pipeline. Two full architecture versions. Twelve Lambda handlers chained through SNS/SQS, then migrated to Step Functions. Custom dual-level batching with priority cascading tuned from real usage data. LLM-as-judge quality gates at every generation stage. An AI debugger that performed code reflection on stack traces and produced structured root-cause analyses. A complete migration from a fully event-driven SNS/SQS topology to a hybrid Step Functions orchestration layer with multi-track fan-out. Three blog posts documenting the whole journey. Then I built an open-source library called SWEny, gave it a natural language description of what I wanted, and had something functionally equivalent running in under a day. This post is about what that means.

What Months of Pipeline Engineering Looks Like

Before I explain why I think the traditional pipeline is dead, you need to understand what I am actually killing. This was not a toy. It was months of careful, iterative engineering in production, and I documented every stage of it.

The v1 architecture was built on a single philosophy: event in, event out. Twelve stages, each its own Lambda handler consuming from its own SQS queue and publishing a completion event to the next SNS topic in the chain. No stage knew about any other stage. No shared state. No orchestrator. Just events flowing through queues, each handler a pure function of its input message. I wrote about that architecture in detail in Building an Event-Driven AI Pipeline with SNS/SQS.

Underneath that pipeline sat a batching system I had tuned from months of real usage patterns. Every incoming event was assigned to both tenant-level and user-level batches, each with priority thresholds calibrated to actual operational needs. HIGH priority (a critical alert): one event fires the pipeline immediately. MEDIUM priority (a routine feedback cluster): accumulate 100 events before processing. LOW priority (trend analysis): wait for 2,000 events to build enough signal for meaningful patterns.

Priority cascading meant a single HIGH event would promote an entire pending batch, pulling queued MEDIUM and LOW events along with it. I covered that system, along with the data flywheel that made the whole thing compound over time, in Intelligent Batching & the Data Flywheel.
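For concreteness, the thresholds and cascading behavior amounted to something like the illustrative configuration below. This is my shorthand for this post, not the actual v1 code or its config format.

# Illustrative only: batch priority thresholds as described above.
batch_priorities:
  HIGH:
    threshold: 1        # a critical alert fires the pipeline immediately
    cascades: true      # promotes any pending MEDIUM and LOW batches along with it
  MEDIUM:
    threshold: 100      # routine feedback clusters accumulate before processing
  LOW:
    threshold: 2000     # trend analysis waits for enough signal to find patterns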

Then came the realization that broke the architecture open: I was using per-event infrastructure to manage per-batch workflows. Every batch flowed through the same shared queues, the same shared worker pools, the same shared log groups. There was no process isolation. A batch was just a series of messages that happened to share a correlation ID.

I could not measure how long a batch took to process. I could not isolate a failure to a single batch without grepping twelve log groups. Retries were ad-hoc, per-handler, inconsistent.

So I migrated. The hybrid boundary landed at BATCH_READY: everything upstream (event ingestion, data lake writes, vectorization, batch collection) stayed on SNS/SQS where high-throughput, stateless, fire-and-forget semantics were the right fit. Everything downstream moved to Step Functions, where each batch became its own isolated execution with built-in state management, visual observability, and declarative retry policies.

The migration introduced multi-track processing: Admin, Team, and Individual tracks running as independent Step Function definitions with independent failure domains. Map state fan-out for parallel notice generation. LLM-eval retry loops expressed declaratively in the state machine instead of scattered through handler code. I documented that migration in From SNS/SQS to Step Functions.

It was good engineering. The v1 architecture got the product to market. The v2 architecture solved every operational pain point that v1 exposed. The batching system was sophisticated, tuned, and battle-tested. The quality gates caught bad generations before users ever saw them. I was proud of it.

It was also months. Months of designing event schemas, wiring SNS topics to SQS queues, tuning batch thresholds, building custom retry logic, migrating to Step Functions, defining state machines in ASL, configuring Map states, testing failure modes, and writing three blog posts to explain it all. Months of work that I now believe was the last generation of how we will build these systems.

The Question That Changed Everything

Here is the question that broke my mental model: What if each step in your pipeline wasn't a function, but an agent?

I didn't arrive at this from reading a blog post or watching a conference talk. I arrived at it from operating the systems I just described. After months of building, migrating, and maintaining two generations of AI pipeline architecture, I started noticing a pattern that I couldn't unsee: the overwhelming majority of the codebase wasn't doing anything intelligent.

It was routing. It was retrying. It was orchestrating around the LLM calls. The SNS topics, the SQS queues, the Lambda handlers, the Step Function state machine definitions in ASL, the retry policies, the DLQ plumbing, the CloudWatch alarms. All of it existed to move data between the moments where the actual value happened: the reasoning.

The reasoning itself, the LLM calls that generated insights, evaluated quality, produced recommendations, was a tiny fraction of the codebase. I was writing hundreds of lines of infrastructure code to wire together what were fundamentally natural language tasks. "Look at this data and find patterns." "Evaluate whether this output meets our quality bar." "Generate a recommendation based on these insights." Each of those is a sentence. I had turned each one into a Lambda handler, an SQS queue, an SNS topic, a retry policy, and a deployment configuration.

Traditional pipelines orchestrate operations. But what if you could orchestrate reasoning instead?

That distinction is the whole insight. You still need deterministic orchestration. You still need the DAG, the routing, the retry logic, the structured data handoffs between steps. That layer does not go away. But the work inside each node should be agent reasoning, not a hardcoded function. Two layers: a deterministic shell orchestrating an intelligent core.

SWEny: Orchestrate Reasoning, Not Operations

SWEny is an open-source framework for building reliable autonomous AI workflows. The tagline is simple: Turn natural language into reliable AI workflows. It is the direct result of the two-layer insight I just described: deterministic DAG orchestration on top, agent reasoning (Claude plus tools) at each node. The orchestration layer handles routing, retries, and structured data handoffs. The agent layer handles the actual thinking.

Here is what a workflow node looks like in practice. The YAML below defines the node, its output contract, and its routing; this is exactly what SWEny executes.

A SWEny workflow node
nodes:
  triage_issues:
    name: Triage Open Issues
    instruction: >-
      Look at the most recent open issues in the repo. Identify which ones
      describe a bug versus a feature request. Report the issue numbers in
      each category and a total bug count.
    skills:
      - github
    output:
      type: object
      properties:
        bugs:
          type: array
          items:
            type: integer
        bug_count:
          type: integer
      required:
        - bugs
        - bug_count

edges:
  - from: triage_issues
    to: file_followups
    when: "bug_count is greater than 0"
  - from: triage_issues
    to: report_clean
    when: "bug_count is 0"

Look at what is happening here. The node's instruction is natural language. Not a Lambda handler. Not a function body. A plain English description of what the agent should do. The skills array controls which integrations the agent can access at runtime: GitHub, Slack, Linear, Sentry, Supabase, custom MCP servers, whatever you configure.

The output schema is the contract. SWEny instructs the agent that it must end with a JSON object matching the schema, then parses that JSON into the node's output for downstream nodes to read. The edges route conditionally based on those parsed fields, using plain English conditions that Claude evaluates at runtime. No JSONPath expressions. No Choice state syntax. Just "bug_count is greater than 0."

Think about what the traditional equivalent looks like. To build this same step in a conventional pipeline, I would write a Lambda handler (50+ lines of TypeScript with error handling), configure an SQS queue, wire an SNS topic for the completion event, implement retry logic with exponential backoff, deploy the infrastructure via CloudFormation or CDK, then monitor via CloudWatch dashboards and alarms. That is the cost of a single step. Multiply by twelve stages and you have the months of work I described in the previous sections.

With SWEny, I describe the step in plain English, define the output schema, declare the routing conditions, and run it. The framework handles execution, retries, tool access, and structured output validation. The agent reasons through the task. The orchestrator ensures the workflow proceeds correctly. Two layers, each doing what it does best.

Execution is flexible. Run workflows from the CLI with sweny workflow run for local development and testing. Integrate as a GitHub Action for CI/CD pipelines that need AI reasoning steps. Or use the programmatic TypeScript API when you need to embed workflows into a larger application. Same YAML definition, same execution semantics, multiple entry points.
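As a rough sketch of the scheduled setup, a GitHub Actions workflow that invokes the CLI weekly could look like the following. The install step, the workflow file path, and the ANTHROPIC_API_KEY secret are my assumptions here; the SWEny docs cover the supported action and exact CLI invocation.

# Sketch of a weekly scheduled run via the CLI.
# The npx invocation, file path, and secret name are illustrative assumptions.
name: regulatory-monitor
on:
  schedule:
    - cron: "0 6 * * 1"    # every Monday at 06:00 UTC
  workflow_dispatch: {}    # allow manual runs too

jobs:
  run-workflow:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run SWEny workflow
        run: npx sweny workflow run workflows/regulatory.yaml
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}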

I want to be precise about what SWEny is not. It is not a chatbot framework. It is not a prompt chain. It is a workflow engine where each node is an autonomous agent with controlled tool access and enforced output contracts. Routing between nodes is deterministic, defined declaratively. The orchestration is as rigid as a Step Function. The execution inside each node is as flexible as a conversation with Claude. That combination is the point.

Proof: A Production Pipeline in Under a Day

I described a pipeline architecture that took months to build. Now let me show you what happened when I applied SWEny to a real, production problem: extracting structured regulatory data from 50+ government jurisdictions. This is not a demo. It is running. It took less than a day.

The problem is brutal. 50+ jurisdictions, each publishing through its own government website with no APIs and no consistency. Some post PDFs. Some have HTML pages buried three links deep. Some publish administrative codes in dense legal prose that reads like it was written to be deliberately opaque. You need structured, queryable JSON out the other end: up to 9 categories per jurisdiction covering numeric thresholds, rules, fee structures, contact information, and more. The source material is written for lawyers and bureaucrats, not for machines.

Here is the SWEny workflow DAG that solves it:

Check Sources

Runs on a schedule via GitHub Actions. The agent checks each registered source URL, computes SHA-256 content hashes, and compares against previously stored hashes. If nothing changed, the workflow exits early. If changes are detected, execution routes to Download.

Discover Sources

Handles initial setup for a new jurisdiction. The agent searches official government domains, validates that URLs resolve to legitimate .gov sites, and follows links one level deep to find sub-pages containing the regulatory data. Once sources are registered, Check Sources takes over for ongoing monitoring.

Download Artifacts

Fetches the raw content (PDFs, HTML pages, administrative code sections) and stores it with content hashes for change detection on future runs.

Extract & Parse

This is where agent reasoning earns its keep. The agent reads unstructured legal prose and government formatting, then produces structured JSON conforming to defined schemas across up to 9 categories. This is not regex extraction. The agent has to understand context, resolve ambiguous references between sections, normalize inconsistent terminology across jurisdictions, and produce clean typed output from source material that was never designed to be machine-readable.
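Using the node syntax from the triage example earlier, the extraction step can be sketched roughly as below. The node name, the skill, and the schema fields are illustrative, not the production definition.

# Illustrative sketch of the extraction node, not the production YAML.
nodes:
  extract_and_parse:
    name: Extract Regulatory Data
    instruction: >-
      Read the downloaded source documents for this jurisdiction. Extract
      the regulatory data into the output schema, normalizing terminology
      and resolving cross-references between sections. Leave a field null
      rather than guessing when the source does not support a value.
    skills:
      - supabase          # assumption: wherever the downloaded artifacts are stored
    output:
      type: object
      properties:
        jurisdiction:
          type: string
        categories:
          type: array      # one object per extracted category, up to 9
          items:
            type: object
      required:
        - jurisdiction
        - categories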

Validate

Runs schema validation and completeness checks against required fields. This is deterministic: the output either conforms or it does not.

LLM Judge

The quality gate. A separate Claude instance reads the raw source text alongside the extracted JSON and verifies accuracy against a 0.9 confidence threshold. This is not self-evaluation. The judge has no knowledge of the extraction instructions. It reads the source, reads the output, and decides independently whether the extraction is faithful.

On failure, the judge produces field-level feedback: the specific field that failed, the extracted value, the relevant source text, and what went wrong. That feedback routes back to the Extract node as additional context for re-extraction. Up to 3 attempts. On pass, the validated JSON publishes to S3. On persistent failure, the jurisdiction gets flagged for human review.

Think about what each of those nodes would require in a traditional pipeline. Discovery alone is a Lambda handler, a headless browser or HTTP client, URL validation logic, a data store for known sources, and deployment infrastructure. The LLM Judge is another Lambda, another queue, prompt engineering, confidence parsing, retry wiring.

In SWEny, each of those is a node with a natural language instruction, an output schema, and an edge. The patterns that took me months to learn in v1 and v2 (LLM-as-judge quality gates, feedback-driven retry loops, progressive data coverage) are expressed as workflow edges and node instructions. Not infrastructure code.

The system currently covers 6 jurisdictions with 48 sources tracked, and expanding to a new jurisdiction means adding source URLs and running the Discover node. No new Lambda handlers. No new queues. No deployment. Just a workflow execution.

Why It Actually Works in Production

I can hear the skeptic. "Nice demo. Call me when it handles production traffic." Fair. I was that skeptic six months ago. So let me address every production concern directly. This is not a proof of concept. It runs on a schedule, processes real regulatory data, and expands coverage every week without code changes.

LLM Judge Quality Gate

The judge enforces a 0.9 confidence threshold with field-level error details, not just pass/fail. The judge reads the raw source material alongside the extracted JSON and evaluates accuracy independently. It has no knowledge of the extraction instructions. This is the same LLM-as-judge pattern from the v1 pipeline, where I had a judge at every generation stage. The difference: in v1, that was a custom Lambda handler with prompt construction, confidence parsing, and retry wiring. In SWEny, it is a workflow node with a natural language instruction and a structured output schema.
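A sketch of what that judge node could look like, reusing the node syntax from earlier; the field names in the output schema are mine, not the production contract.

# Illustrative judge node; output field names are assumptions.
nodes:
  llm_judge:
    name: Judge Extraction Accuracy
    instruction: >-
      Read the raw source text and the extracted JSON side by side. For each
      category, decide whether the extraction is faithful to the source and
      assign a confidence score between 0 and 1. For anything below 0.9,
      report the field, the extracted value, the relevant source text, and
      what went wrong.
    output:
      type: object
      properties:
        passed:
          type: boolean
        failures:
          type: array
          items:
            type: object
            properties:
              field: { type: string }
              extracted_value: { type: string }
              source_excerpt: { type: string }
              problem: { type: string }
      required:
        - passed
        - failures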

Feedback-Driven Retry Loops

When the judge fails an extraction, it does not just say "try again." It produces the same structured, field-level feedback described above. That feedback feeds back to the extraction node as additional context for re-extraction. The agent sees exactly what it got wrong and why. Up to 3 attempts before escalation to human review via automatic ticket creation. In SWEny YAML, this is an edge with a condition and max_iterations: 3. In Step Functions, the same pattern required Choice states, an attempt counter maintained in the execution state, and careful ASL to prevent infinite loops.

Change Detection

HTTP HEAD requests plus SHA-256 content hashing. Sources are checked on a weekly schedule. Unchanged content is never reprocessed. The system tracks ETags, Last-Modified headers, and content hashes for each source URL. If the government website returns the same content, the workflow exits early and no extraction runs. Server instability is handled gracefully with consecutive failure tracking: a source that returns a 500 once does not get marked as changed, and a source that returns a 500 three times in a row gets flagged for investigation rather than silently dropped.
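Expressed as a node, the change-detection behavior reads roughly like this sketch; the skill and field names are again illustrative assumptions.

# Illustrative change-detection node and early-exit routing.
nodes:
  check_sources:
    name: Check Registered Sources
    instruction: >-
      For each registered source URL, issue an HTTP HEAD request and compare
      the ETag, Last-Modified header, and SHA-256 content hash against the
      stored values. Treat a single 5xx response as unchanged; flag a source
      only after three consecutive failures. Report which sources changed.
    skills:
      - supabase          # assumption: where source records and hashes are stored
    output:
      type: object
      properties:
        changed_sources:
          type: array
          items:
            type: string
        flagged_sources:
          type: array
          items:
            type: string
      required:
        - changed_sources
        - flagged_sources

edges:
  - from: check_sources
    to: download_artifacts
    when: "changed_sources is not empty"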

Execution Tracing

Every tool call, every routing decision, every node result is recorded. Full observability into what the agent did, what it decided, and why. This was the v1 pain point that drove the Step Functions migration in the first place: debugging went from grepping across 12 CloudWatch log groups to opening the execution trace. SWEny provides the same structured visibility without the AWS console. One trace per execution, every decision captured, every tool invocation logged with inputs and outputs.

Mode-Based Operation

Four operational modes from a single workflow definition. Monitor: check all registered sources for changes and report. Retry-failed: re-extract only the jurisdictions that failed the judge on a previous run. Expand: discover and onboard a new jurisdiction end-to-end. Auto: run monitor, then retry-failed, then expand in sequence. One YAML file defines all four modes. The scheduled GitHub Action runs auto mode weekly. When something fails, I run retry-failed manually. When I want to verify sources are stable without triggering extractions, I run monitor. Same workflow, different entry points.

Progressive Coverage

The system does not need all 50+ jurisdictions before going live. It launched with whatever was ready. Auto mode expands incrementally, adding one new jurisdiction per scheduled run. Week one: 3 jurisdictions. Week two: 4. Week six: 6 jurisdictions with 48 tracked sources. The system gets more valuable every week without a single code change. No new Lambda handlers. No new queues. No deployment pipeline. The workflow adds a jurisdiction the same way a human would: find the sources, download the content, extract the data, validate it, judge it, publish it. The only difference is that it happens autonomously on a schedule.

Conditional retry edges
edges:
  - from: llm_judge
    to: extract_and_parse
    when: "any category failed with confidence below 0.9"
    max_iterations: 3
  - from: llm_judge
    to: publish
    when: "all categories passed with confidence at or above 0.9"

That retry edge is the same pattern I built with Choice states, attempt counters, and custom ASL in Step Functions. Here it is a few lines of declarative YAML.

The Comparison

I am not mocking the traditional approach. It was the right architecture for its stage of the product. The v1 pipeline got the product to market. The v2 migration solved real operational pain. Every architecture has a scaling boundary, and recognizing when you have crossed it matters more than picking the perfect architecture on day one.

But the boundary I crossed this time was not about scale or throughput. It was about the abstraction layer itself. The paradigm has shifted.

The Claim

The pipeline is not dead for moving rows between databases. Airflow, Step Functions, and traditional ETL are the right tools when your data is structured and your transforms are deterministic. That world has not changed.

The pipeline is dead for anything that requires reasoning.

Regulatory data scattered across government websites. Research papers in PDFs that need structured extraction. Support tickets that need classification, root-cause analysis, and routing. Financial filings with inconsistent formatting across jurisdictions. Medical records with free-text narratives that need structured coding. Any domain where the data requires understanding, not just transformation, is a candidate. If your pipeline steps are fundamentally natural language tasks wrapped in infrastructure code, you are overengineering the problem.

Describe what you want. Define the routing. Let the agents reason. That is the whole pitch.

Try SWEny. The full backstory is in the three posts that document everything this library was built to replace: Intelligent Batching & the Data Flywheel, Building an Event-Driven AI Pipeline with SNS/SQS, and From SNS/SQS to Step Functions.

Resources

The tool, the pipeline series that motivated it, and the infrastructure and AI building blocks referenced throughout the post.

SWEny

SWEny Docs

Official documentation for SWEny workflows, skills, the executor API, and the CLI.

SWEny GitHub

Source code for the SWEny CLI and @sweny-ai/core library.

SWEny E2E GitHub

Standalone repo for the SWEny E2E pattern: ready-to-run workflow templates and examples for end-to-end browser testing.

SWEny Cloud

Managed SWEny workflows with hosted execution, monitoring, and team collaboration.

SWEny Marketplace

Free community library of pre-built DAG workflows: triage, security audits, code review, content pipelines, and more. Browse, remix, or publish your own.

The Pipeline Series

Intelligent Batching & the Data Flywheel

Dual-level batching with priority cascading, context validation, and the feedback loop that compounded AI output quality over time.

Building an Event-Driven AI Pipeline with SNS/SQS

The v1 architecture: twelve Lambda handlers chained through SNS/SQS, LLM-as-judge quality gates, and the AI debugger.

From SNS/SQS to Step Functions

The v2 migration to hybrid Step Functions orchestration, multi-track fan-out, and per-batch process isolation.

Traditional Pipeline Stack

AWS Step Functions

Serverless workflow orchestration with ASL, Map states, and retry policies. The v2 backbone this post argues against as a reasoning layer.

Amazon SNS

Pub/sub messaging used in the v1 architecture to fan events out between pipeline stages.

Amazon SQS

Managed queue service used as the stage-to-stage buffer in the original event-driven topology.

AWS Lambda

Stateless compute behind every handler in the v1 pipeline, from ingestion to quality gates to delivery.

AI and APIs

Claude API

Anthropic's Claude models. SWEny uses Claude via the Agent SDK for reasoning and tool use at every node.

Claude Agent SDK

The SDK that gives each workflow node autonomous reasoning, structured output, and controlled tool access.

Model Context Protocol

The open protocol SWEny skills use to expose tools (GitHub, Slack, Linear, Sentry, Supabase, and custom servers) to the agent layer.