Replace Playwright with Two Commands: AI-Driven E2E Testing with SWEny

Testing
AI
E2E
Automation
CI/CD
Agentic Workflows
TypeScript
2026-04-06

Introduction

I've maintained Playwright and Cypress suites across multiple production apps. You probably have too. You know the drill: a designer moves a button, changes a class name, or reorders a form, and suddenly half your E2E suite is red. You didn't break anything. The app works fine. But your tests are coupled to implementation details that have nothing to do with user behavior.

The maintenance cost isn't just the time you spend fixing selectors. It's the trust erosion. After a few false positives, your team stops taking E2E failures seriously. The suite becomes background noise. And that's worse than having no tests at all, because now you have false confidence.

What if you could replace all of that with two commands? No test framework. No selectors. No page objects. No assertion library. Just describe your flows in plain English, and an AI agent figures out the rest.

That's what SWEny does. Run sweny new, pick "End-to-end browser testing" from the menu, and a wizard generates workflow-based browser tests. Run them with sweny e2e run and an AI agent drives a real browser via the accessibility tree. The agent reads the page the way a screen reader does, reasons about what it sees, and interacts using element references. When the UI changes, the agent adapts. You maintain YAML files with natural language instructions, not test code.

This post walks through how it works, what gets generated, and how to customize it. I'll also show a real production case study: the KidMath.ai conversion funnel tested end-to-end, with actual run output and screenshot evidence.

Two Commands. That's It.

Terminal
# 1. Create your E2E workflows
sweny new
# → pick "End-to-end browser testing" from the menu
# → choose flows, configure cleanup, done

# 2. Run all generated tests
sweny e2e run

sweny new is a unified project setup command. It opens an interactive picker where you choose what kind of workflow to create: a built-in template (PR review, issue triage, security audit, release notes), an AI-generated workflow from a description, or — the one we care about here — end-to-end browser testing. Pick that option and a dedicated E2E wizard takes over, asking which flows to test (registration, login, purchase, onboarding, upgrade, cancellation, or custom), collecting details per flow (URL paths, form fields, success criteria), and generating self-contained workflow YAML files in .sweny/e2e/. The run command loads those files, resolves template variables, and executes each one with an AI agent driving agent-browser.

Here's what the output looks like:

sweny e2e run
──────────────────────────────────────────────────
 E2E Test Run
 Target: http://localhost:3000
 Run ID: 1712345678
──────────────────────────────────────────────────

▶ registration.yml
  ✓ setup — ready (3.2s)
  ✓ test_registration — pass (12.4s)
  ✓ report — done (1.1s)

▶ purchase.yml
  ✓ setup — ready (2.1s)
  ✓ login — pass (8.3s)
  ✓ test_purchase — pass (15.7s)
  ✓ report — done (1.0s)

Results: 2/2 workflows passed

Exit code 0 if everything passes, 1 if anything fails. CI-friendly out of the box.
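That contract is simple enough to sketch. The result shape below is hypothetical, not SWEny's real internals; the point is that the CI gate reduces to a single reduction over workflow results:

```typescript
// Hypothetical result shape for illustration; SWEny's internals may differ.
interface WorkflowResult {
  file: string;
  status: "pass" | "fail";
}

// CI gate: exit 0 only when every workflow passed, 1 otherwise.
function exitCode(results: WorkflowResult[]): 0 | 1 {
  return results.every((r) => r.status === "pass") ? 0 : 1;
}

exitCode([
  { file: "registration.yml", status: "pass" },
  { file: "purchase.yml", status: "pass" },
]); // → 0
```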

What SWEny Handles For You

If you were to build this yourself, you'd need a TypeScript runner, a shell script to manage the browser daemon, a custom browser skill with tool handlers, template variable injection, and cleanup logic. That's a lot of plumbing for something that should be simple. SWEny handles all of it:

| Concern | DIY Approach | Now (sweny e2e) |
| --- | --- | --- |
| Workflow definition | Write YAML by hand | Generated by sweny new wizard |
| Browser automation | Custom skill with tool handlers | Agent drives agent-browser directly via shell commands |
| Daemon lifecycle | Shell script (run.sh) with trap/poll | Setup node handles install, start, and readiness polling |
| Template variables | Manual string replacement in runner | Auto-resolved from env vars and run ID |
| Test data cleanup | Custom cleanup.ts per project | Wizard generates backend-specific cleanup node |
| Runner / orchestration | Custom runner.ts with execute() | sweny e2e run (built-in) |
| CI exit codes | Manual process.exit mapping | Built-in: 0 = pass, 1 = fail |
| Dependencies | @sweny-ai/core + tsx + yaml | Just sweny (already installed) |

The shift is that you no longer write any TypeScript for your E2E suite. The entire test suite lives in YAML files that the wizard generates and you customize. SWEny handles orchestration, the agent handles browser interaction, and the generated workflow handles the structure.

Here's the before and after in file count:

DIY: Custom Runner
e2e/
  run.sh          # daemon lifecycle
  runner.ts       # custom orchestration
  cleanup.ts      # test data cleanup
  skills/
    browser.ts    # 200+ line skill
  e2e-uat.yml     # hand-written workflow
  package.json    # 3 dependencies
  .env
Now: sweny new
.sweny/
  e2e/
    registration.yml  # generated
    login.yml         # generated
    purchase.yml      # generated
.env                  # appended by wizard

Seven files vs three generated YAML files and an env var.

The Generated Workflow

Each flow you select in the wizard becomes a self-contained workflow YAML file. The simplest case is a three-node DAG: setup, test, and report.

Here's the full YAML for a registration test:

.sweny/e2e/registration.yml
id: e2e-registration
name: "E2E: Registration"
description: End-to-end test for registration flow
entry: setup
nodes:
  setup:
    name: Browser Setup
    instruction: |-
      Install agent-browser if missing, start the daemon, poll until ready.
    output:
      type: object
      properties:
        status:
          type: string
          enum: [ready, fail]
      required: [status]
  test_registration:
    name: "Test: Registration"
    instruction: |-
      Navigate to {base_url}/signup. Take a snapshot.
      Fill the registration form:
      - email: {test_email}
      - password: {test_password}
      - name: E2E Test User
      Submit the form. Wait 3 seconds. Take a snapshot.
      Verify redirect to /dashboard. Take a screenshot as evidence.
      Report pass if registration succeeded, fail otherwise.
    output:
      type: object
      properties:
        status: { type: string, enum: [pass, fail] }
        error: { type: string }
      required: [status]
  report:
    name: Test Report
    instruction: |-
      Compile results from test_registration.
      Report pass/fail counts and any errors.
edges:
  - from: setup
    to: test_registration
    when: "setup status is ready"
  - from: setup
    to: report
    when: "setup status is fail"
  - from: test_registration
    to: report

The structure is always the same: setup, test node(s), optional cleanup, report. The setup node bootstraps agent-browser (installing it if needed). Test nodes contain natural language instructions. The report node aggregates results. Conditional edges short-circuit to the report if setup fails.

Auth-dependent flows (purchase, onboarding, upgrade, cancellation) automatically get a login node inserted before the test node.

Each workflow is independent. A purchase workflow includes its own login step so it can run in isolation. No shared state between workflows.

How It Works Under the Hood

SWEny's e2e run command does five things:

  1. Discovers workflow files from .sweny/e2e/*.yml (or a specific file if you pass one).
  2. Builds template variables from environment variables. Auto-generates run_id, test_email, and test_password. Picks up any E2E_* env vars.
  3. Resolves variables in every node instruction by replacing {base_url}, {test_email}, etc. with real values.
  4. Executes each workflow sequentially using SWEny's DAG executor. The AI agent walks the accessibility tree, reasons about what it sees, and calls agent-browser shell commands to interact. Each node returns structured output (pass/fail). Conditional edges route the flow.
  5. Reports results with per-node timing and a pass/fail summary. Exits 0 or 1 for CI gating.
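Step 3 is plain placeholder substitution. Here's a minimal sketch in TypeScript, assuming a simple `{name}` syntax where unknown placeholders pass through untouched (the function name and behavior on unknowns are my assumptions, not SWEny's documented API):

```typescript
// Illustrative sketch of template variable resolution, not SWEny internals.
// Replaces {name} placeholders with values from a vars map; placeholders
// without a matching value are left as-is so the agent can flag them.
function resolveVars(
  instruction: string,
  vars: Record<string, string>,
): string {
  return instruction.replace(/\{(\w+)\}/g, (match, name) =>
    name in vars ? vars[name] : match,
  );
}

const vars = {
  base_url: "http://localhost:3000",
  test_email: "e2e-1712345678@yourapp.test",
};
resolveVars("Navigate to {base_url}/signup as {test_email}.", vars);
// → "Navigate to http://localhost:3000/signup as e2e-1712345678@yourapp.test."
```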

The agent doesn't use screenshots or vision. agent-browser reads the accessibility tree and returns a text representation with element references like @e1, @e2, @e3. The agent uses those refs to click, fill, and scroll. Text tokens only. No vision tokens. This is fast, cheap, and more reliable than pixel-based approaches.
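To make the ref mechanism concrete, here's a toy extractor. The snapshot text format below is invented for illustration; the real agent-browser output differs in detail:

```typescript
// Toy sketch: pull @eN element references out of an accessibility-tree
// snapshot. The snapshot format here is illustrative, not agent-browser's.
function extractRefs(snapshot: string): string[] {
  return snapshot.match(/@e\d+/g) ?? [];
}

const snapshot = [
  'button "Sign up" @e1',
  'textbox "Email" @e2',
  'textbox "Password" @e3',
].join("\n");

extractRefs(snapshot); // → ["@e1", "@e2", "@e3"]
```

The agent then issues commands against those refs (click @e1, fill @e2), never against CSS selectors, which is why UI refactors don't break it.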

The Wizard

Run sweny new and you get an interactive picker:

sweny new
? What do you want to do?
    PR Review Bot                — automated code review on pull requests
    Issue Triage                 — classify, prioritize, and label incoming issues
    Security Audit               — scan code and dependencies for security issues
    Release Notes                — generate release notes from commits and PRs
  ● End-to-end browser testing   — automated browser tests for your app
    Describe your own            — AI-generated from your description
    Start blank                  — just set up config

Select End-to-end browser testing and SWEny hands off to a dedicated E2E wizard. It asks which user flows to cover:

| Flow Type | What It Tests |
| --- | --- |
| Registration | Signup form with auto-generated test credentials |
| Login | Authentication with existing or generated credentials |
| Purchase | Pricing page through checkout (includes login) |
| Onboarding | Multi-step wizard after login |
| Upgrade | Plan change flow after login |
| Cancellation | Cancel/downgrade flow after login |
| Custom | Any flow you describe in a sentence |

For each flow, the wizard asks for the URL path, success criteria, and any flow-specific details (form fields for registration, payment provider for purchase). It generates one YAML file per flow in .sweny/e2e/ and appends the necessary env vars to your .env.

The generated files are version-controllable. Check them into your repo so everyone on the team can run sweny e2e run with the same test suite. When you want to change what a test verifies, edit the YAML. When you want to add a flow, run sweny new again or copy an existing file.

Template Variables

Workflow instructions use {variable} placeholders that are auto-resolved at runtime:

| Variable | Source | Example |
| --- | --- | --- |
| {base_url} | E2E_BASE_URL env var | http://localhost:3000 |
| {run_id} | Auto-generated timestamp or RUN_ID env var | 1712345678 |
| {test_email} | Auto-generated from run_id | e2e-1712345678@yourapp.test |
| {test_password} | Auto-generated from run_id | E2eTest!1712345678 |
| {email} | E2E_EMAIL (falls back to test_email) | test@example.com |
| {password} | E2E_PASSWORD (falls back to test_password) | secret123 |
| {*} | Any E2E_* env var (prefix stripped) | E2E_API_KEY becomes {api_key} |

The e2e-{timestamp}@yourapp.test naming convention is deliberate. Any email matching that pattern is guaranteed to be test data, which makes cleanup safe and simple.
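The convention is easy to sketch. Both helpers below are illustrative, not part of SWEny:

```typescript
// Illustrative helpers for the test-data naming convention.
// Any address matching this pattern is, by construction, test data.
const TEST_EMAIL_RE = /^e2e-\d+@yourapp\.test$/;

function testEmail(runId: number): string {
  return `e2e-${runId}@yourapp.test`;
}

// Cleanup can safely delete anything that matches; real users never do.
function isTestData(email: string): boolean {
  return TEST_EMAIL_RE.test(email);
}

isTestData(testEmail(1712345678)); // → true
isTestData("alice@yourapp.test");  // → false: real-looking address, untouched
```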

Test Data Cleanup

When you enable cleanup in the wizard, SWEny generates a cleanup node that runs between the last test and the report. The node contains backend-specific instructions for the agent:

  • Supabase: Deletes test users via Auth Admin API using SUPABASE_SERVICE_ROLE_KEY
  • Firebase: Deletes test users via Firebase Admin SDK
  • PostgreSQL: Deletes test records matching e2e-{run_id} via DATABASE_URL
  • REST API: Calls your app's own admin endpoint with a service token
  • Other: Generic instructions for the agent to clean up however the app supports

Cleanup is convention-based. Test emails follow e2e-{timestamp}@yourapp.test, so the agent finds them by prefix match. No tracking table. No shared state. If the service key isn't available, the agent skips cleanup gracefully.

Cross-Node Data Flow

This is the pattern that makes everything self-contained. Each node's structured output is available as context to every subsequent node. No config files, no shared state, no template variables needed for runtime values.

The provision node generates credentials, creates test data, and outputs everything subsequent nodes need. Test nodes reference these values naturally:

Provision outputs, test nodes consume
# The provision node outputs runtime values
provision:
  instruction: |-
    Generate a unique run ID: date +%s
    Create test email: e2e-<RUN_ID>@yourapp.test
    Create test user via admin API...
    Output: base_url, run_id, test_email, test_password, user_id
  output:
    properties:
      status: { type: string, enum: [ready, fail] }
      base_url: { type: string }
      user_id: { type: string }
      test_email: { type: string }
      test_password: { type: string }

# Subsequent nodes reference values from provision's context
test_signin:
  instruction: |-
    Use the test_email and test_password from the provision step.
    Navigate to <BASE_URL>/auth...
  # Claude knows these values — they're in its context

test_purchase:
  instruction: |-
    Use the user_id from the provision step to update the
    subscription via PostgREST...
  # Claude remembers the user_id from provision's output

This works because each node in the DAG receives the accumulated context from all its predecessors. The agent doesn't need template variable resolution or environment variable lookups for test-specific values. It just remembers. Infrastructure credentials like $SUPABASE_URL and $SUPABASE_SERVICE_ROLE_KEY still come from env vars, read by the shell when the agent runs curl commands. Secrets stay in env vars, test data flows through node outputs.
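The accumulation rule itself fits in a few lines. This is a toy model, not SWEny's actual executor: each node's structured output is merged into the context that every later node receives.

```typescript
// Toy model of context accumulation in a linear DAG walk.
// Real SWEny internals (parallel branches, conditional edges) differ.
type NodeOutput = Record<string, unknown>;

function runDag(
  order: string[], // node names in topological order
  run: (node: string, context: NodeOutput) => NodeOutput,
): NodeOutput {
  let context: NodeOutput = {};
  for (const node of order) {
    const output = run(node, context);
    context = { ...context, ...output }; // later nodes see earlier outputs
  }
  return context;
}

const final = runDag(["provision", "test_signin"], (node, ctx) =>
  node === "provision"
    ? { user_id: "u_123", test_email: "e2e-1@yourapp.test" }
    : { signin: ctx.test_email ? "pass" : "fail" }, // consumes provision output
);
// final.signin === "pass" because provision's output reached test_signin
```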

Data Provisioning via curl

One of the biggest simplifications: test data provisioning lives directly in the workflow YAML as curl commands. No cleanup.ts, no Supabase SDK, no build step. The agent reads env vars and calls your backend's admin API.

Supabase provision node
provision:
  instruction: |-
    STEP 1 — Clean up stale test users:
    curl -s "$SUPABASE_URL/auth/v1/admin/users?per_page=100" \
      -H "apikey: $SUPABASE_SERVICE_ROLE_KEY" \
      -H "Authorization: Bearer $SUPABASE_SERVICE_ROLE_KEY"
    Find users matching e2e-*@yourapp.test. Delete each.

    STEP 2 — Create pre-confirmed test user:
    curl -s -X POST "$SUPABASE_URL/auth/v1/admin/users" \
      -H "apikey: $SUPABASE_SERVICE_ROLE_KEY" \
      -H "Authorization: Bearer $SUPABASE_SERVICE_ROLE_KEY" \
      -H "Content-Type: application/json" \
      -d '{"email": "<TEST_EMAIL>", "password": "<TEST_PASSWORD>", "email_confirm": true}'
    Extract the "id" from the response.

The same pattern works for inserting related data via PostgREST:

Direct table inserts
test_learner_setup:
  instruction: |-
    Insert a learner via Supabase PostgREST. Use the user_id from the provision step:
    curl -s -X POST "$SUPABASE_URL/rest/v1/learners" \
      -H "apikey: $SUPABASE_SERVICE_ROLE_KEY" \
      -H "Authorization: Bearer $SUPABASE_SERVICE_ROLE_KEY" \
      -H "Content-Type: application/json" \
      -d '{"parent_id": "<USER_ID>", "name": "Panda<RUN_ID>", "pin": "e2ePass7", "grade_level": "3rd-grade"}'

No cleanup script needed. The admin API is simple REST, and raw curl in the YAML instruction is all it takes.

Real-World: KidMath Conversion Funnel

Here's what this looks like in production. KidMath.ai's entire E2E suite is a single YAML workflow. No custom TypeScript. No test framework. Just the workflow file and sweny e2e run.

The workflow tests the full conversion funnel: parent sign-in, subscription purchase (Stripe checkout redirect), learner creation, and learner login via username + PIN.

Here's the actual output from a production run:

sweny workflow run .sweny/e2e/conversion-funnel.yml
▲ e2e-conversion-funnel
  ✓ setup                 36.7s
  ✓ provision             24.4s
  ✓ test_signin           54.4s
  ✓ test_purchase        149.7s
  ✓ test_learner_setup    91.2s
  ✓ test_learner_login    81.8s
  ✓ cleanup               25.1s
  ✓ report                12.3s

Workflow completed

8 nodes, all green, ~8 minutes total. The screenshots tell the story:

[Screenshot] Sign-in: redirected to My Account

[Screenshot] Purchase: Stripe Checkout redirect showing the Supporter plan

[Screenshot] Learner login: dashboard with welcome message

Every design decision that made this workflow reliable is captured in Lessons Learned below, with the reasoning behind each choice.

Why This Beats Playwright and Cypress

I'm not saying throw away your unit tests. I'm saying the way we write E2E tests is fundamentally broken, and this fixes it.

| Traditional E2E | sweny e2e |
| --- | --- |
| CSS selectors break on refactors | A11y tree snapshots adapt to UI changes |
| Coded test scripts to maintain | Natural language YAML, generated by wizard |
| Test framework + assertion library + page objects | sweny new + sweny e2e run |
| Brittle waits and retries | Agent reasons about page state |
| ~$0/run but hours of maintenance | ~$0.10/run, zero maintenance |
| Fixed assertion logic | Agent adapts (e.g., switches from sign-in to sign-up) |
| Per-project setup and plumbing | Portable: same workflow, any web app |

The adaptive behavior is the one that surprised me most. When a pre-provisioned sign-in fails (wrong password, user doesn't exist), the agent will try sign-up on its own. I didn't code that. The agent just reasons through it. In a Playwright suite, that's a hard failure. With an agent, it's a graceful fallback.

Cost

This is the question everyone asks, and the answer is surprisingly cheap.

| Item | Cost |
| --- | --- |
| Per scenario (text tokens, no vision) | ~$0.02-$0.05 (Sonnet), varies by model |
| Full suite run (4-6 scenarios) | ~$0.10-$0.20 |
| Stripe test mode | $0 |
| agent-browser | Free, open source |
| Browserbase (CI) | Free tier available |

The "no vision" part is key. Because agent-browser uses the accessibility tree instead of screenshots, you're paying for text tokens only. Vision tokens are significantly more expensive. A single screenshot can cost as much as an entire scenario run.

Compare this to the hidden cost of traditional E2E: an engineer spending 2-4 hours per sprint fixing broken selectors. At any reasonable hourly rate, $0.20/run pays for itself after one avoided maintenance session.
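The break-even arithmetic is simple. The rates below are illustrative assumptions, not measured figures:

```typescript
// Back-of-envelope: how many full-suite agent runs does one avoided
// selector-maintenance session pay for? All inputs are assumptions.
function breakEvenRuns(
  hourlyRate: number, // engineer cost per hour
  hoursPerSession: number, // time spent fixing broken selectors
  costPerRun: number, // full-suite agent run cost
): number {
  return Math.floor((hourlyRate * hoursPerSession) / costPerRun);
}

breakEvenRuns(100, 2, 0.2); // → 1000 runs per avoided 2-hour fix
```

At those assumed rates, a nightly CI run would take almost three years to cost what one maintenance session does.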

GitHub Actions

Use the packaged swenyai/e2e@v1 action. It installs Node, @sweny-ai/core, and agent-browser, runs agent-browser install to pull Chrome, executes sweny workflow run on your YAML, and uploads the results/ directory as an artifact on every run, passing or failing.

.github/workflows/e2e.yml
name: E2E Tests

on:
  workflow_dispatch:
    inputs:
      base_url:
        description: "Target URL"
        required: false
        default: "https://yourapp.com"

concurrency:
  group: e2e
  cancel-in-progress: true

jobs:
  e2e:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - uses: swenyai/e2e@v1
        with:
          workflow: .sweny/e2e/conversion-funnel.yml
          base-url: ${{ inputs.base_url }}
          claude-oauth-token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
        env:
          # Anything your provision/test/cleanup nodes need
          SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
          SUPABASE_SERVICE_ROLE_KEY: ${{ secrets.SUPABASE_SERVICE_ROLE_KEY }}

One step, no manual setup-node, no manual npm installs, no manual upload-artifact. The action handles all of it.

CI Tips

  • ubuntu-latest is the right default for web-based testing. It's cheaper, faster to provision, and ships with Chrome preinstalled so agent-browser works out of the box. Reach for macos-latest only when you need Claude to drive a native app through a simulator, which requires the macOS toolchain.
  • concurrency: cancel-in-progress prevents multiple E2E runs from stacking up and burning minutes.
  • timeout-minutes: 20 gives a buffer above SWEny's default 15-minute per-workflow timeout.
  • Override timeout with sweny e2e run --timeout 300000 for shorter or longer runs.
  • Run a single workflow with sweny e2e run .sweny/e2e/purchase.yml. Useful for debugging a flaky flow without burning time on the full suite, or for splitting long suites into parallel CI jobs.

Going Deeper: The Programmatic API

The CLI covers the common case. But SWEny's programmatic API is still available when you need full control: custom skills with tool handlers, dynamic workflow generation, or integration into a larger system.

Programmatic API (custom skills)
import { execute, ClaudeClient, createSkillMap } from "@sweny-ai/core";
import { browser } from "./skills/browser.js";

// Register custom skills with tool handlers
const skills = createSkillMap([browser]);
const claude = new ClaudeClient({ maxTurns: 80 });

// One function call runs the entire workflow
// (workflow, run_id, base_url defined elsewhere in your setup)
const { results } = await execute(
  workflow,
  { run_id, base_url },
  { skills, claude },
);

The CLI and the programmatic API use the same underlying executor. For most projects, the YAML-only approach is enough. But if you need typed tool handlers, dynamic workflow construction, or programmatic result processing, the API gives you that control.

You can also run workflows directly with sweny workflow run path/to/workflow.yml for one-off runs outside the E2E harness.

Beyond E2E: What Else Can You Orchestrate?

Remember the sweny new picker? E2E browser testing is one option. The same command also offers built-in workflow templates and an AI-powered "describe your own" option that generates a workflow from a sentence. The underlying model — DAG workflows with natural language instructions and structured output — is general-purpose. Once you see how it works for browser automation, you start seeing DAGs everywhere:

  • Content pipelines. Node 1 researches a topic, node 2 writes a draft, node 3 reviews, node 4 publishes. Conditional edges handle "needs revision" loops.
  • Data processing. Ingest from an API, transform with an LLM, validate against a schema, load into a database. Branching on validation failures.
  • Code review automation. Check out a PR, run analysis, review the diff against conventions, post structured feedback.
  • Multi-agent coordination. One node generates a plan, another executes it. SWEny handles the handoff and the data contract between them.

SWEny is a workflow orchestration framework for AI agents. E2E testing is just a particularly good fit because the problem maps naturally to "sequence of steps with structured output and conditional routing."

If you want to see real examples of each of these patterns, the SWEny Marketplace is a growing library of community workflows organized by category: Triage, Security, DevOps, Code Review, Testing, Content, Ops. You can browse the DAG for each one, see which integrations it needs, remix it into your own repo, or publish your own. The marketplace runs the same workflow format as .sweny/e2e/*.yml, so anything you build locally is already publishable.

Lessons Learned

Things I discovered the hard way across KidMath.ai, Awdio.ai, and Offload. Some of these cost me hours. Save yourself the debugging sessions.

  1. Close stale sessions before every run. agent-browser close --all in the setup node. Without it, the agent sees leftover pages from previous runs, or a completely different app if another project used the daemon.
  2. Poll for daemon readiness, don't sleep. agent-browser get url in a retry loop (30-second timeout) in the setup node beats a hard sleep 2. Daemon startup time varies between local and CI, and a hardcoded sleep is both flaky and slower than it needs to be.
  3. The agent adapts more than you expect. When sign-in fails, it tries sign-up on its own. I didn't code that. The agent just reasons through it. Design your test instructions to explicitly support the intended path, or the agent will find its own.
  4. Be specific in instructions. "Verify a heading exists" will always pass. "Verify a heading about math education for kids" actually tests content. Vague instructions produce false positives.
  5. Screenshots are evidence, not assertions. The accessibility tree snapshot is the verification mechanism. Screenshots go in results/ for debugging and CI artifacts. Never gate pass/fail on screenshot content.
  6. Use curl, not SDKs. The Supabase Admin API is simple REST. Raw curl in the YAML cleanup node handles it with zero dependencies. No reason to pull in an SDK for three API calls.
  7. Convention-based test data makes cleanup bulletproof. The e2e-{timestamp}@yourapp.test pattern means cleanup can find and delete test users with a simple prefix match, even across crashed runs. No tracking table needed.
  8. Route ALL failures through cleanup. Test failures must not leave stale data. Every test node's failure edge should go to cleanup, then report. The only exception: setup failure (nothing to clean up yet).
  9. The provision node is the single source of truth. It generates credentials, reads env vars, creates test data, and outputs everything. Subsequent nodes reference its output. No globals, no config files, no coordination.
  10. Bypass complex UI when testing isn't the goal. KidMath's avatar wizard is 9 steps. Testing it is a separate scenario. For the conversion funnel test, I insert the learner via PostgREST and verify the UI reflects it. Test the thing you're testing.
  11. Verify redirects, not third-party UIs. The purchase test confirms Stripe checkout redirect happened (proves the edge function works) but doesn't fill Stripe's payment form. That's Stripe's responsibility.
  12. Port conflicts are silent killers. If E2E_BASE_URL points to the wrong port, all tests pass against the wrong app. Always verify the target URL first.
  13. Start with the conversion funnel. Sign-up or sign-in, one core feature, and one flow that touches your payment system. Get those green before expanding. The YAML is easy to extend once the pattern is working.
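Lesson 2 translates directly to code. Here's a generic poll-until-ready helper as a sketch; the `check` callback stands in for shelling out to `agent-browser get url`, and none of this is SWEny's actual setup node:

```typescript
// Sketch of lesson 2: poll for readiness instead of a fixed sleep.
// Returns true as soon as `check` succeeds, false if the deadline passes.
async function pollUntilReady(
  check: () => Promise<boolean>, // e.g. wraps `agent-browser get url`
  timeoutMs = 30_000,
  intervalMs = 500,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return true; // ready: no wasted wait, unlike sleep 2
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // daemon never came up within the timeout
}
```

The payoff is exactly what the lesson says: fast startups return immediately, slow CI startups still succeed, and only a genuinely dead daemon burns the full timeout.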

Wrapping Up

That's the whole thing. Two commands replace your entire E2E suite. The wizard generates YAML, the runner executes it with an AI agent driving a real browser, and you maintain prose instead of selectors. The YAML you write today keeps working when you redesign the sign-up form tomorrow, because the agent reads the accessibility tree and reasons about intent, not class names.

Try it on your current project:

Get started in under a minute
npm install -g @sweny-ai/core
sweny new       # pick "End-to-end browser testing"
sweny e2e run   # run the generated tests

Pick one flow, get it green, then expand. Start with whatever converts users: sign-up, sign-in, or checkout. Those are the flows where broken tests cost you the most, and they're also the ones where an adaptive agent earns its keep. Once you have one workflow passing, run sweny new again to add the next one — the picker is safe to re-run in an existing project and won't clobber your config.

When you're ready to do more than browser testing, the SWEny Marketplace is a good next stop. It's a free community library of pre-built DAG workflows for everything from alert triage to code review automation. Same format as the files your E2E wizard just generated, so anything you pick up there drops into .sweny/workflows/ and runs with sweny workflow run.

The shift is subtle but worth stating plainly: you no longer describe how to test your app, you describe what you want verified. The agent figures out the how. That's the whole promise, and after running this pattern in production across multiple apps, I can tell you it holds up.

Resources

Everything you need to get started and go deeper.

SWEny

SWEny Docs

Official documentation for SWEny workflows, skills, the executor API, and the CLI.

SWEny GitHub

Source code for the SWEny CLI and @sweny-ai/core library.

SWEny E2E GitHub

Standalone repo for the SWEny E2E pattern: ready-to-run workflow templates, examples, and the quickest route to publishing browser-testing workflows on the marketplace.

SWEny Cloud

Managed SWEny workflows with hosted execution, monitoring, and team collaboration.

SWEny Marketplace

Free community library of pre-built DAG workflows: triage, security audits, code review, content pipelines, and more. Browse, remix, or publish your own.

E2E CLI Docs

Full reference for sweny new, sweny e2e run, and the CLI commands, including template variables and cleanup options.

Browser Automation

agent-browser

CLI daemon for browser automation via accessibility tree snapshots. The browser control layer in this pattern. Source at github.com/vercel-labs/agent-browser, npm package agent-browser.

Browserbase

Hosted browser sessions for CI environments without local Chrome. Free tier available.

AI and APIs

Claude API

Anthropic's Claude models. SWEny uses Claude via the Agent SDK for reasoning and tool use.

Claude Code

Anthropic's agentic coding tool. CLAUDE_CODE_OAUTH_TOKEN lets SWEny use Claude without burning API credits.

Supabase Auth Admin API

The admin API used for test user provisioning and cleanup in the Supabase backend option.