From Cron to Real-Time: Hardening an Autonomous Triage Agent

AI
SRE
Observability
SWEny
Sentry
Agentic Workflows
DevOps
AWS Lambda
2026-04-18

The Gap

In The Triage Agent, I showed how to set up an autonomous agent that watches your logs, files tickets, and opens fix PRs. It runs on a cron schedule, sweeping your observability stack every few hours and acting on what it finds.

That works. But it has a blind spot: timing. A cron that fires every 6 hours means a critical error can sit unnoticed for up to 6 hours. You could shorten the interval, but then you're burning CI minutes on runs that find nothing. The scheduled sweep is wide (it scans everything) but it's slow to react.

What you actually want is both: a wide scheduled scan that catches patterns over time, and a focused reactive trigger that fires the moment something breaks. Two sides of the same coin.

Two Modes of the Same Agent

The same triage workflow handles both modes. The difference is scope, not structure:

Scheduled triage is a patrol. Reactive triage is a dispatch. The patrol catches the things nobody noticed. The dispatch responds to the thing that just happened. You want both.

The Relay: Sentry to GitHub Actions

The problem: Sentry can fire webhooks, but GitHub Actions can't listen on a URL. Actions are triggered by events inside GitHub: pushes, PRs, schedules, or repository_dispatch. So you need something in the middle that receives the Sentry webhook and converts it into a GitHub dispatch event.

The relay is a single Lambda behind a Function URL. No API Gateway needed; Function URLs give you an HTTPS endpoint for free. The Lambda does four things:

  1. Verifies the Sentry signature. HMAC-SHA256 against the integration's client secret. Rejects tampered payloads.
  2. Extracts issue metadata. Title, short ID, level, culprit, project slug. Just the fields the triage agent needs.
  3. Maps project to repo. A lookup table routes each Sentry project to the correct GitHub repo (monorepo or standalone).
  4. Dispatches to GitHub. A repository_dispatch event with the metadata flattened into client_payload.
sentry-relay/index.mjs

import { createHmac } from 'node:crypto';

const PROJECT_REPO_MAP = {
  'api-server': 'your-org/your-monorepo',
  'web-client': 'your-org/your-monorepo',
  'mobile-app': 'your-org/your-monorepo',
  'data-service': 'your-org/data-service',
};
const DEFAULT_REPO = 'your-org/your-monorepo';

function verifySignature(body, signature, secret) {
  const hmac = createHmac('sha256', secret);
  hmac.update(body, 'utf8');
  const expected = hmac.digest('hex');
  return signature === expected;
}

export async function handler(event) {
  const { GITHUB_TOKEN, SENTRY_CLIENT_SECRET } = process.env;

  // Verify Sentry webhook signature
  const signature = event.headers?.['sentry-hook-signature'];
  if (signature && !verifySignature(event.body, signature, SENTRY_CLIENT_SECRET)) {
    return { statusCode: 401, body: 'Invalid signature' };
  }

  const payload = JSON.parse(event.body);

  // Only process triggered issue alerts
  if (payload.action !== 'triggered' || !payload.data?.issue) {
    return { statusCode: 200, body: 'Skipped: not an issue alert' };
  }

  const issue = payload.data.issue;
  const project = issue.project?.slug || 'api-server';
  const repo = PROJECT_REPO_MAP[project] || DEFAULT_REPO;

  // Flatten metadata into client_payload
  const clientPayload = {
    title: issue.title,
    short_id: issue.shortId || '',
    url: `https://sentry.io/organizations/your-org/issues/${issue.id}/`,
    level: issue.level,
    culprit: issue.culprit || '',
    first_seen: issue.firstSeen,
    project,
  };

  // Dispatch to GitHub
  await fetch(`https://api.github.com/repos/${repo}/dispatches`, {
    method: 'POST',
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${GITHUB_TOKEN}`,
    },
    body: JSON.stringify({
      event_type: 'sentry-alert',
      client_payload: clientPayload,
    }),
  });

  return { statusCode: 200, body: `Dispatched to ${repo}` };
}
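One hardening tweak worth considering (a sketch, not part of the relay above): the plain `===` comparison of hex digests can leak timing information, since string comparison short-circuits at the first mismatched character. Node's `timingSafeEqual` compares full buffers in constant time:

```javascript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Constant-time variant of verifySignature. A plain === comparison can
// reveal how many leading bytes matched through response timing;
// timingSafeEqual always compares the full buffers.
function verifySignatureSafe(body, signature, secret) {
  const expected = createHmac('sha256', secret).update(body, 'utf8').digest();
  const received = Buffer.from(signature ?? '', 'hex');
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

The length check before `timingSafeEqual` matters: the function throws if the buffers differ in length, so a malformed signature should fail fast rather than crash the handler.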

Wiring the Workflow

The GitHub Actions workflow handles both triggers with a single job. The trick is conditional expressions that change the triage parameters based on how the workflow was invoked:

.github/workflows/sweny-triage.yml

name: SWEny Triage

on:
  schedule:
    - cron: '0 14 1-31/2 * *'   # every 2 days, 10am ET
  repository_dispatch:
    types: [sentry-alert]       # from the relay Lambda
  workflow_dispatch:            # manual trigger for testing

permissions:
  contents: write
  issues: write
  pull-requests: write

jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: swenyai/triage@v1
        with:
          claude-oauth-token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}

          # Observability — both Loki and Sentry queried in parallel
          observability-provider: 'loki,sentry'
          sentry-project: >-
            ${{ github.event.client_payload.project || 'api-server' }}

          # Issue tracking
          issue-tracker-provider: linear
          linear-api-key: ${{ secrets.LINEAR_API_KEY }}
          linear-team-id: ${{ vars.LINEAR_TEAM_ID }}

          # Tuning — reactive vs scheduled
          time-range: >-
            ${{ github.event_name == 'repository_dispatch' && '1h' || '48h' }}
          service-filter: >-
            ${{ github.event.client_payload.project || '*' }}
          investigation-depth: >-
            ${{ github.event_name == 'repository_dispatch' && 'thorough' || 'standard' }}

          # Context injection for reactive mode
          additional-instructions: >-
            ${{ github.event_name == 'repository_dispatch'
                && format(
                  'REACTIVE TRIAGE — Sentry alert. Focus on: {0} | {1} | {2}',
                  github.event.client_payload.short_id,
                  github.event.client_payload.title,
                  github.event.client_payload.url)
                || 'SCHEDULED TRIAGE — scan all services.' }}

The key lines are the ternary expressions. When github.event_name == 'repository_dispatch', the workflow narrows its scope: 1-hour window instead of 48, single service instead of all, thorough investigation instead of standard. The Sentry metadata from the relay's client_payload gets injected directly into the agent's instructions.
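One note on those expressions: GitHub's expression syntax has no true ternary operator, so `cond && x || y` stands in for one. The idiom has a well-known edge case, easiest to see in plain JavaScript, where it evaluates the same way: if the "then" value is itself falsy, the `||` arm wins regardless of the condition.

```javascript
// The && / || pseudo-ternary from the workflow, in JavaScript form.
// Safe here because '1h' is always truthy.
const timeRange = (isReactive) => (isReactive && '1h') || '48h';

// Edge case: a falsy "then" value falls through to the || arm,
// so never use '' or false as the "then" branch of this idiom.
const gotcha = (true && '') || '48h'; // '48h', not ''
```

All the "then" values in the workflow above ('1h', 'thorough', the formatted instructions string) are non-empty strings, so the idiom is safe as written.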

Same workflow. Same DAG. Same nodes. The parameters just change the aperture.

What Broke (And What That Taught Us)

Here's the part nobody writes about: the first four end-to-end runs failed. Not because the architecture was wrong (the relay, dispatch, and workflow all worked fine). The failures were in the agent's behavior inside the DAG nodes.

Each failure revealed a pattern of how LLM agents silently go wrong when you give them multi-step instructions:

Failure 1: Scope Creep

The investigate node was supposed to analyze errors and classify findings. Instead, it decided to be helpful and started creating Linear issues and opening PRs, jobs that belong to downstream nodes. The DAG's structure says "investigate first, then file tickets," but the agent doesn't see the DAG. It sees its instructions and a set of available tools. If the tools are there and the agent thinks it would be helpful to use them, it will.

Failure 2: Skipped Verification

The investigate node is supposed to check the issue tracker for duplicates before classifying a finding as "novel." In practice, the agent looked at the context from the gather node, saw that it already had enough information, and reasoned its way out of making any tool calls. It classified findings as novel based purely on its own judgment without actually searching. Result: duplicate tickets for an issue that already existed in Linear.

Failure 3: Tool Name Collision

SWEny exposes MCP tools like github_create_pr and linear_create_issue. But Claude Code also ships its own built-in native tools like create_pull_request and get_issue. When the create_issue node was told to "not create PRs," it obeyed for the MCP tool but found the native tool with a different name and used that instead. The instruction was followed literally but not in spirit.

Failure 4: Missing Idempotency

The same Sentry alert can fire multiple times. On the second trigger, the create_issue node found the existing ticket from the first run but didn't know what to do with it. The node was written to create issues, not to handle the "already exists" case. The verify check then failed because no create tool was called.

The Verify Pattern

Prompt instructions alone don't prevent these failures. You can write "you MUST search the issue tracker" in bold caps, and the agent will still sometimes skip the search if it thinks it already has enough context. The solution is structural: verify post-conditions that check what the agent actually did, not what it said it did.

SWEny's workflow nodes support a verify block that runs after the agent completes. It inspects the tool call log and fails the node if required actions weren't taken:

triage.yml — verify blocks

nodes:
  investigate:
    name: Root Cause Analysis
    instruction: >-
      Classify findings as novel or duplicate. You MUST search
      the issue tracker before classifying anything as novel.
    verify:
      # If the agent made 0 search calls, it skipped the
      # novelty check entirely — fail and retry.
      any_tool_called:
        - linear_search_issues
        - github_search_issues

  create_issue:
    name: Create Issues
    instruction: >-
      Create Linear issues for novel findings. First check if a
      prior run already created a matching issue.
    verify:
      any_tool_called:
        - linear_create_issue
        - github_create_issue
        - linear_search_issues   # idempotency search
        - linear_add_comment     # +1 on duplicate

  create_pr:
    name: Open Pull Request
    instruction: >-
      Push the branch and open a PR using github_create_pr.
    verify:
      any_tool_called:
        - github_create_pr

The any_tool_called check is simple: at least one of the listed tools must have been called successfully during the node's execution. If none were, the node fails and gets retried with feedback about what was missing.
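The mechanics are easy to picture. A minimal sketch of such a check, assuming a tool-call log of `{ name, ok }` entries (a hypothetical shape; SWEny's internal log format may differ):

```javascript
// Hypothetical tool-call log entry: { name: string, ok: boolean }.
// Passes only if at least one allowed tool was called AND succeeded.
function anyToolCalled(log, allowed) {
  return log.some((call) => call.ok && allowed.includes(call.name));
}

const log = [
  { name: 'linear_search_issues', ok: true },
  { name: 'linear_create_issue', ok: false }, // a failed call doesn't count
];
```

The check deliberately ignores everything about the agent's prose output; only the recorded tool calls matter.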

This is the key insight: you can't trust an LLM to follow process instructions reliably, but you can verify the artifacts it produced. Did it actually call the search tool? Did it actually create a ticket? Did it actually open a PR? These are binary checks on the tool call log, not subjective evaluations of output quality.

Scope Boundaries in Instructions

Verify catches omissions. For scope creep (doing too much), you need explicit boundaries in the instructions. Every node now ends with a scope block:

Scope boundaries

investigate:
  instruction: >-
    ...analysis instructions...

    IMPORTANT — scope boundaries for this node:
    - DO NOT create issues. The create_issue node handles that.
    - DO NOT create branches, commits, or pull requests.
    - DO NOT call linear_create_issue, github_create_issue,
      create_pull_request, or github_create_pr.
    - Your ONLY job is read, search, classify, and output.

Note that both the MCP tool names (github_create_pr) and the native tool names (create_pull_request) are listed. You have to be explicit about both because the agent sees both in its tool inventory.

Idempotency: Same Alert, No Duplicate Tickets

Reactive triage creates a problem that scheduled triage doesn't have: the same error can trigger multiple webhooks. A spike of 500s might fire Sentry's alert rule three times in an hour. Without idempotency handling, that's three identical Linear tickets.

The fix is an idempotency check at the top of the create_issue node:

Idempotency in create_issue

create_issue:
  instruction: >-
    For each NOVEL finding:

    1. First, check if a prior triage run already created an issue
       for this exact bug. Search the issue tracker with the error
       message or root cause.

       If a matching issue already exists:
       - DO NOT create a new issue.
       - Populate issueIdentifier, issueTitle, and issueUrl from
         the existing issue.
       - Add a "+1" comment if appropriate.
       - Set the action to "updated" in the issues array.

    2. If no matching issue exists, create a new one.

The verify block was widened to accept search and comment tools alongside create tools. The node passes whether it creates a new issue or finds an existing one. Both are valid outcomes.
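If you want a second layer of defense, repeats can also be suppressed at the relay itself (a sketch, not part of the Lambda above): remember recently dispatched Sentry issue IDs and skip duplicates within a window. Lambda container memory isn't durable or shared across containers, so this only reduces duplicate dispatches; the agent-side idempotency check remains the real guarantee.

```javascript
// Hypothetical relay-side dedupe. Per-container memory only: a cold
// start or a second concurrent container will still let repeats
// through, which the create_issue node's search then absorbs.
const seen = new Map(); // issueId -> last dispatch timestamp (ms)
const WINDOW_MS = 60 * 60 * 1000; // suppress repeats for 1 hour

function shouldDispatch(issueId, now = Date.now()) {
  const last = seen.get(issueId);
  if (last !== undefined && now - last < WINDOW_MS) return false;
  seen.set(issueId, now);
  return true;
}
```

In the handler, this would gate the `fetch` to GitHub: dispatch only when `shouldDispatch(issue.id)` returns true.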

The Full Picture

After five E2E test runs and four upstream framework PRs, the reactive triage pipeline runs end to end: Sentry alert → relay Lambda → repository_dispatch → focused triage run.

What the Agent Did on the First Successful Reactive Run

  1. Gather: Pulled the Sentry error details and recent Loki logs for the affected service. Checked recent commits and PRs for related changes.
  2. Investigate: Made 10 tool calls, searching Linear for matching issues by error message, module path, and symptom. Found an existing ticket with the same root cause. Classified the finding as a duplicate.
  3. Skip: Added a "+1, seen again" comment on the existing issue with new context from the latest occurrence.
  4. Notify: Posted a summary. No new ticket, no PR, no noise. Exactly right.

The agent correctly identified a duplicate on its first reactive run. That's the verify pattern working: the structural check forced it to actually search before classifying, and the search revealed the existing ticket.

Setting Up Sentry Alert Rules

On the Sentry side, you need an Internal Integration and alert rules that POST to the Lambda's Function URL:

  1. Create an Internal Integration in Sentry (Settings → Integrations → Internal). Give it read access to Issues and Projects. Copy the Client Secret for HMAC verification.
  2. Add a Webhook URL: your Lambda Function URL.
  3. Create Alert Rules per project. Set conditions that match your needs, e.g., "when a new issue is created" or "when an issue is seen more than 10 times in 1 hour." Use the Internal Integration as the action.
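Before wiring up real alert rules, you can smoke-test the relay locally by invoking the handler with a minimal fake event. The fixture below covers only the fields the relay code actually reads; real Sentry issue-alert payloads carry many more (the field names here mirror the handler above, not the full Sentry schema):

```javascript
// Minimal fake Function URL event for local testing of the relay.
// No sentry-hook-signature header, so signature verification is
// skipped by the handler's `if (signature && ...)` guard.
const fakeEvent = {
  headers: {},
  body: JSON.stringify({
    action: 'triggered',
    data: {
      issue: {
        id: '123',
        title: 'TypeError: cannot read properties of undefined',
        shortId: 'API-1A2B',
        level: 'error',
        culprit: 'src/routes/checkout',
        firstSeen: '2026-04-18T12:00:00Z',
        project: { slug: 'api-server' },
      },
    },
  }),
};
```

With GITHUB_TOKEN pointed at a scratch repo, `await handler(fakeEvent)` should produce a repository_dispatch event you can see in that repo's Actions tab.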

Resources

Prior posts in this series and everything you need to get started.

Related Posts

The Triage Agent

The setup guide: how to configure SWEny triage from scratch with Loki, Linear, and GitHub Actions.

The Pipeline Is Dead

The philosophy: why deterministic DAGs with LLM agents replaced months of AWS pipeline engineering.

SWEny

SWEny Docs

Official documentation for workflows, skills, the CLI, and provider configuration.

SWEny GitHub

Source code for the SWEny CLI and @sweny-ai/core library.

SWEny Triage Action

The swenyai/triage@v1 GitHub Action for autonomous SRE triage.

Workflow Spec

The SWEny workflow language specification, including verify blocks and structured outputs.

Infrastructure

Sentry Internal Integrations

How to create a Sentry Internal Integration for webhook-based alerts.

GitHub repository_dispatch

GitHub's API for triggering workflow runs from external systems.