From Cron to Real-Time: Hardening an Autonomous Triage Agent

AI
SRE
Observability
SWEny
Sentry
Agentic Workflows
DevOps
AWS Lambda
2026-04-18

The Gap

In The Triage Agent, I showed how to set up an autonomous agent that watches your logs, files tickets, and opens fix PRs. It runs on a cron schedule, sweeping your observability stack every few hours and acting on what it finds.

That works. But it has a blind spot: timing. A cron that fires every 6 hours means a critical error can sit unnoticed for up to 6 hours. You could shorten the interval, but then you're burning CI minutes on runs that find nothing. The scheduled sweep is wide (it scans everything) but it's slow to react.

What you actually want is both: a wide scheduled scan that catches patterns over time, and a focused reactive trigger that fires the moment something breaks. Two sides of the same coin.

Two Modes of the Same Agent

The same triage workflow handles both modes. The difference is scope, not structure:

Scheduled triage is a patrol. Reactive triage is a dispatch. The patrol catches the things nobody noticed. The dispatch responds to the thing that just happened. You want both.

The Relay: Sentry to GitHub Actions

The problem: Sentry can fire webhooks, but GitHub Actions can't listen on a URL. Actions are triggered by events inside GitHub: pushes, PRs, schedules, or repository_dispatch. So you need something in the middle that receives the Sentry webhook and converts it into a GitHub dispatch event.

The relay is a single Lambda behind a Function URL. No API Gateway needed; Function URLs give you an HTTPS endpoint for free. The Lambda does four things:

  1. Verifies the Sentry signature. HMAC-SHA256 against the integration's client secret. Rejects tampered payloads.
  2. Extracts issue metadata. Title, short ID, level, culprit, project slug. Just the fields the triage agent needs.
  3. Maps project to repo. A lookup table routes each Sentry project to the correct GitHub repo (monorepo or standalone).
  4. Dispatches to GitHub. A repository_dispatch event with the metadata flattened into client_payload.
sentry-relay/index.mjs

import { createHmac } from 'node:crypto';

const PROJECT_REPO_MAP = {
  'api-server': 'your-org/your-monorepo',
  'web-client': 'your-org/your-monorepo',
  'mobile-app': 'your-org/your-monorepo',
  'data-service': 'your-org/data-service',
};
const DEFAULT_REPO = 'your-org/your-monorepo';

function verifySignature(body, signature, secret) {
  const hmac = createHmac('sha256', secret);
  hmac.update(body, 'utf8');
  const expected = hmac.digest('hex');
  return signature === expected;
}

export async function handler(event) {
  const { GITHUB_TOKEN, SENTRY_CLIENT_SECRET } = process.env;

  // Verify Sentry webhook signature
  const signature = event.headers?.['sentry-hook-signature'];
  if (signature && !verifySignature(event.body, signature, SENTRY_CLIENT_SECRET)) {
    return { statusCode: 401, body: 'Invalid signature' };
  }

  const payload = JSON.parse(event.body);

  // Only process triggered issue alerts
  if (payload.action !== 'triggered' || !payload.data?.issue) {
    return { statusCode: 200, body: 'Skipped: not an issue alert' };
  }

  const issue = payload.data.issue;
  const project = issue.project?.slug || 'api-server';
  const repo = PROJECT_REPO_MAP[project] || DEFAULT_REPO;

  // Flatten metadata into client_payload
  const clientPayload = {
    title: issue.title,
    short_id: issue.shortId || '',
    url: `https://sentry.io/organizations/your-org/issues/${issue.id}/`,
    level: issue.level,
    culprit: issue.culprit || '',
    first_seen: issue.firstSeen,
    project,
  };

  // Dispatch to GitHub
  await fetch(`https://api.github.com/repos/${repo}/dispatches`, {
    method: 'POST',
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${GITHUB_TOKEN}`,
    },
    body: JSON.stringify({
      event_type: 'sentry-alert',
      client_payload: clientPayload,
    }),
  });

  return { statusCode: 200, body: `Dispatched to ${repo}` };
}
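One hardening tweak worth considering (a sketch, not part of the relay above): the plain `===` comparison of hex digests can leak timing information, since string comparison short-circuits at the first mismatched character. Node's `timingSafeEqual` compares full buffers in constant time:

```javascript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Constant-time variant of verifySignature. A plain === comparison can
// reveal how many leading bytes matched through response timing;
// timingSafeEqual always compares the full buffers.
function verifySignatureSafe(body, signature, secret) {
  const expected = createHmac('sha256', secret).update(body, 'utf8').digest();
  const received = Buffer.from(signature ?? '', 'hex');
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

The length check before `timingSafeEqual` matters: the function throws if the buffers differ in length, so a malformed signature should fail fast rather than crash the handler.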

Wiring the Workflow

The GitHub Actions workflow handles both triggers with a single job. The trick is conditional expressions that change the triage parameters based on how the workflow was invoked:

.github/workflows/sweny-triage.yml

name: SWEny Triage

on:
  schedule:
    - cron: '0 14 1-31/2 * *'   # every 2 days, 10am ET
  repository_dispatch:
    types: [sentry-alert]       # from the relay Lambda
  workflow_dispatch:            # manual trigger for testing

permissions:
  contents: write
  issues: write
  pull-requests: write

jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: swenyai/triage@v1
        with:
          claude-oauth-token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}

          # Observability — both Loki and Sentry queried in parallel
          observability-provider: 'loki,sentry'
          sentry-project: >-
            ${{ github.event.client_payload.project || 'api-server' }}

          # Issue tracking
          issue-tracker-provider: linear
          linear-api-key: ${{ secrets.LINEAR_API_KEY }}
          linear-team-id: ${{ vars.LINEAR_TEAM_ID }}

          # Tuning — reactive vs scheduled
          time-range: >-
            ${{ github.event_name == 'repository_dispatch' && '1h' || '48h' }}
          service-filter: >-
            ${{ github.event.client_payload.project || '*' }}
          investigation-depth: >-
            ${{ github.event_name == 'repository_dispatch' && 'thorough' || 'standard' }}

          # Context injection for reactive mode
          additional-instructions: >-
            ${{ github.event_name == 'repository_dispatch'
                && format(
                  'REACTIVE TRIAGE — Sentry alert. Focus on: {0} | {1} | {2}',
                  github.event.client_payload.short_id,
                  github.event.client_payload.title,
                  github.event.client_payload.url)
                || 'SCHEDULED TRIAGE — scan all services.' }}

The key lines are the ternary expressions. When github.event_name == 'repository_dispatch', the workflow narrows its scope: 1-hour window instead of 48, single service instead of all, thorough investigation instead of standard. The Sentry metadata from the relay's client_payload gets injected directly into the agent's instructions.
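One note on those expressions: GitHub's expression syntax has no true ternary operator, so `cond && x || y` stands in for one. The idiom has a well-known edge case, easiest to see in plain JavaScript, where it evaluates the same way: if the "then" value is itself falsy, the `||` arm wins regardless of the condition.

```javascript
// The && / || pseudo-ternary from the workflow, in JavaScript form.
// Safe here because '1h' is always truthy.
const timeRange = (isReactive) => (isReactive && '1h') || '48h';

// Edge case: a falsy "then" value falls through to the || arm,
// so never use '' or false as the "then" branch of this idiom.
const gotcha = (true && '') || '48h'; // '48h', not ''
```

All the "then" values in the workflow above ('1h', 'thorough', the formatted instructions string) are non-empty strings, so the idiom is safe as written.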

Same workflow. Same DAG. Same nodes. The parameters just change the aperture.

What Broke (And What That Taught Us)

Here's the part nobody writes about: the first four end-to-end runs failed. Not because the architecture was wrong (the relay, dispatch, and workflow all worked fine). The failures were in the agent's behavior inside the DAG nodes.

Each failure revealed a pattern of how LLM agents silently go wrong when you give them multi-step instructions:

Failure 1: Scope Creep

The investigate node was supposed to analyze errors and classify findings. Instead, it decided to be helpful and started creating Linear issues and opening PRs, jobs that belong to downstream nodes. The DAG's structure says "investigate first, then file tickets," but the agent doesn't see the DAG. It sees its instructions and a set of available tools. If the tools are there and the agent thinks it would be helpful to use them, it will.

Failure 2: Skipped Verification

The investigate node is supposed to check the issue tracker for duplicates before classifying a finding as "novel." In practice, the agent looked at the context from the gather node, saw that it already had enough information, and reasoned its way out of making any tool calls. It classified findings as novel based purely on its own judgment without actually searching. Result: duplicate tickets for an issue that already existed in Linear.

Failure 3: Tool Name Collision

SWEny exposes MCP tools like github_create_pr and linear_create_issue. But Claude Code also ships its own built-in native tools like create_pull_request and get_issue. When the create_issue node was told to "not create PRs," it obeyed for the MCP tool but found the native tool with a different name and used that instead. The instruction was followed literally but not in spirit.

Failure 4: Missing Idempotency

The same Sentry alert can fire multiple times. On the second trigger, the create_issue node found the existing ticket from the first run but didn't know what to do with it. The node was written to create issues, not to handle the "already exists" case. The verify check then failed because no create tool was called.

The Verify Pattern

Prompt instructions alone don't prevent these failures. You can write "you MUST search the issue tracker" in bold caps, and the agent will still sometimes skip the search if it thinks it already has enough context. The solution is structural: verify post-conditions that check what the agent actually did, not what it said it did.

SWEny's workflow nodes support a verify block that runs after the agent completes. It inspects the tool call log and fails the node if required actions weren't taken:

triage.yml — verify blocks

nodes:
  investigate:
    name: Root Cause Analysis
    instruction: >-
      Classify findings as novel or duplicate. You MUST search
      the issue tracker before classifying anything as novel.
    verify:
      # If the agent made 0 search calls, it skipped the
      # novelty check entirely — fail and retry.
      any_tool_called:
        - linear_search_issues
        - github_search_issues

  create_issue:
    name: Create Issues
    instruction: >-
      Create Linear issues for novel findings. First check if a
      prior run already created a matching issue.
    verify:
      any_tool_called:
        - linear_create_issue
        - github_create_issue
        - linear_search_issues   # idempotency search
        - linear_add_comment     # +1 on duplicate

  create_pr:
    name: Open Pull Request
    instruction: >-
      Push the branch and open a PR using github_create_pr.
    verify:
      any_tool_called:
        - github_create_pr

The any_tool_called check is simple: at least one of the listed tools must have been called successfully during the node's execution. If none were, the node fails and gets retried with feedback about what was missing.
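The mechanics are easy to picture. A minimal sketch of such a check, assuming a tool-call log of `{ name, ok }` entries (a hypothetical shape; SWEny's internal log format may differ):

```javascript
// Hypothetical tool-call log entry: { name: string, ok: boolean }.
// Passes only if at least one allowed tool was called AND succeeded.
function anyToolCalled(log, allowed) {
  return log.some((call) => call.ok && allowed.includes(call.name));
}

const log = [
  { name: 'linear_search_issues', ok: true },
  { name: 'linear_create_issue', ok: false }, // a failed call doesn't count
];
```

The check deliberately ignores everything about the agent's prose output; only the recorded tool calls matter.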

This is the key insight: you can't trust an LLM to follow process instructions reliably, but you can verify the artifacts it produced. Did it actually call the search tool? Did it actually create a ticket? Did it actually open a PR? These are binary checks on the tool call log, not subjective evaluations of output quality.

Scope Boundaries in Instructions

Verify catches omissions. For scope creep (doing too much), you need explicit boundaries in the instructions. Every node now ends with a scope block:

Scope boundaries

investigate:
  instruction: >-
    ...analysis instructions...

    IMPORTANT — scope boundaries for this node:
    - DO NOT create issues. The create_issue node handles that.
    - DO NOT create branches, commits, or pull requests.
    - DO NOT call linear_create_issue, github_create_issue,
      create_pull_request, or github_create_pr.
    - Your ONLY job is read, search, classify, and output.

Note that both the MCP tool names (github_create_pr) and the native tool names (create_pull_request) are listed. You have to be explicit about both because the agent sees both in its tool inventory.

Idempotency: Same Alert, No Duplicate Tickets

Reactive triage creates a problem that scheduled triage doesn't have: the same error can trigger multiple webhooks. A spike of 500s might fire Sentry's alert rule three times in an hour. Without idempotency handling, that's three identical Linear tickets.

The fix is an idempotency check at the top of the create_issue node:

Idempotency in create_issue

create_issue:
  instruction: >-
    For each NOVEL finding:

    1. First, check if a prior triage run already created an issue
       for this exact bug. Search the issue tracker with the error
       message or root cause.

       If a matching issue already exists:
       - DO NOT create a new issue.
       - Populate issueIdentifier, issueTitle, and issueUrl from
         the existing issue.
       - Add a "+1" comment if appropriate.
       - Set the action to "updated" in the issues array.

    2. If no matching issue exists, create a new one.

The verify block was widened to accept search and comment tools alongside create tools. The node passes whether it creates a new issue or finds an existing one. Both are valid outcomes.
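If you want a second layer of defense, repeats can also be suppressed at the relay itself (a sketch, not part of the Lambda above): remember recently dispatched Sentry issue IDs and skip duplicates within a window. Lambda container memory isn't durable or shared across containers, so this only reduces duplicate dispatches; the agent-side idempotency check remains the real guarantee.

```javascript
// Hypothetical relay-side dedupe. Per-container memory only: a cold
// start or a second concurrent container will still let repeats
// through, which the create_issue node's search then absorbs.
const seen = new Map(); // issueId -> last dispatch timestamp (ms)
const WINDOW_MS = 60 * 60 * 1000; // suppress repeats for 1 hour

function shouldDispatch(issueId, now = Date.now()) {
  const last = seen.get(issueId);
  if (last !== undefined && now - last < WINDOW_MS) return false;
  seen.set(issueId, now);
  return true;
}
```

In the handler, this would gate the `fetch` to GitHub: dispatch only when `shouldDispatch(issue.id)` returns true.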

The Full Picture

After five E2E test runs and four upstream framework PRs, the reactive triage pipeline runs end to end: Sentry alert → relay Lambda → repository_dispatch → focused triage run.

What the Agent Did on the First Successful Reactive Run

  1. Gather: Pulled the Sentry error details and recent Loki logs for the affected service. Checked recent commits and PRs for related changes.
  2. Investigate: Made 10 tool calls, searching Linear for matching issues by error message, module path, and symptom. Found an existing ticket with the same root cause. Classified the finding as a duplicate.
  3. Skip: Added a "+1, seen again" comment on the existing issue with new context from the latest occurrence.
  4. Notify: Posted a summary. No new ticket, no PR, no noise. Exactly right.

The agent correctly identified a duplicate on its first reactive run. That's the verify pattern working: the structural check forced it to actually search before classifying, and the search revealed the existing ticket.

Setting Up Sentry Alert Rules

On the Sentry side, you need an Internal Integration and alert rules that POST to the Lambda's Function URL:

  1. Create an Internal Integration in Sentry (Settings → Integrations → Internal). Give it read access to Issues and Projects. Copy the Client Secret for HMAC verification.
  2. Add a Webhook URL: your Lambda Function URL.
  3. Create Alert Rules per project. Set conditions that match your needs, e.g., "when a new issue is created" or "when an issue is seen more than 10 times in 1 hour." Use the Internal Integration as the action.
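Before wiring up real alert rules, you can smoke-test the relay locally by invoking the handler with a minimal fake event. The fixture below covers only the fields the relay code actually reads; real Sentry issue-alert payloads carry many more (the field names here mirror the handler above, not the full Sentry schema):

```javascript
// Minimal fake Function URL event for local testing of the relay.
// No sentry-hook-signature header, so signature verification is
// skipped by the handler's `if (signature && ...)` guard.
const fakeEvent = {
  headers: {},
  body: JSON.stringify({
    action: 'triggered',
    data: {
      issue: {
        id: '123',
        title: 'TypeError: cannot read properties of undefined',
        shortId: 'API-1A2B',
        level: 'error',
        culprit: 'src/routes/checkout',
        firstSeen: '2026-04-18T12:00:00Z',
        project: { slug: 'api-server' },
      },
    },
  }),
};
```

With GITHUB_TOKEN pointed at a scratch repo, `await handler(fakeEvent)` should produce a repository_dispatch event you can see in that repo's Actions tab.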

Resources

Prior posts in this series and everything you need to get started.

Related Posts

The Triage Agent

The setup guide: how to configure SWEny triage from scratch with Loki, Linear, and GitHub Actions.

The Pipeline Is Dead

The philosophy: why deterministic DAGs with LLM agents replaced months of AWS pipeline engineering.

SWEny

SWEny Docs

Official documentation for workflows, skills, the CLI, and provider configuration.

SWEny GitHub

Source code for the SWEny CLI and @sweny-ai/core library.

SWEny Triage Action

The swenyai/triage@v1 GitHub Action for autonomous SRE triage.

Workflow Spec

The SWEny workflow language specification, including verify blocks and structured outputs.

Infrastructure

Sentry Internal Integrations

How to create a Sentry Internal Integration for webhook-based alerts.

GitHub repository_dispatch

GitHub's API for triggering workflow runs from external systems.