# Keploy MCP server

End-to-end API testing — generate and run tests from OpenAPI, curl, Postman, or real user traffic.

## Links
- Registry page: https://www.getdrio.com/mcp/io-github-keploy-mcp
- Repository: https://github.com/keploy/keploy
- Website: https://keploy.io/docs/running-keploy/agent-test-generation/

## Install
- Endpoint: https://api.keploy.io/client/v1/mcp
- Auth: required (per registry metadata)

## Setup notes
- Remote header: Authorization (required; secret)
- The upstream registry signals required auth or secrets.

## Tools
- bulkDeleteTestSuites - POST /apps/{appId}/test-suites/bulk-delete — Bulk-delete test suites — Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- createAPIKey - POST /api-keys — Create an API key — Requires scope: `admin`. The raw key is returned only once in the response. Endpoint: https://api.keploy.io/client/v1/mcp
- createApp - POST /apps — Create an app — Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- createCIBranch - POST /apps/{appId}/branches/ci — Find-or-create a CI branch — Creates a new Keploy branch (or returns the existing one) for a CI pipeline run. Idempotent on the (appId, name) pair. Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- createMock - POST /apps/{appId}/recordings/{testSetId}/mocks — Author one mock under a recording — Insert a single mock into the given test set. When `branch_id` is
supplied, the mock lands on that branch's overlay
(`branch_sandbox_ops`) and only surfaces to main on merge.
Without `branch_id` the mock writes straight to main — same
behaviour as the recording-driven agent path.

Authoring shape — pick ONE:

- **`mock_yaml`** (PREFERRED) — paste the canonical mock YAML
  envelope (`version` / `kind` / `name` / `spec` with the
  per-kind payload, exactly as it lives in `mocks.yaml` on
  disk). The server decodes via OSS DecodeMocks so kind-
  specific Spec contents (`req`, `resp`, `metadata`, …)
  round-trip without field-name loss. This is the only path
  that preserves payloads pasted from existing mocks.
- **`mock`** — typed OSS Mock JSON object. Brittle: the OSS
  struct uses PascalCase JSON tags (`Metadata`, `Req`, `Res`),
  so lowercase canonical keys are silently dropped. Use only
  when authoring programmatically from typed Go shapes.

When both are sent, `mock_yaml` wins.

Requires scope: `write`.
 Endpoint: https://api.keploy.io/client/v1/mcp
- createTestSuite - POST /apps/{appId}/test-suites — Create a test suite — Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- create_branch - Create a Keploy branch on an app — find-or-create on name conflict.

Pass the dev's CURRENT GIT BRANCH as the name. Detect it BEFORE calling this tool:
  Bash: `git rev-parse --abbrev-ref HEAD` in the dev's app_dir.
  Exit non-zero or output "HEAD" → not in a git repo / detached HEAD; ASK the dev for a name and re-call.

Find-or-create semantics: if a writable Keploy branch with that name already exists, it's returned (no error, no duplicate). Idempotent — safe to call on every retry.

Output: {branch_id, name, status, created} where created=true means a fresh branch was minted, created=false means an existing one was reused.

Pass the returned branch_id to subsequent write tools (create_test_suite, replay_sandbox_test, record_sandbox_test, etc.). Branch_id is sticky for the conversation — don't re-call create_branch unless the dev's git branch changed. Endpoint: https://api.keploy.io/client/v1/mcp
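
The branch-name detection described above can be sketched as a small pure helper. The function name `interpretGitOutput` is hypothetical and not part of Keploy; only the `git rev-parse --abbrev-ref HEAD` command and its two failure cases (non-zero exit, literal `HEAD` output) come from the text.

```javascript
// Hypothetical helper: map the exit code and stdout of
// `git rev-parse --abbrev-ref HEAD` to a branch name.
// Returns null when the dev has to be asked for a name instead
// (not a git repo, git failed, or detached HEAD).
function interpretGitOutput(exitCode, stdout) {
  const name = String(stdout || '').trim();
  if (exitCode !== 0 || name === '' || name === 'HEAD') {
    return null;
  }
  return name;
}
```

In Node.js the two inputs could come from `child_process.spawnSync('git', ['rev-parse', '--abbrev-ref', 'HEAD'], { cwd: appDir, encoding: 'utf8' })`, using its `status` and `stdout` fields. A `null` result means the create_branch call should wait until the dev supplies a name.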
- create_mock - Author a single mock inside a test set. Use this when the dev wants to add a hand-written downstream-call recording (HTTP response, DB result, etc.) without re-running the full record session.

Authoring shape — pick ONE:
  * `mock_yaml` (PREFERRED) — canonical single-doc YAML envelope (version / kind / name / spec). Round-trips kind-specific contents losslessly.
  * `mock` — typed OSS Mock object { Name, Kind, Spec }. Brittle: lowercase canonical keys are dropped during decode; use only when authoring from typed Go shapes.

When both are sent, `mock_yaml` wins. `branch_id` is REQUIRED — direct writes to main via MCP are blocked. Resolve via list_branches / create_branch before calling.

Returns the persisted mock. Idempotent on (Name, test_set_id, branch) at the storage layer — a retry with the same name and branch returns the existing row instead of duplicating. Endpoint: https://api.keploy.io/client/v1/mcp
- create_test_suite - Create a new API test suite with test steps. Each step defines an HTTP request and assertions to validate the response. Steps can extract values from responses into variables for chaining requests.

═══════════════════════════════════════════════════════════════════
STEP 0 — read the canonical schema BEFORE drafting:
═══════════════════════════════════════════════════════════════════

If you've already called get_app_testing_context, the canonical step schema is in its response under the `step_schema` field — read it from there. Otherwise run `keploy test-suite-format` once before writing any suite JSON. The schema describes the MANDATORY rules below in detail plus the two-step prelude+POST skeleton you must follow. Authors who skip this and draft from training-data priors burn ~50s per validator rejection on iter 1.

═══════════════════════════════════════════════════════════════════
MANDATORY FOR EVERY STEP — the validator rejects on iter 1 if any of these are violated:
═══════════════════════════════════════════════════════════════════

  R10 — every step MUST carry a captured "response": {status, body, headers} block. Hit the endpoint
        locally before authoring (curl) and paste the real response. Steps with no response block are
        rejected outright; four downstream rules (R4 / R11 / R15-R16 / R27) silently no-op until R10
        is satisfied, so missing a response also hides every assertion / extract problem in that step.
  R9  — every POST / PUT / PATCH body MUST reference at least one {{var}} whose generator is declared
        on an EARLIER step's "extract" (typically a /health prelude as step 0). Without this, the
        second run collides on the first run's database state. App-level appLevelCustomVariables DO
        NOT qualify for R9 — the validator only credits step-level extracts.
  R2  — pre-request fields ("body", "url", "headers") CANNOT reference the CURRENT step's own
        "extract" outputs. Extract runs AFTER the response comes back; pre-request substitution sees
        nothing yet. Together R9 + R2 force the prelude pattern: declare generators on step 0, use
        them from step 1+. The STEP SHAPE example below shows the canonical two-step layout.
  R15 — every assertion's path / status / header MUST resolve against the AUTHORED response block.
        JSONPath uses gjson dot-array syntax: $.orders.0.id  — NOT $.orders[0].id (the bracket form
        does not resolve in gjson; the assertion is rejected as "key not present in recorded body").
        For status_code / header_* assertions, the values must match what's in response.status /
        response.headers verbatim — capture the real response via curl before authoring.
  R32 — every step-level extract key MUST NOT collide with the app's appLevelCustomVariables
        (enumerate them via get_app_testing_context or getApp before authoring). The runtime's
        variable lookup resolves app-level first, so a colliding key means the suite's function
        silently never runs. Suite-suffix when in doubt: userNonceForSuite, not genUserId. Don't
        invent a parallel generator with the same name as an existing app-level one.
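
A rough pre-flight lint for four of the rules above (R10, R2, R9, R32) might look like the following. This is a sketch under stated assumptions, not Keploy's validator: `lintSteps`, `referencedVars`, and the message strings are hypothetical, steps are assumed to follow the field names used in this description (`method`, `url`, `headers`, `body`, `extract`, `response`), and a generator is recognised only by the keyword `function` in its extract value.

```javascript
// Hypothetical pre-flight lint; it only approximates the rules in the text.
const STEP_VAR_RE = /\{\{(\w+)\}\}/g;

// Collect every {{name}} referenced in a piece of text.
function referencedVars(text) {
  return new Set([...String(text || '').matchAll(STEP_VAR_RE)].map((m) => m[1]));
}

function lintSteps(steps, appLevelNames = []) {
  const problems = [];
  const earlierGenerators = new Set();
  steps.forEach((step, i) => {
    const extract = step.extract || {};
    const own = Object.keys(extract);
    const preRequest = [step.url, step.body, JSON.stringify(step.headers || {})].join(' ');
    const preRefs = referencedVars(preRequest);
    // R10: every step needs a captured response block.
    if (!step.response) problems.push(`step ${i}: R10 missing response`);
    // R2: pre-request fields cannot use this step's own extract keys.
    for (const key of own) {
      if (preRefs.has(key)) problems.push(`step ${i}: R2 same-step use of ${key}`);
    }
    // R9: mutating bodies must reference a generator from an EARLIER step.
    if (['POST', 'PUT', 'PATCH'].includes(step.method)) {
      const bodyRefs = [...referencedVars(step.body)];
      if (!bodyRefs.some((name) => earlierGenerators.has(name))) {
        problems.push(`step ${i}: R9 body has no earlier-step generator`);
      }
    }
    // R32: extract keys must not shadow app-level custom variables.
    for (const key of own) {
      if (appLevelNames.includes(key)) problems.push(`step ${i}: R32 ${key} collides`);
    }
    // Only now do this step's generators become "earlier" ones.
    for (const key of own) {
      if (String(extract[key]).includes('function')) earlierGenerators.add(key);
    }
  });
  return problems;
}
```

An empty result does not mean the real validator will accept the suite; it only catches the most mechanical mistakes before a round trip.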

═══════════════════════════════════════════════════════════════════
APP CONFIG FIRST — read the app before authoring:
═══════════════════════════════════════════════════════════════════

Before any other step, call getApp({app_id}) and read these fields:

  * appLevelCustomVariables — dynamic generators and static fixtures pre-configured by the dev, shared across every suite for this app. Common shapes:
      - genUserId, genProductName (JS functions returning fresh entropy per run, e.g. `alice_<rand1-10000>`)
      - staticUser (a fixed user the dev wants tests to use)
      - zeroQuantity, negativePrice, invalidUser (static fixtures for validation tests)
    PREFER these over inventing your own JS-function in `extract`. They're the dev's authoritative dynamic-input set — using them in POST/PUT/PATCH bodies via `{{varName}}` means each replay hits a fresh row, sidestepping duplicate-key errors. Inventing a parallel generator with the same intent risks name-collision rejection (see Name-collision check below).
  * auth — the auth shape suites must satisfy (header / cookie / oauth / none).
  * ignoreEndpoints, rateLimit, timeout — runtime knobs that shape what assertions can hold.

If a relevant `gen*` already exists in appLevelCustomVariables, ALWAYS reference it via `{{name}}` rather than authoring a parallel one. The dev configured it for a reason.

═══════════════════════════════════════════════════════════════════
BEFORE CREATING — check for duplicates AND for existing recordings:
═══════════════════════════════════════════════════════════════════

An (app_id, branch_id) tuple holds at most one suite per scenario. Also, if the dev has already captured the relevant traffic via `keploy record`, seed from that recording instead of curling the app fresh. Run two bounded checks before create_test_suite:

(1) Duplicate-suite check — call listTestSuites({app_id, branch_id, q: "<scenario-keyword>"}) where <scenario-keyword> is a substring of the name you're about to author (e.g. "checkout", "auth"). The server filters by name regex, so the response is bounded to relevant matches regardless of how many suites the app has. If you can't pick a keyword (the dev's intent is vague), call with page_size=20 and NO q, then scan the first page only — DON'T paginate further.
    Match by name (case-insensitive) AND by intent. If any existing suite covers the same scenario:
      - Same scenario, refresh wanted → call update_test_suite (preserves history) or delete_test_suite + create_test_suite (loses history).
      - Adjacent but distinct scenario (e.g. "checkout with discount" vs "checkout without discount") → create with a name that distinguishes them clearly.

(2) Recording-reuse check — call listRecordings({app_id, limit: 10}) to fetch the 10 most recent `keploy record` sessions. Recordings cluster by scenario; the top 10 cover what's likely relevant — DON'T paginate the full history. For any recording whose name/timestamp suggests it covers the scenario you're authoring, call download_recording({app_id, test_set_id}) to pull its captured test cases (real request/response pairs from the live app). Seed your steps_json from those test cases — convert each one into a step (method/url/headers/body → request fields; recorded response → step's `response` field). This is more faithful than re-curling and saves the dev's time.
    If no recent recording covers the scenario, fall through to the normal validate-locally-before-inserting flow (curl each endpoint yourself).

(3) Only after both checks → proceed with create_test_suite.

Skipping (1) leaves the dev with two suites covering the same flow — confusing reports, double rerecord cost, and orphaned sandbox tests on whichever suite they stop using. Skipping (2) re-curls endpoints whose traffic the dev already captured.

═══════════════════════════════════════════════════════════════════
ONE SCENARIO PER SUITE — load-bearing constraint:
═══════════════════════════════════════════════════════════════════

A suite represents EXACTLY ONE user-facing scenario / use-case (e.g. "user registers and creates their first order", "admin promotes a user role", "checkout with discount applied"). Do NOT pack multiple unrelated scenarios into a single suite — every step in a suite shares state and ordering with every other step. Mixing scenarios breaks idempotency (cleanup for one scenario can wipe state another scenario assumed), makes failures harder to diagnose, and inflates rerecord cost.

Tests for "auth + payments + cleanup" → THREE suites, not one. Related steps that share extracted vars and a state assumption belong in the same suite; unrelated flows don't.

When in doubt: if you can't write a single sentence describing what the suite tests in user-facing terms, split it.

═══════════════════════════════════════════════════════════════════
IDEMPOTENCY CONTRACT — the load-bearing rule for every suite:
═══════════════════════════════════════════════════════════════════

Every suite MUST be replayable indefinitely without state drift. The same suite run twice in a row, or 100 times back-to-back, must produce the same per-step outcomes. Failing this makes the suite useless for sandbox replay (the captured mocks freeze a single point-in-time response, so any state-dependent step diverges on rerun).

How to design for it:
  * Duplicate-key 500 on POST / PUT / PATCH replay ("duplicate key" / "already exists" / "unique constraint violated") is ALWAYS a SUITE design problem, NEVER an app problem. Fix order:
      (1) Reference an app-level `gen*` var via `{{name}}` in the body — works if one exists (you read appLevelCustomVariables in APP CONFIG FIRST).
      (2) If no fitting app-level generator, declare your own JS-function in a PRIOR step's `extract` (see PRELUDE PATTERN); reference it via `{{name}}` in the failing step's body.
      (3) Add a DELETE cleanup step earlier in the suite to clear the conflicting row.
    NEVER propose modifying app code (e.g. adding `ON CONFLICT` to the INSERT, retry loops, transactional wrappers). The app's dedup is correct; the suite is what's missing entropy. See DO NOT MODIFY APP SOURCE CODE below.
  * If a step CREATES a resource, a later step in the same suite MUST clean it up (DELETE the row, revert the state) — OR the create must be idempotent on the server side (PUT-by-key, upsert). A naked POST that always allocates a new ID will diverge on every replay.
  * If a step depends on a resource, EXTRACT its identity from a prior step's response into a `{{var}}` — never hard-code an ID that "happens to exist right now". Hard-coded IDs rot.
  * Reject "natural-language idempotency" reasoning ("the dev will reset the DB before each run"). The suite must work without external setup. If you can't guarantee it, you've packed two scenarios into one suite — split them.
  * Do not assume time-of-day, ordering relative to other suites, or random-but-stable values. Each suite is its own universe.
  * Pagination / list endpoints: extract the count or a known item, don't assert on absolute indices ("the third item is X") — index drifts as the dataset grows.
  * Auth tokens: pull from app-level custom variables or extract from a login step IN THE SUITE. Never inline a token that expires.

If the dev's request implies non-idempotent behaviour (e.g. "create user, then test that creating the same user fails"), capture both states explicitly inside the suite — first step creates, second step asserts the conflict response, third step deletes — so the suite as a whole is still replayable. Don't push the cleanup outside the suite.

A suite that fails idempotency is rejected at create_test_suite time by the dynamic validator, which checks two live runs. When that happens, do NOT retry by tweaking syntax; restructure the scenario.
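
The "create, assert the conflict, then delete" layout described above could be drafted as follows. This is an illustrative sketch: the `/api/users` endpoint, the 409 status, and every response body are placeholders standing in for output captured from the dev's app. It also relies on the statement elsewhere in this description that a generator resolves to the same value for every step within one run.

```javascript
// Hypothetical suite for "creating the same user twice fails".
// Endpoint, statuses and response bodies are placeholders, not real captures.
const jsonHeaders = { 'Content-Type': 'application/json' };
const userBody = JSON.stringify({ name: '{{genUserId}}' });

const duplicateUserSuite = [
  {
    // Step 0: prelude that only declares the generator.
    name: 'health prelude (declares generators)',
    method: 'GET',
    url: '/health',
    headers: { Accept: 'application/json' },
    extract: {
      genUserId: "function genUserId(){return 'u_'+Date.now()+'_'+Math.random().toString(36).slice(2,8);}",
    },
    assert: [{ type: 'status_code', expected: '200' }],
    response: { status: 200, headers: jsonHeaders, body: '{"status":"healthy"}' },
  },
  {
    // Step 1: first create succeeds and captures the server-side id.
    name: 'create user',
    method: 'POST',
    url: '/api/users',
    headers: jsonHeaders,
    body: userBody,
    extract: { user_id: '$.data.id' },
    assert: [{ type: 'status_code', expected: '201' }],
    response: { status: 201, headers: jsonHeaders, body: '{"data":{"id":"abc-123"},"status":"created"}' },
  },
  {
    // Step 2: same body again, the conflict is the expected outcome.
    name: 'create same user again (conflict expected)',
    method: 'POST',
    url: '/api/users',
    headers: jsonHeaders,
    body: userBody,
    assert: [{ type: 'status_code', expected: '409' }],
    response: { status: 409, headers: jsonHeaders, body: '{"status":"conflict"}' },
  },
  {
    // Step 3: cleanup inside the suite keeps it replayable.
    name: 'delete user (cleanup)',
    method: 'DELETE',
    url: '/api/users/{{user_id}}',
    headers: { Accept: 'application/json' },
    assert: [{ type: 'status_code_in', expected: '200,204' }],
    response: { status: 204, headers: {}, body: '' },
  },
];
```

Because the cleanup step lives in the suite itself, a second run starts from the same state as the first.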

═══════════════════════════════════════════════════════════════════
DO NOT MODIFY APP SOURCE CODE during suite authoring:
═══════════════════════════════════════════════════════════════════

At create_test_suite time, your job is to author a suite that fits the app AS IT IS. You may patch the app's CONFIG (auth, appLevelCustomVariables, ignoreEndpoints, rateLimit) via updateApp({app_id, ...}) — those are runtime knobs the dev expects to tune. You may NOT modify the app's SOURCE CODE.

If a step is failing because of how the app behaves (500s, contract mismatches, missing endpoints, validation errors), the response is ONE of:
  * Adjust the suite to match observed behavior (steps_json edits before insert).
  * Use an app-level dynamic var (see APP CONFIG FIRST) or a JS-function generator to avoid the failure (see IDEMPOTENCY CONTRACT's duplicate-key fix order).
  * Patch the app's CONFIG via updateApp if the cause is auth / vars / rate limit.
  * If the dev confirms the app is broken AND the suite is correct, ASK the dev to fix the app — do NOT propose code changes yourself during authoring.

NEVER propose ON CONFLICT clauses, retry loops, transactional wrappers, or any code-level change to the dev's application as a way to make the suite work. The suite must accommodate the app, not the other way around.

═══════════════════════════════════════════════════════════════════
NEVER-MISS-THESE — the validator HARD-REJECTS suites missing any of these:
═══════════════════════════════════════════════════════════════════

  1. `response` on EVERY step — { status: <int>, headers: {…}, body: "<string>" }.
     Captured from a real curl against the dev's app. body MUST be a JSON-encoded STRING (the raw body bytes), NOT a parsed object. Wrap with json.dumps / JSON.stringify if your tool gave you a dict.

  2. `extract` is the ONLY authoring slot — never `extract_variables`.
     `extract_variables` is a post-run runtime SNAPSHOT field; the extract_variables-input rejection rule hard-rejects it on input. If you read an existing suite via getTestSuite / get_app_testing_context / download_recording and see `extract_variables` populated there — IGNORE IT, that's the runtime's display state, not the suite's input. Always author with `extract`.

  3. POST/PUT/PATCH bodies need a per-run dynamic `{{var}}` (mutating-step dynamism check).
     Declare a JS-function generator on a PRIOR step's `extract` (typically a /health prelude as step 0). The declare-and-use-same-step check forbids declaring-and-using on the same step. See PRELUDE PATTERN below.

  4. JSONPath uses gjson dot-array syntax: `$.orders.0.user_id` — NOT `$.orders[0].user_id`.
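
For paths drafted in bracket form, a tiny converter to the dot-array form required above might look like this. The helper name `toDotArrayPath` is hypothetical; it rewrites numeric indices only and leaves everything else untouched.

```javascript
// Hypothetical helper: rewrite numeric bracket indices ($.orders[0].id)
// into the dot-array form the text requires ($.orders.0.id).
// Quoted keys, wildcards and filters are not handled.
function toDotArrayPath(path) {
  return path.replace(/\[(\d+)\]/g, '.$1');
}
```

For example, `toDotArrayPath('$.orders[0].user_id')` yields `$.orders.0.user_id`, and a path that is already in dot-array form passes through unchanged.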

═══════════════════════════════════════════════════════════════════

STEP SHAPE (steps_json is an ARRAY; copy this two-step skeleton verbatim apart from the // comments, which are reader annotations and not valid JSON, and preserve the prelude pattern):
  [
    {
      // Step 0: cheap read prelude. Its sole job is to declare JS-function generators that
      // later POST/PUT/PATCH bodies reference. Required by R9 (mutating bodies need a per-run
      // dynamic var) + R2 (same-step extract isn't usable in pre-request fields). If your
      // suite already has a natural read step (/health, /me, version), reuse it as the prelude.
      "name": "health prelude (declares generators)",
      "method": "GET",
      "url": "/health",
      "headers": { "Accept": "application/json" },
      "extract": {
        "genUserId": "function genUserId(){return 'u_'+Date.now()+'_'+Math.random().toString(36).slice(2,8);}"
      },
      "assert": [
        { "type": "status_code", "expected": "200" },
        { "type": "json_equal", "key": "$.status", "expected": "healthy" }
      ],
      "response": {
        "status": 200,
        "headers": { "Content-Type": "application/json" },
        "body": "{\"status\":\"healthy\"}"
      }
    },
    {
      // Step 1: the actual mutation. Body references {{genUserId}} from the PRIOR step's
      // extract — satisfies R9 (per-run dynamic var) and R2 (not same-step). This step's
      // own "extract" captures the SERVER's response value (JSONPath) so a later step can
      // chain to {{user_id}} — JSONPath captures on the same step ARE legal because they
      // resolve post-response and only matter for subsequent steps.
      "name": "create user",
      "method": "POST",
      "url": "/api/users",
      "headers": { "Content-Type": "application/json" },
      "body": "{\"name\":\"{{genUserId}}\"}",
      "extract": {
        "user_id": "$.data.id"
      },
      "assert": [
        { "type": "status_code", "expected": "201" },
        // assert a STATIC field of the response, not a dynamic one. R30
        // forbids {{genUserId}} in assert.expected (the runtime would
        // re-evaluate the function at assertion time and the value
        // wouldn't match the body's earlier call). Pick something the
        // server always returns the same — here the literal "status"
        // field. To assert against the dynamic id the server minted,
        // capture it via extract (above) and reference {{user_id}} in
        // a LATER step's assertion or url, not this step's.
        { "type": "json_equal", "key": "$.status", "expected": "created" }
      ],
      "response": {
        "status": 201,
        "headers": { "Content-Type": "application/json" },
        "body": "{\"data\":{\"id\":\"abc-123\",\"name\":\"u_1700000000_xyz\"},\"status\":\"created\"}"
      }
    }
  ]

VALID assertion types (ONLY use these — anything else fails the step at runtime with "invalid assertion type"):
  * status_code          — exact HTTP status match.      {type, expected:"201"}
  * status_code_class    — match by class 2xx/3xx/…     {type, expected:"2xx"}
  * status_code_in       — any of a set. DELETE STEPS ONLY (status_code_in-scope check). For POST/GET/PUT/PATCH this is rejected. If you reach for it to absorb a duplicate-key 500 on re-runs, the right fix is a JS-function {{var}} in the body (see `extract` rules below) so each run hits a fresh row and only 201 is ever returned.   {type, expected:"200,201,204"}
  * header_equal         — response header exact match.  {type, key:"Content-Type", expected:"application/json"}
  * header_contains      — header value substring.       {type, key:"Location", expected:"/orders/"}
  * header_exists        — header is present.            {type, key:"X-Request-Id"}
  * header_matches       — header regex.                 {type, key:"Etag", expected:"^W/\\\".+\\\"$"}
  * json_equal           — response body JSON path text match (string-compares the value at the path; type-blind, so number 2 matches string "2" and ["pending"] matches "pending"). {type, key:"$.order.status", expected:"created"}
  * json_strict_equal    — response body JSON path TYPE-STRICT deep-equal. Catches shape mutations json_equal misses: number↔string-of-number, bool↔string-of-bool, null↔"", scalar↔single-element array. `expected` MUST be a JSON-typed literal (NOT a quoted string) for non-string types: `expected: 2` (number), `expected: true` (bool), `expected: null`, `expected: ["a","b"]`, `expected: "hello"` (string). Use this when the test is meant to catch wire-shape regressions. {type, key:"$.order.amount", expected: 99.5}
  * json_contains        — response body JSON path substring/partial. {type, key:"$.message", expected:"success"}
  * custom_functions     — inline JS function: (request, response, variables, steps) => boolean.
                           {type, expected:"function f(request,response){return response.status===201;}"}
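
The difference between json_equal and json_strict_equal can be pictured with a rough model. This is not Keploy's implementation; `looseText`, `looseJsonEqual`, and `strictJsonEqual` are hypothetical names, and the loose rules are inferred only from the behaviours listed in the two entries above.

```javascript
// Rough model only, inferred from the descriptions in the list above.
// Loose form: null becomes "", single-element arrays are unwrapped,
// and everything else is compared as text.
function looseText(value) {
  if (value === null) return '';
  if (Array.isArray(value) && value.length === 1) return looseText(value[0]);
  return typeof value === 'string' ? value : JSON.stringify(value);
}

// Models json_equal: type-blind text comparison.
function looseJsonEqual(actual, expected) {
  return looseText(actual) === looseText(expected);
}

// Models json_strict_equal: same JSON type and same contents.
// Comparing serialised forms is key-order sensitive for objects,
// which is acceptable for a sketch.
function strictJsonEqual(actual, expected) {
  return JSON.stringify(actual) === JSON.stringify(expected);
}
```

Under this model a response that changes `2` to `"2"` still passes the loose check and fails the strict one, which is the kind of wire-shape regression the strict type is described as catching.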

DO NOT use any assertion type not in the closed list above. The set is fixed at exactly 11 entries — there are no wildcards. These are types AIs commonly invent that DO NOT EXIST in keploy and will fail the assertion-type closed-list check:
  ✗ json_type         — there is no type-of check; assert against the literal value via json_equal, or use custom_functions with a typeof predicate.
  ✗ json_path         — paths are passed via the "key" field of json_equal / json_contains; there is no separate path-only type.
  ✗ json_schema       — no schema validation; closest is custom_functions with an inline schema check.
  ✗ json_array_length — no length-only assertion; capture .length via extract, or use custom_functions.
  ✗ header_starts / header_ends — only header_equal, header_contains, header_exists, header_matches (regex) exist.
  ✗ status_in / status_range — the real names are status_code_in / status_code_class.
  ✗ body / body_equal — no body-level type; assert against parsed paths via json_equal / json_contains, or use custom_functions.

Anything not literally in the bulleted list above will get rejected by the validator — don't extrapolate from prefixes.
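
A local check against the closed list could be sketched like this. The set contains exactly the 11 type names from this description; the helper name `lintAssertions` and its messages are hypothetical, and the DELETE-only rule for status_code_in is taken from that type's entry.

```javascript
// The 11 assertion types named in this description.
const VALID_ASSERTION_TYPES = new Set([
  'status_code', 'status_code_class', 'status_code_in',
  'header_equal', 'header_contains', 'header_exists', 'header_matches',
  'json_equal', 'json_strict_equal', 'json_contains', 'custom_functions',
]);

// Hypothetical helper: report assertion types outside the closed list,
// and status_code_in used on anything other than a DELETE step.
function lintAssertions(steps) {
  const problems = [];
  steps.forEach((step, i) => {
    for (const assertion of step.assert || []) {
      if (!VALID_ASSERTION_TYPES.has(assertion.type)) {
        problems.push(`step ${i}: unknown assertion type ${assertion.type}`);
      } else if (assertion.type === 'status_code_in' && step.method !== 'DELETE') {
        problems.push(`step ${i}: status_code_in is only allowed on DELETE`);
      }
    }
  });
  return problems;
}
```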

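`expected` values must be STRINGS (put numbers like 201 in quotes). The one exception is json_strict_equal, which takes a JSON-typed literal as described in its entry above. `expected_string` is auto-populated; you can omit it.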

VARIABLES — purpose-first: a suite is a SCENARIO CHAIN; variables carry continuity between steps. Step N creates or fetches a resource → extracts its identity into a named var → step N+M uses `{{var}}` to reference that identity. If you find yourself extracting a value that NO LATER STEP references, DROP the extract — it's noise that hides which fields actually drive the scenario. Mechanically: extract values from one step's response with `extract: {varname: "$.path"}` (JSONPath). Reference later with `{{varname}}` in headers, body, url, or assertion `expected` values.

STEP IDS & TRACKING HEADERS are auto-injected — don't provide them. The server assigns a UUID per step and adds X-Keploy-Test-Step-ID / X-Keploy-Test-Suite-ID / Keploy-Test-Name so the sandbox runner can correlate responses to steps.

VARIABLE RULES (the runner follows these exactly — see pkg/service/atg/customFeatures.go ResolveCustomVariables):
  * Syntax: {{name}} — regex matched: {{(\w+)}} (letters, digits, and underscore only; NO whitespace, hyphens, or dots inside the braces). Names like {{gen-user}} or {{gen.user}} will NOT be substituted — use {{gen_user}} instead.
  * Substitution happens in: url, body, headers values, AND assertion "expected" values (so an assertion expecting {{genUserId}} gets the SAME resolved value the body used).
  * Resolution sources (looked up in this order):
      1. The step's own `extract` map (seeded into the vars pool at step entry — pkg/service/atg/core.go:3698).
      2. Variables produced by EARLIER steps' `extract` maps (post-response JSONPath captures).
      3. App-level custom variables (stored on the app record, shared across all suites).
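
The `{{name}}` syntax rule can be illustrated with a minimal substitution sketch that uses the pattern quoted above. The function `substituteVars` is hypothetical; it takes one flat map of already-resolved values and leaves unknown names in place, which is an assumption of this sketch rather than documented runner behaviour.

```javascript
// Hypothetical sketch of {{name}} substitution using the quoted pattern.
// Only \w characters (letters, digits, underscore) can appear in a name.
function substituteVars(template, vars) {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}
```

With this pattern `{{gen_user}}` is replaced, while `{{gen-user}}` and `{{gen.user}}` never match and stay in the output verbatim.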

THE `extract` FIELD IS THE ONLY AUTHORING SLOT — use it for BOTH static values and JS-function generators.

  `extract_variables` IS NOT AN AUTHORING SLOT. It's a post-run runtime SNAPSHOT — the runner writes resolved {{var}} values there after each step executes so the UI can show what landed at runtime. **The validator now HARD-REJECTS any step with `extract_variables` populated (extract_variables-input rejection).** If you see `extract_variables` while reading an existing suite via getTestSuite / download_recording / get_app_testing_context, IGNORE IT — that's the runtime's display state, not the suite's input. To author the equivalent, put every entry into `extract` instead (same keys, same values: JSONPath strings stay JSONPath, JS-function strings stay JS).

TWO SHAPES THE `extract` FIELD ACCEPTS — pick the right one:

(a) JSONPath capture — `"order_id": "$.order.id"`
    Evaluated against the step's recorded response.body after the request returns. The captured value is staged into vars for LATER steps to reference via {{order_id}}. Use this when the value you need is in the server's response.

(b) Inline JS-function generator — `"genUserId": "function genUserId(){ return 'alice_' + Date.now() + '_' + Math.random().toString(36).slice(2,8); }"`
    The value string must contain the keyword `function` — that is how the runner (core.go:4107 isInlineJs branch) distinguishes JS from a JSONPath. Signature: function <name>(steps) { ... return '<string>'; } returning a string. The `steps` arg is a map of prior-step {request,response} snapshots; ignore it if unused. Use this for inputs that must be unique per run (user_ids, timestamps, uuids) so the suite stays idempotent on re-runs against the same DB.

    Examples:
      {"genUserId":  "function genUserId() { return 'alice_' + Date.now() + '_' + Math.random().toString(36).slice(2,8); }"}
      {"genTs":      "function genTs() { return String(Date.now()) + ('000000' + Math.floor(Math.random()*1e6)).slice(-6); }"}
      {"genOrderId": "function genOrderId(steps) { return 'ord-' + Math.random().toString(36).slice(2,10); }"}

Put the JS-function entry on the FIRST step that needs it (often a health-check step whose own body doesn't reference the var — that's fine, the seed fires on step entry regardless). Later steps reference `{{genUserId}}` in body/url/headers/assertion-expected and see the same resolved value within one run, a fresh value on the next run.
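
The two `extract` shapes can be told apart, and a generator tried out at authoring time, with a sketch like the one below. Both function names are hypothetical. `classifyExtract` mirrors the stated rule that a value containing the keyword `function` is treated as JS; `evalGenerator` is only a local convenience and says nothing about how the runner evaluates the source.

```javascript
// Mirrors the rule in the text: a value containing the keyword `function`
// is a JS generator, anything else is treated as a JSONPath capture.
function classifyExtract(value) {
  return value.includes('function') ? 'js' : 'jsonpath';
}

// Hypothetical authoring-time helper: compile the generator source and call
// it once. `steps` stands in for the prior-step snapshots argument.
function evalGenerator(source, steps = {}) {
  const generator = new Function(`return (${source});`)();
  return String(generator(steps));
}
```

Calling `evalGenerator` on one of the example strings gives a concrete value with the same shape a run would produce, which is useful when one real request has to be sent while drafting a step.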

WHY THIS MATTERS (the mistake to avoid): if you pin a static user_id like "alice_1776638347063146000" into `extract` AND your validation curl ALREADY inserted that row, the very next record/sandbox replay run will fire the same POST body, the producer (with deterministic ids or unique-constraint indexes) will reject the duplicate with a 500, and every downstream $.order.* / $.shard.* assertion will hit <missing>. Fix: use a JS-function entry so every run gets a fresh user_id.

VARIABLE CHAINING (JS-function generator on step 1, JSONPath capture for step-2 chaining):
    step 1:
      body: {"user_id":"alice_{{genUserId}}","product_name":"Keyboard_{{genUserId}}"}
      extract: {
        "genUserId": "function genUserId(){ return Date.now() + '_' + Math.random().toString(36).slice(2,8); }",
        "order_id":  "$.order.id"
      }
      assert: [
        {type:"status_code", expected:"201"},
        {type:"json_equal", key:"$.order.user_id", expected:"alice_{{genUserId}}"}   -- SAME resolved value as the body used
      ]
    step 2:
      url: /api/orders/{{order_id}}                  -- resolves from step 1's JSONPath extract at run time
      assert: [ {type:"json_equal", key:"$.id", expected:"{{order_id}}"} ]

PRELUDE PATTERN — when MULTIPLE POST steps each need their OWN per-run dynamic var:

A common mistake is to put the JS-function generator on the SAME step that uses it in the request body. The declare-and-use-same-step check rejects this — same-step `extract` is post-response, so its values aren't in scope when the request fires. Pattern that works: declare the generator on an EARLIER step's `extract` (typically a cheap /health GET as a "prelude"). The runner seeds extract values at STEP ENTRY, so a generator on step 0 is in scope from step 0 onwards — every later POST can reference it.

    step 0 — prelude (the extract entry is what matters; the step itself can be anything cheap):
      method: GET, url: /health
      extract: { "uniq": "function uniq(){ return 'p_'+Date.now()+'_'+Math.random().toString(36).slice(2,8); }" }
      assert: [ {type:"status_code", expected:"200"} ]

    step 1, 2, 3 — POSTs that all reference {{uniq}}:
      method: POST, url: /api/orders
      body: '{"user_id":"alice","product_name":"{{uniq}}",...}'
      // (no `extract` needed — the generator is in scope from step 0)

The prelude itself doesn't need to USE the var; declaring it is enough. This is the right shape for "create N orders with different unique keys" — without the prelude, you'd hit the declare-and-use-same-step check on every POST that tries to declare-and-use a generator on the same step.

VALIDATE-LOCALLY-BEFORE-INSERTING (CRITICAL for a usable suite):
DO NOT call this tool with raw un-tested steps. EVERY step you send MUST have its "response" and "extract" fields populated from a live run. These are NOT optional. Without them:
  - The UI cannot render the step (shows an empty panel).
  - The rerecord runs blind and fails.
  - The step's {{variables}} won't resolve.

Required per-step fields when calling this tool (in steps_json):
  • name, method, url, headers, assert — obviously
  • body — for POST/PUT/PATCH; MUST reference random inputs as {{varname}} placeholders, NOT inline timestamps. Inline timestamps get baked into the suite and collide on re-run.
  • extract — MUST contain a resolvable entry for every {{varname}} you reference in body/url/headers. JS-function entries are fine (they're the canonical "dynamic input" shape); JSONPath entries chain values from one step's response into later steps. Example: body has {"user_id":"alice_{{genUserId}}"} → step's extract must have {"genUserId":"function genUserId(){ ... }"}.
  • response — the raw captured response from your local curl. Shape: {"body":"<raw string>","status":201,"headers":{"Content-Type":"application/json",...}}.

MANDATORY validate-locally flow (do this BEFORE calling create_test_suite):
  1. Bring the dev's app up locally (Bash: docker compose up -d, or instruct the dev). Wait for /health readiness.
  2. For EACH step in order (simulating what the runner will do):
      a. For dynamic inputs (user_id, timestamps, uuids): DON'T inline a value — write a JS function into the step's `extract` map, e.g. {"genUserId":"function genUserId(){return 'alice_'+Date.now()+'_'+Math.random().toString(36).slice(2,8);}"}, and reference it in body/url/headers/assertion-expected as {{genUserId}}.
         For the local curl, you still need a CONCRETE value for that run — so as you're building the step, JS-eval the function yourself (or just pick a value consistent with the function's output shape) to do ONE concrete local curl for capturing the response. That concrete value goes ONLY into the captured "response" body — NOT into `extract` (which keeps the JS function verbatim).
      b. Substitute {{name}} everywhere in the step's url/body/headers using accumulated variables (this step's extract + earlier steps' extract results).
      c. curl the SUBSTITUTED request against the live app. Capture the response.
      d. Check each "assert" against the captured response. If any fails → regenerate (different inputs / loosen assertion / change body shape) and retry this step. DO NOT move on with a failing step.
      e. Save the captured response into the step's "response" field as {"body":"<raw string>","status":<int>,"headers":{...}}.
      f. If the step has JSONPath entries in `extract`, evaluate each path against the response and note those values so later steps can use them in their {{var}} substitutions.
  3. AFTER the first pass of all steps, run the WHOLE SEQUENCE a SECOND TIME against the same live app — no DB wipe in between. Because you're using JS-function generators, each run should pick fresh random inputs → no unique-constraint collision. If ANY step that passed the 1st run fails the 2nd run (common symptom: 500 "failed to save order" / "duplicate key" / "already exists"), the suite is NOT idempotent. Go back to step 2a and either (i) increase the entropy of the JS function, (ii) restructure the step to be READ-AFTER-WRITE instead of POST-then-POST, (iii) drop the step if it genuinely can't be made idempotent.
  4. Only once every step passes BOTH validation runs → call create_test_suite. Pass each step's response + extract (with JS functions verbatim) through steps_json.

If you call create_test_suite without response + extract on every step, you are creating a suite that is broken by construction. The UI + rerecord WILL fail.
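
A quick completeness pass over the draft can catch missing fields before the tool is called. This is a minimal sketch under the rules stated above (a `response` of the documented shape, and an `extract` entry in scope for every placeholder in url, body, or headers). `incompleteSteps` is a hypothetical helper, not part of Keploy.

```javascript
// Collect {{placeholder}} names from any string field.
function placeholders(text) {
  return [...String(text ?? "").matchAll(/\{\{\s*([A-Za-z_$][\w$]*)\s*\}\}/g)].map((m) => m[1]);
}

// Report what is missing from each step before it goes into steps_json.
function incompleteSteps(steps) {
  const earlier = new Set(); // extract keys declared by previous steps
  const report = [];
  for (const step of steps) {
    const missing = [];
    const r = step.response;
    const shapeOk = r && typeof r.body === "string" && Number.isInteger(r.status) &&
      typeof r.headers === "object" && r.headers !== null;
    if (!shapeOk) missing.push("response");

    const own = Object.keys(step.extract || {});
    const inScope = new Set([...earlier, ...own]);
    const texts = [step.url, step.body, ...Object.values(step.headers || {})];
    for (const name of texts.flatMap((t) => placeholders(t))) {
      if (!inScope.has(name)) missing.push(`extract.${name}`);
    }
    if (missing.length > 0) report.push({ step: step.name, missing });
    own.forEach((key) => earlier.add(key));
  }
  return report;
}

const draft = [
  {
    name: "create", method: "POST", url: "/api/orders", headers: {},
    body: '{"user_id":"alice_{{genUserId}}"}',
    extract: { genUserId: "function genUserId(){ return 'u_'+Date.now(); }" },
    // no `response` captured yet
  },
];
console.log(incompleteSteps(draft)); // flags "create" as missing its response
```

An empty result only means the fields are present. It says nothing about whether the captured response came from a real run, so the local curl flow is still needed.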

SELF-CONTAINED TESTS (required for repeat runs):
  * The suite will be re-run many times (local validation + record + sandbox replay + ad-hoc UI runs). It must not depend on prior state.
  * Put random inputs in `extract` JS functions with high entropy (timestamps, uuid). Plain "alice" will collide on re-run against producers with dedupe.
  * Prefer READ-AFTER-WRITE chaining: POST creates resource → extract id → GET uses id. Validates without depending on PRE-EXISTING ambient seed data (rows that "happen to exist" in the dev's DB).

SEED → TESTED → CLEANUP roles within a suite:

When the scenario is "user can read X" or "user can list X", you can't assert against ambient state — there's no guarantee the X exists when the suite runs. Pattern:

  step seed:    POST /X (with dynamic {{var}} body)  → extract id
  step tested:  GET /X (or list)                      → assert against the seeded id
  step cleanup: DELETE /X/{{id}}                      → restore baseline

Only the "tested" step is what the suite is FOR; the seed and cleanup steps are scaffolding so the test can run any number of times against any starting state. Without an explicit seed step, you're either testing nothing (empty list) or relying on pre-existing data the next replay won't have.

DO NOT assert "the user has 3 orders" — that's ambient state. Seed N orders inside the suite first, then assert the count. Same applies to any "list / search / count" scenario: seed the data the test depends on, never assume it's there.

═══════════════════════════════════════════════════════════════════
SUBSTITUTION RULES — where {{var}} is and isn't allowed
═══════════════════════════════════════════════════════════════════

`extract` values come in two flavours and they substitute very differently:

  TYPE A — JS-function generator (`extract: { genUserId: "function genUserId(){return 'apple_'+Date.now()}" }`):
     The runtime STORES the source string and RE-RUNS the function on EVERY `{{genUserId}}` substitution site. Each call returns a fresh value. ONLY safe in POST / PUT / PATCH request BODIES — that's the one place you actually want a fresh value (uniqueness for inserts).

  TYPE B — JSONPath extract (`extract: { createdId: "$.order.user_id" }`):
     Evaluated ONCE against the step's recorded response, the resulting STRING is stored. Subsequent `{{createdId}}` substitutions resolve to that same fixed string. Safe everywhere — URLs, assertions, downstream bodies.
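
The difference between the two flavours can be illustrated with a small resolver. This is a sketch of the described behaviour, not the Keploy runtime: `makeSubstituter` is a hypothetical name, JSONPath evaluation is skipped (already-resolved TYPE B values are passed in), and `new Function` is used only because the entries are author-written strings, so it should never be fed untrusted input.

```javascript
const PLACEHOLDER = /\{\{\s*([A-Za-z_$][\w$]*)\s*\}\}/g;

// Build a substitute() function for one run.
// `extract`  : authored entries (JS-function source or JSONPath string).
// `resolved` : TYPE B values already evaluated against a recorded response.
function makeSubstituter(extract, resolved) {
  const generators = new Map();
  for (const [name, source] of Object.entries(extract)) {
    if (/^\s*function\b/.test(source)) {
      // TYPE A: compile once, then call at every substitution site.
      generators.set(name, new Function(`return (${source});`)());
    }
  }
  const fixed = new Map(Object.entries(resolved));
  return (template) =>
    template.replace(PLACEHOLDER, (whole, name) => {
      if (generators.has(name)) return String(generators.get(name)()); // fresh value
      if (fixed.has(name)) return String(fixed.get(name)); // same string every time
      return whole; // unknown placeholder: leave untouched
    });
}

const substitute = makeSubstituter(
  {
    genUserId: "function genUserId(){ return 'apple_'+Date.now()+'_'+Math.random().toString(36).slice(2,8); }",
    createdId: "$.order.user_id",
  },
  { createdId: "apple_123" } // hypothetical value captured from a response
);

// Two sites, two independent generator calls.
console.log(substitute('{"a":"{{genUserId}}","b":"{{genUserId}}"}'));
console.log(substitute("/api/orders?user_id={{createdId}}")); // prints /api/orders?user_id=apple_123
```

Because each `{{genUserId}}` site triggers its own call, a generator placed in both a body and a later URL would produce two unrelated values, which is why only TYPE B values are safe outside request bodies.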

ALLOWED placements for `{{generatorFn}}` (TYPE A):
  ✓ POST / PUT / PATCH request body — the canonical "give me a fresh value to insert" use case.

FORBIDDEN placements for `{{generatorFn}}` (TYPE A) — the validator REJECTS these (generator-placement checks):
  ✗ `assert[*].expected` — the assertion's expected value will be a fresh function call, NOT the value the body sent. Static literals or `{{TYPE_B_extract}}` only.
  ✗ GET / DELETE / HEAD / PUT / PATCH URL (path or query) — those target an existing resource; the URL must encode the SAME id the creating POST used. Use a TYPE B extract from the creating step.
  ✗ Path/query of a downstream step's URL when the value should match what the upstream step inserted — same reason.

CANONICAL PATTERN — read-after-write with a stable id:
  Step 0 (POST creates resource):
    body:    {"user_id":"{{genUserId}}","name":"widget"}      // TYPE A in body — fresh per run, OK
    extract: {"genUserId":"function genUserId(){...}",       // TYPE A — body source
              "createdId":"$.order.user_id"}                  // TYPE B — capture server's stored id
    assert:  [{type:"status_code", expected:"201"},
              {type:"json_equal", key:"$.order.user_id", expected:"{{createdId}}"}]   // TYPE B in assertion, stable

  Step 1 (GET reads it back):
    url:     "/api/orders?user_id={{createdId}}"              // TYPE B in GET URL — stable
    assert:  [{type:"json_equal", key:"$.orders.0.user_id", expected:"{{createdId}}"}]    // TYPE B — stable

DO NOT do this (the validator will reject it):
  Step 0:
    body:    {"user_id":"{{genUserId}}"}
    assert:  [{type:"json_equal", key:"$.order.user_id", expected:"{{genUserId}}"}]   // generator-placement check — TYPE A in assertion
  Step 1:
    url:     "/api/orders?user_id={{genUserId}}"              // generator-placement check — TYPE A in GET URL
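
The placement rules lend themselves to a local pre-check. The sketch below approximates the generator-placement checks from the text alone. The real validator may apply further rules, and `placementViolations` is a hypothetical helper name.

```javascript
const PLACEHOLDER = /\{\{\s*([A-Za-z_$][\w$]*)\s*\}\}/g;
// Methods whose URL must target an existing resource, per the FORBIDDEN list.
const STABLE_URL_METHODS = new Set(["GET", "DELETE", "HEAD", "PUT", "PATCH"]);

const used = (text) => [...String(text ?? "").matchAll(PLACEHOLDER)].map((m) => m[1]);

function placementViolations(steps) {
  // Every extract key whose value is JS-function source is a TYPE A generator.
  const generators = new Set();
  for (const step of steps) {
    for (const [key, value] of Object.entries(step.extract || {})) {
      if (/^\s*function\b/.test(value)) generators.add(key);
    }
  }
  const violations = [];
  steps.forEach((step, index) => {
    for (const assertion of step.assert || []) {
      for (const name of used(assertion.expected)) {
        if (generators.has(name)) violations.push(`step ${index}: {{${name}}} in assert expected`);
      }
    }
    if (STABLE_URL_METHODS.has(step.method)) {
      for (const name of used(step.url)) {
        if (generators.has(name)) violations.push(`step ${index}: {{${name}}} in ${step.method} URL`);
      }
    }
  });
  return violations;
}

// The rejected example from the text, with the generator declared on step 0.
const rejected = [
  {
    method: "POST", url: "/api/orders", body: '{"user_id":"{{genUserId}}"}',
    extract: { genUserId: "function genUserId(){ return 'u_'+Date.now(); }" },
    assert: [{ type: "json_equal", key: "$.order.user_id", expected: "{{genUserId}}" }],
  },
  { method: "GET", url: "/api/orders?user_id={{genUserId}}" },
];
console.log(placementViolations(rejected).length); // prints 2
```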

Name-collision check — do NOT pick an `extract` key that already exists on the app's appLevelCustomVariables. Use get_app_testing_context (or check the app payload) to enumerate them first; if a collision is unavoidable, scope-suffix your key (e.g. `genUserId_smokeTest`). Otherwise the runtime resolves the app-level variable first and silently shadows the suite's extract.
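
A collision scan is easy to run once the app-level variable names are known. This sketch assumes those names have already been collected into a plain array (the exact response shape of get_app_testing_context is not shown here); `collidingKeys` is a hypothetical helper.

```javascript
// Suggest a scope-suffixed name for every extract key that collides
// with an app-level custom variable.
function collidingKeys(steps, appLevelNames, suffix) {
  const taken = new Set(appLevelNames);
  const suggestions = {};
  for (const step of steps) {
    for (const key of Object.keys(step.extract || {})) {
      if (taken.has(key)) suggestions[key] = `${key}_${suffix}`;
    }
  }
  return suggestions;
}

const suite = [
  { extract: { genUserId: "function genUserId(){ return 'u_'+Date.now(); }", createdId: "$.order.user_id" } },
];
console.log(collidingKeys(suite, ["genUserId", "baseUrl"], "smokeTest"));
// prints { genUserId: 'genUserId_smokeTest' }
```

The helper only reports suggestions. Renaming a key also means updating every `{{placeholder}}` that refers to it, which is left to the author.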

You can also construct steps from data fetched via download_recording or get_app_testing_context, but the validate-locally-before-inserting rule still applies.

CRITICAL — READING EXISTING SUITES: when the data you fetch via getTestSuite / get_app_testing_context / download_recording shows steps with `extract_variables` populated, that's the runtime's POST-EXECUTION SNAPSHOT (resolved values the runner wrote back for UI display). It is NOT what was authored. Treating that field as a copy-paste template makes the validator's extract_variables-input rejection reject every suite you produce. To replicate the authored behavior: copy every entry into `extract` instead, preserving keys and values verbatim (JS-function strings stay JS, JSONPath strings stay JSONPath). When in doubt: `extract_variables` is read-only output state; `extract` is input.

===== FROM-SCRATCH SCOPE RULE =====

When the dev asks to "generate / create / add / build keploy tests" without narrowing the scope, DEFAULT = ALL ENDPOINTS. Enumerate every non-trivial endpoint the app exposes (OpenAPI spec, router code, handler files) and author ONE suite per logical grouping — e.g. "user-crud", "auth-flow", "order-happy-path", "order-validation-errors". A single-endpoint app might produce one suite; a typical microservice produces 3-8.

Groupings should be READ-AFTER-WRITE coherent (each suite's steps chain via extract variables rather than depending on outside state). TELL THE DEV up-front how many suites you're about to create and what each covers, then proceed — do NOT ask for confirmation mid-flow. If the dev explicitly narrows it ("just the happy path for orders", "only the auth flow"), honor that.

===== HARD RULE — NO DB-STATE-DEPENDENT STEPS =====

Do NOT include any step whose response body depends on total DB / queue / file-system state. Concretely: GET /items with no filter, GET /orders with no user_id filter, any "list-all" / "count" / "search" that returns more rows the longer the app has been running, any endpoint returning the CURRENT time / UUID / request-id.

The auto-replay after record byte-compares the recorded response with what the live app returns under mocks, and non-deterministic bodies make gate 2 skip → the suite is never linked → sandbox replay fails with "no sandboxed tests". If you find yourself reasoning "this list might vary slightly but the mock should handle it" — STOP and drop the step.

The suite should contain ONLY steps whose response body is fully determined by that step's own request: health checks, create-with-fresh-ids, read-back-by-id for the ids you just minted, validation-error 400s on bad payloads. Filtered reads using a freshly-extracted id are fine; unfiltered reads are not.

===== SERVER-SIDE IDEMPOTENCY ENFORCEMENT =====

This tool REJECTS (with a typed error, before creating anything) any suite whose mutating steps aren't idempotent on re-run. The rule:

  For each step with method ∈ {POST, PUT, PATCH} and a non-empty body, the body MUST contain at least one `{{name}}` placeholder that resolves to a per-run dynamic value. "Dynamic" means one of:
    (a) a JS-function entry in any step's `extract` map (e.g. {"genUserId": "function genUserId(){ return 'u_'+Date.now()+'_'+Math.random().toString(36).slice(2,8); }"}), OR
    (b) a JSONPath extract output from an EARLIER step (transitively dynamic if that step's own body was idempotent).

If the endpoint is GENUINELY idempotent (e.g. POST /auth/refresh, PUT /tags/apply-same-input — repeat calls don't hit unique constraints) set `"idempotent": true` on the step to waive the check. Use this sparingly — the default is "assume it'll collide" because that's the common case.

On rejection the error names the offending step and lists the dynamic variable names already in scope so you can see what's wireable. Fix the suite and retry — do NOT just flip `idempotent: true` to bypass.
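
The rule can be approximated locally to predict a rejection before calling the tool. This is a sketch of the rule as worded above, not the server's implementation. It simplifies clause (b) by treating every JSONPath output from an earlier step as dynamic, and `idempotencyViolations` is a hypothetical name.

```javascript
const PLACEHOLDER = /\{\{\s*([A-Za-z_$][\w$]*)\s*\}\}/g;
const MUTATING = new Set(["POST", "PUT", "PATCH"]);
const isGenerator = (value) => /^\s*function\b/.test(value);

function idempotencyViolations(steps) {
  // (a) JS-function entries count from ANY step's extract map.
  const generators = new Set();
  for (const step of steps) {
    for (const [key, value] of Object.entries(step.extract || {})) {
      if (isGenerator(value)) generators.add(key);
    }
  }
  // (b) JSONPath outputs count only after an EARLIER step has produced them.
  const earlierJsonPath = new Set();
  const violations = [];
  steps.forEach((step, index) => {
    const needsCheck =
      MUTATING.has(step.method) && Boolean(step.body) && step.idempotent !== true;
    if (needsCheck) {
      const names = [...String(step.body).matchAll(PLACEHOLDER)].map((m) => m[1]);
      const dynamic = names.some((n) => generators.has(n) || earlierJsonPath.has(n));
      if (!dynamic) {
        // Mirror the described error: name the step and list what is wireable.
        violations.push({ index, name: step.name, inScope: [...generators, ...earlierJsonPath] });
      }
    }
    for (const [key, value] of Object.entries(step.extract || {})) {
      if (!isGenerator(value)) earlierJsonPath.add(key);
    }
  });
  return violations;
}

const suite = [
  { name: "static-create", method: "POST", body: '{"user_id":"alice"}' },
  { name: "refresh", method: "POST", body: '{"token":"t"}', idempotent: true },
];
console.log(idempotencyViolations(suite).map((v) => v.name)); // prints [ 'static-create' ]
```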

===== MANDATORY OUTPUT — Phase 1 section =====

After all create_test_suite calls in a FROM-SCRATCH flow succeed, your final message to the dev MUST contain a section with this exact heading (do NOT collapse into prose; emit even for a single suite):

    ### Phase 1 — Inserted suites
    | Suite name | suite_id | Step count |
    | --- | --- | --- |
    | <name> | <suite_id> | <N> |

One row per suite created in this flow. Next step after Phase 1 is record_sandbox_test (see its description for Phase 2).

===== HOW THIS TOOL ACTUALLY INSERTS THE SUITE =====

This tool DOES NOT POST the suite to api-server itself. It returns a "playbook" — a small array of shell steps for you (Claude) to walk via Bash. The playbook spawns the enterprise CLI `keploy create-test-suite` which:
  1. Reads the suite JSON the playbook wrote to disk.
  2. Runs every static structural check — exits 1 with violations on stdout if anything fails.
  3. Fires the suite against the dev's local app TWICE (idempotency check) — exits 1 if the second run diverges from the first.
  4. Runs dynamic checks (generator-dynamism + GET-coupling) — exits 1 with violation messages on failure.
  5. POSTs the validated suite to api-server (HTTP 201 → success; HTTP 426 → CLI is older than api-server's rule set, dev needs to upgrade `keploy`).

Walk the playbook in order. If step 2 (the CLI run) exits non-zero, surface its stdout to the dev — it lists the offending step / check / fix-it hint and includes a canonical step skeleton on structural failures. ITERATE LOCALLY: revise the JSON in your draft, REWRITE the same suite file via Bash, and RE-RUN step 2 directly. DO NOT call create_test_suite again per iteration — that mints a fresh playbook and a new nonce-path for no reason; the existing one is reusable. The CLI ALSO requires every step to have `response` and `extract` populated (step completeness check plus the validate-locally rules above), so the validate-locally curl flow described earlier is still required BEFORE calling this tool.

PREREQUISITES the playbook assumes:
  * The dev's app is up and reachable at app_url.
  * `keploy` binary is on PATH. If missing, install before calling this tool: `curl --silent -O -L https://keploy.io/install.sh && source install.sh`.
  * Either ~/.keploy/cred.yaml exists (API key) or KEPLOY_API_KEY is exported. The CLI uses the API key for the api-server POST (different from the OAuth-JWT path the sandbox tools use). Endpoint: https://api.keploy.io/client/v1/mcp
- deleteApp - DELETE /apps/{appId} — Delete an app — Requires scope: `admin`. Endpoint: https://api.keploy.io/client/v1/mcp
- deleteMock - DELETE /apps/{appId}/recordings/{testSetId}/mocks/{mockId} — Drop one mock — Idempotent — returns 200 even if the mock is already gone. Path
`{mockId}` accepts both the UUID `_id` and the human-readable
Name (resolved within the test set).

Branch-aware via optional `branch_id` query param.

Requires scope: `write`.
Endpoint: https://api.keploy.io/client/v1/mcp
- deleteTestSuite - DELETE /apps/{appId}/test-suites/{suiteId} — Delete a test suite — Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- delete_mock - Drop one mock from a test set. Idempotent — returns 200 even if the mock is already gone, so retries on a network blip are safe.

Accepts UUID or Name as `mock_id`. `branch_id` is REQUIRED — the delete lays a tombstone overlay on the branch (mergeable). Direct deletes from main via MCP are blocked. Endpoint: https://api.keploy.io/client/v1/mcp
- delete_recording - Wholesale-delete a recording (test set + its cases + mocks + mapping). `branch_id` is REQUIRED — the delete lays a tombstone overlay on the branch (mergeable). Direct deletes from main via MCP are blocked.

Returns { deleted: true } on success, 404 when the (app_id, test_set_id) tuple doesn't resolve to a recording. Endpoint: https://api.keploy.io/client/v1/mcp
- delete_test_suite - Delete a test suite on a Keploy branch — synchronous, no playbook to walk.

USE THIS when:
  * The dev's update_test_suite call was rejected with "preserves no steps from the existing suite — that's a full rewrite, not an edit". Delete the existing suite and re-author from scratch via create_test_suite. The error message itself routes here.
  * The dev explicitly says "delete the suite", "remove suite X", "wipe my orderflow suite".
  * A genuine wholesale redesign — every step changed in shape — that the audit trail shouldn't try to reconcile as edits.

DO NOT USE THIS when:
  * The dev wants a real edit (one assertion, one step's body). Use update_test_suite + preserve existing step IDs instead — keeps audit history intact.
  * The dev wants to "redo" a single failed run. Test runs are independent of suite state; just rerun via replay_test_suite.

INPUT
  * app_id (required) — Keploy app id
  * suite_id (required) — UUID of the suite to delete
  * branch_id (required) — Keploy branch UUID. The delete creates a branch-scoped DeleteTestSuite audit event so reads on the same branch see the suite as gone. Direct main writes are blocked.

OUTPUT
  * On success: {"deleted": true} — suite is tombstoned at the branch overlay; subsequent reads (getTestSuite / listTestSuites) on this branch return 404 / exclude it.
  * 404 if the suite_id doesn't exist on this app/branch (verify via getTestSuite or listTestSuites first if you're unsure).

After delete, the standard re-create flow is to call create_test_suite with a freshly authored steps_json. The new suite gets a fresh suite_id; the old id is tombstoned, not reusable.

═══════════════════════════════════════════════════════════════════
DISCOVERY — when the dev hands you a bare suite_id with no app_id / branch_id:
═══════════════════════════════════════════════════════════════════

Suites live on a (app_id, branch_id) tuple. A bare suite_id has no on-disk hint about which app or branch holds it; you have to RESOLVE both before calling this tool. Walk these steps in order — STOP as soon as getTestSuite returns 200:

  1. Detect the dev's git branch:  Bash `git rev-parse --abbrev-ref HEAD` in app_dir.
     If exit non-zero / output is "HEAD" → not a git repo / detached HEAD; ASK the dev for the Keploy branch name (don't invent one).

  2. Resolve candidate apps via the cwd basename:  Bash `basename "$(pwd)"` → call listApps with q=<basename>. Usually 1–2 candidates. If 0 → ASK; if >1 → walk every candidate through steps 3–4.

  3. For each candidate app, call list_branches({app_id}) and find the branch whose `name` matches the git branch from step 1. That gives you {branch_id}. If no match → not this app, try next.

  4. Verify with getTestSuite({app_id, suite_id, branch_id=<from step 3>}). 200 → resolved; 404 → wrong app/branch, try next.

  5. If steps 2–4 are exhausted without a match, walk every OPEN branch on each candidate app, then try main (branch_id omitted). If still nothing → ASK the dev for the {app_id, branch_id} pair.
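
The lookup order in steps 3 to 5 can be summarised as a pure function over data that has already been fetched. This is only an illustration of the ordering: in practice each lookup is an MCP tool call made by the agent. The `branches`, `id`, `name`, and `open` fields and the `suiteExists` callback (standing in for getTestSuite returning 200) are assumptions, not the documented response shapes.

```javascript
// branchName : Keploy branch name, from git or supplied by the dev (step 1).
// candidates : apps from listApps, each with its list_branches result attached.
// suiteExists(appId, branchId) : true when getTestSuite would return 200.
function resolveSuiteLocation(branchName, candidates, suiteExists) {
  // Steps 3-4: try the branch whose name matches, on each candidate app.
  for (const app of candidates) {
    const branch = app.branches.find((b) => b.name === branchName);
    if (branch && suiteExists(app.app_id, branch.id)) {
      return { app_id: app.app_id, branch_id: branch.id };
    }
  }
  // Step 5a: every open branch on each candidate app.
  for (const app of candidates) {
    for (const branch of app.branches.filter((b) => b.open)) {
      if (suiteExists(app.app_id, branch.id)) {
        return { app_id: app.app_id, branch_id: branch.id };
      }
    }
  }
  // Step 5b: main, expressed as an omitted branch_id.
  for (const app of candidates) {
    if (suiteExists(app.app_id, undefined)) return { app_id: app.app_id, branch_id: undefined };
  }
  return null; // nothing resolved: ask the dev for the pair
}

const apps = [
  { app_id: "app-1", branches: [{ id: "b-1", name: "feature/x", open: true }] },
  { app_id: "app-2", branches: [{ id: "b-2", name: "feature/x", open: true }] },
];
const found = resolveSuiteLocation("feature/x", apps, (appId, branchId) => appId === "app-2" && branchId === "b-2");
console.log(found); // prints { app_id: 'app-2', branch_id: 'b-2' }
```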

After resolving once in a session, REUSE the {app_id, branch_id} for subsequent suite-targeted calls; don't re-walk discovery for every action. Endpoint: https://api.keploy.io/client/v1/mcp
- devloop_begin_oauth_install - Start the Path B agent-walked install for the V1 flow. Returns a playbook the AI walks with the dev to land a kep_* PAT in their editor's MCP config.

The current implementation routes through app.keploy.io/settings/api-keys (the existing PAT-issuance surface). The full auto-provision OAuth round-trip (browser OAuth → /auth/temp-code/exchange → /user/mcp/provision → PAT auto-pasted) lands when the enterprise-ui /auth/cli-install page is built; this tool's interface won't change when that happens, only the underlying steps will get shorter.

ABSOLUTE: this tool surfaces a public URL. Do NOT print API keys, PATs, or OAuth tokens to the conversation. The dev pastes the secret into their editor config directly; the agent never sees it. Endpoint: https://api.keploy.io/client/v1/mcp
- devloop_detect_app - Resolve a Keploy app and return a playbook the AI should walk to inspect the dev's repo for V1 ("user maintains the flow") test generation.

This MCP server has no filesystem access — the AI is the inspector. The tool returns: the resolved app metadata (id, name, base_url, storage_mode) plus a `detection_playbook` field that lists exactly what the AI should look for in app_dir (framework signatures, top-N candidate resources by handler count, port hints, existing keploy/api-tests/ directory). The AI uses its native read/grep tools to answer those questions, then calls devloop_generate_resource_flow with the findings.

Resolution model — pass EITHER:
  * app_id (UUID) — exact match, fast path.
  * app_name_hint (case-insensitive substring) — matches against existing apps in your tenant; resolves to one app or errors with the candidate list when ambiguous.

If storage_mode is unset, the DEVLOOP decision-gate defaults require devloop_resolve_storage to run first. This tool does not re-ask the storage question — it just surfaces whatever devloop_resolve_storage already persisted (or "" if neither has run yet). Endpoint: https://api.keploy.io/client/v1/mcp
- devloop_expand_coverage - Expand V1 API-test coverage from the single seed flow to the remaining detected resources. Use this AFTER devloop_mutation_demo has surfaced a positive catch result for the FIRST resource — that's the "manufactured proof" gate the dev needed before agreeing to scale.

Returns a procedure that loops over the dev-approved candidates:
  for each resource:
    devloop_generate_resource_flow(app_id, resource, app_dir, base_url)
    ASK dev: continue / stop / pick a different resource
  end

Mutation demo is NOT in the per-resource loop. Once a session has seen
mutation_demo run on the seed resource and prove its catch behavior,
re-firing it for every new resource produces busywork. The dev opts in
to mutation-on-expanded-resources via the post-expand multi-option
menu (see DevloopInstructionsAddendum "After devloop_expand_coverage"),
not as a default step inside this loop. If the dev asks "mutate this
too" mid-expand, fine — fire devloop_mutation_demo on that resource
on demand.

Stop conditions:
  * Dev says "stop" / "enough" / "later" → exit cleanly.
  * Any generate step errors → surface to dev, ASK whether to retry, skip, or stop.

DO NOT silently run all candidates without dev confirmation per
resource — the DEVLOOP decision-gate defaults explicitly require an
opt-in between each resource, because (a) the dev may want to inspect
each test before approving the next, and (b) a tangentially-named
candidate may be the wrong fit and the dev wants to swap. Endpoint: https://api.keploy.io/client/v1/mcp
- devloop_generate_resource_flow - Generate one chained-CRUD API test for a single resource. Behavior depends on the app's devloop_storage_mode (set this first via devloop_resolve_storage / devloop_set_storage_mode):

  * repo mode → returns a PLAYBOOK for you to walk. Steps: (1) run "keploy test-gen generate-from-code --app-dir <dir> --resource <name>" to scaffold the directory + empty config.yaml; (2) use your Write tool to author keploy/api-tests/<resource>/test.yaml using the schema returned by devloop_detect_app; (3) run "keploy test-gen run --test-dir keploy/api-tests --suite <Name>_CRUD --base-url <url> --ci" to verify the test parses and passes; (4) call devloop_mutation_demo next (auto, per the DEVLOOP instructions).

  * cloud mode → returns guidance to call the existing create_test_suite tool instead. The repo-mode playbook is NOT used in cloud mode.

ARGUMENTS — you should already have these from your devloop_detect_app call:
  * app_id, resource, app_dir, base_url, framework, handler_files. If any are missing, call devloop_detect_app again.

The tool does NOT generate the YAML body itself — you do, using the schema from devloop_detect_app's detection_playbook. This is intentional: ATG quality depends on the AI seeing the actual handler implementations (which it can read via its own tools) far better than a server-side generator could. Aim for ≤ 30 lines per test.yaml, idempotent mutating steps, chained extract/{{var}} flow. Endpoint: https://api.keploy.io/client/v1/mcp
- devloop_install_repo_artifacts - Generate the agent-discovery files that let dev #2 onboard via their AI agent without a human pointing at docs. The doc-stated "loop closure" step.

Emits THREE files (the AI writes them with its Write tool):

  * `.claude/skills/keploy/SKILL.md` — Claude Code skill. Has YAML frontmatter with a `description:` that triggers the skill semantically when the dev mentions Keploy / sandbox tests / mocks / api-tests / etc. Body describes the on-disk layout + how to drive Keploy via the MCP. Lazy-loaded — only burns context when the matcher fires.
  * `.cursor/rules/keploy.md` — Cursor MDC equivalent for Cursor users (Cursor doesn't read .claude/skills/).
  * `CLAUDE.md` — a tiny 3-line POINTER that says "Keploy is configured here; see .claude/skills/keploy/SKILL.md for details." Caters to editors that don't yet understand skills but DO read CLAUDE.md.

Idempotent. Re-running this tool against the same repo replaces the Keploy skill body in-place (overwrite-safe — we own the file entirely; the user is not expected to hand-edit it). The pointer in CLAUDE.md uses upsert semantics — only the Keploy line is touched; everything else in the file is preserved.

Returns the file paths + content + per-file action. The AI uses its Write tool per the `action` field. Endpoint: https://api.keploy.io/client/v1/mcp
- devloop_mutation_demo - Prove the just-generated API test actually catches bugs by applying 3 real source-level mutations to the handler, running the test against each, and reverting. The doc-stated "manufactured proof in the first session" moment.

OPT-IN, NOT OPT-OUT — this tool TOUCHES THE DEV'S SOURCE FILES (temporarily). Always ASK the dev for explicit consent before walking the playbook: "I'll apply 3 small temporary changes to <handler file> to prove the test catches them, then revert every change. Proceed?" Only run the playbook on "yes".

What the playbook does:
  1. Identify the handler file(s) the test exercises by reading <app_dir>/keploy/api-tests/<resource>/test.yaml and grepping for the route paths in the dev's code.
  2. Pick 3 concrete mutations the test assertion set should catch — e.g. change a response field's type (Name string → Name int), rename a field (email → mail), remove a field. Choose mutations that map to fields the test ACTUALLY asserts on (read the suite's assertions to inform the pick).
  3. For each mutation: apply via Edit, restart the dev's app if needed (hot-reload usually handles this), run keploy test-gen run, capture pass/fail, REVERT via Edit before moving to the next mutation.
  4. Run a final "git diff -- <handler file>" to verify all reverts succeeded. If non-empty, HALT and ask the dev to run "git checkout <file>" before continuing.
  5. Report: "I made 3 small changes, your test caught M/3. Caught: [concrete list]. Missed: [concrete list, with recommendation]."

ABSOLUTE RULES:
  * Revert is non-negotiable. The dev's working tree must be clean at the end.
  * Never modify test.yaml, config files, or anything outside the handler source(s) for this resource.
  * Never run more than 3 mutations in one playbook (more is noise, fewer is unconvincing).
  * If you can't identify a clear handler file, ASK the dev rather than guessing.

When the dev says "expand coverage to the other resources" → call devloop_expand_coverage next. Endpoint: https://api.keploy.io/client/v1/mcp
- devloop_record_sandbox - Record mocks for V1 repo-mode API tests using the V1-native CLI command `keploy sandbox local record`. Runs the dev's app under the keploy eBPF agent, drives the V1 chained-CRUD tests from `keploy/api-tests/<resource>/test.yaml`, captures every outbound call (DB queries, Redis ops, downstream HTTP) as mocks, and lays them out at `<app_dir>/keploy/<suite-name>/{tests/, mocks.yaml, config.yaml}` in the standard OSS test-set tree. On success, mocks upload to the Keploy canonical pool by content hash; the hash lands in config.yaml so a teammate's later replay fetches the same bytes.

CRITICAL — DO NOT CONFUSE WITH `keploy record sandbox`:
  * `keploy sandbox local record` (V1, repo-mode) ← this is what the playbook below uses
  * `keploy record sandbox` (legacy, cloud-mode) ← DO NOT call this for V1
  The two are entirely different commands. Cloud-mode requires server-side suites (queried via --suite-ids) — V1 repo-mode reads tests from the local filesystem and never registers them in the cloud. If the dev is in repo storage mode (verify via devloop_resolve_storage's source=persisted, mode=repo), V1 is the ONLY correct sandbox path.

STRICT — TIME-FREEZING DOES NOT APPLY TO RECORD.
Recording MUST use the dev's regular (prod) Dockerfile or native binary. NEVER spawn the app via Dockerfile.keploy / "-f docker-compose.keploy.yml" / "-tags=faketime" build during record. The faketime binary writes wrong timestamps into captured mocks (it reads time from the offset file, not the wall clock) and the entire capture becomes corrupt — recovery requires re-recording from scratch with the prod binary. If a previous replay failed with expired-JWT and the dev wants to "fix" it, the fix is to re-RUN the replay with --freezeTime, NOT to re-record. The recorded mocks captured against the prod binary are exactly what replay's clock-rewind is designed to validate; touching the record path defeats the whole mechanism.

ONLY call this with an explicit dev opt-in. The valid triggers:
  * Dev directly asks ("capture mocks", "sandbox record", "rerecord the users mocks").
  * Post-resource menu (Step 5 of devloop_generate_resource_flow) — dev picks "Capture mocks so CI runs in seconds".
  * get_session_report shows mock_mismatch_dominant=true AND the dev says yes to your "rerecord?" prompt.

Pre-conditions:
  * Dev's app must NOT already be running (keploy spawns its own copy of the app under the agent's eBPF hooks via the -c command). If a server is up at the target port, KILL IT first or the agent's network capture won't see the traffic.
  * Real downstream deps (MySQL, Redis, Kafka, etc.) MUST be running — the capture proxies through to them on first contact so the recorded mocks contain real responses.
  * The test YAML must exist at <app_dir>/keploy/api-tests/<resource>/test.yaml.

Returns a playbook for `keploy sandbox local record` with the V1 flag surface: --test-dir, --app-url, -c (spawn command), --container-name (docker-compose only), --skip-mock-upload (offline), --skip-report-upload (offline). Mocks land per-suite at keploy/<suite-name>/. NDJSON progress at --progress-file for the standard tail-til-done loop. Endpoint: https://api.keploy.io/client/v1/mcp
- devloop_resolve_storage - Resolve the storage mode for V1 ("user maintains the flow") API tests on this app.

═══════════════════════════════════════════════════════════════════
**MUST BE YOUR FIRST MCP CALL** for ANY of these dev verbs/intents:
═══════════════════════════════════════════════════════════════════
  * "run the sandbox tests" / "run the API tests" / "test sandbox" / "run keploy tests"
  * "record the sandbox" / "rerecord" / "refresh the mocks" / "capture mocks"
  * "replay the sandbox" / "replay the tests" / "show me the report" / "what failed in the last run"
  * "generate keploy tests" / "add a keploy test for <endpoint>"
  * "set up keploy in this repo" / "onboard this service to keploy"
  * any other reference to keploy/api-tests/, sandbox tests, integration tests, mocks, suites

REASON: this is the gate that determines whether the app is on the V1 (repo-mode) code path
or the legacy cloud-mode code path. **The two paths use entirely different MCP tool surfaces**:

  ┌───────────────────────┬─────────────────────────────────────────────────────────┐
  │ Storage mode          │ Tools to use                                            │
  ├───────────────────────┼─────────────────────────────────────────────────────────┤
  │ "repo"                │ devloop_* tools only. NO cloud-mode tools.              │
  │                       │ (record_sandbox_test, replay_sandbox_test,              │
  │                       │  replay_test_suite, create_test_suite, list_branches,   │
  │                       │  get_app_testing_context, listTestSuites etc. will      │
  │                       │  REFUSE with a redirect to the V1 surface.)             │
  ├───────────────────────┼─────────────────────────────────────────────────────────┤
  │ "cloud" or "" (unset) │ Cloud-mode tools (record_sandbox_test,                  │
  │                       │  replay_sandbox_test, replay_test_suite,                │
  │                       │  create_test_suite, list_branches,                      │
  │                       │  get_app_testing_context, listTestSuites, etc.).        │
  │                       │  devloop_* tools may also be called for the V1          │
  │                       │  cloud-mode path.                                       │
  └───────────────────────┴─────────────────────────────────────────────────────────┘

DO NOT SKIP THIS. If you reach for cloud-mode tools first (replay_sandbox_test, list_branches,
listTestSuites, etc.) without calling devloop_resolve_storage, you WILL misroute repo-mode apps
and tell the dev to "upload local tests as suites and record into the cloud" — the EXACT
regression that prompted these MCP-side guardrails. The cloud-mode tools server-side gate on
devloop_storage_mode == "repo" and will refuse the call with a redirect message; devloop_resolve_
storage front-runs that refusal cleanly.
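
The routing table above can be approximated as a client-side pre-check. This is a minimal sketch, not Keploy's implementation: the helper name `check_tool` and its three return labels are hypothetical, and the cloud-mode set holds only the tool names the table spells out (the table ends in "etc.", so the real list is longer). The server-side gate remains the authority.

```python
# Cloud-mode tools named in the routing table. The table says "etc.", so this
# set is deliberately incomplete.
NAMED_CLOUD_TOOLS = frozenset({
    "record_sandbox_test",
    "replay_sandbox_test",
    "replay_test_suite",
    "create_test_suite",
    "list_branches",
    "get_app_testing_context",
    "listTestSuites",
})


def check_tool(storage_mode: str, tool_name: str) -> str:
    """Return "allowed", "refused", or "unknown" for a tool in a storage mode.

    Hypothetical helper: storage_mode is the value reported by
    devloop_resolve_storage ("repo", "cloud", or "" when unset).
    """
    if storage_mode not in ("repo", "cloud", ""):
        raise ValueError(f"unexpected storage mode: {storage_mode!r}")
    if tool_name.startswith("devloop_"):
        return "allowed"  # devloop_* tools are usable in every mode
    if tool_name in NAMED_CLOUD_TOOLS:
        # Repo-mode apps get a refusal plus a redirect to the V1 surface.
        return "refused" if storage_mode == "repo" else "allowed"
    return "unknown"  # not covered by the table; do not guess
```

Returning "unknown" for tools the table does not mention keeps the sketch from claiming more than the description states.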

Resolution order:
  1. If app.devloop_storage_mode is set → return {mode, source: "persisted"}; do NOT re-ask.
  2. Else if the dev's repo (app_dir) already contains keploy/api-tests/ → ATTEMPT to infer repo mode. This tool returns source="asked" with a hint asking you to check the dev's filesystem; if you confirm keploy/api-tests/ exists, call devloop_set_storage_mode({app_id, mode:"repo", reason:"inferred_local_tests_exist"}) and proceed silently.
  3. Else → return {source: "asked"} with the trade-off text in `message`; surface that to the dev, get yes/no, persist via devloop_set_storage_mode.

The AI is responsible for inspecting the repo (this MCP server does not have filesystem access). Use your native filesystem tools (read/grep) to check whether keploy/api-tests/ exists under app_dir.
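
The resolution order above, seen from the agent's side, might look like the following sketch. It assumes the tool result is a dict with the `source` key described above; the function names and the three returned labels are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def local_tests_exist(app_dir: str) -> bool:
    """Check for keploy/api-tests/ under app_dir on the local filesystem."""
    return (Path(app_dir) / "keploy" / "api-tests").is_dir()


def next_step(resolve_result: dict, app_dir: str) -> str:
    """Map a devloop_resolve_storage result to the agent's next action.

    Returns one of three hypothetical labels:
      "proceed"            mode is already persisted; do not re-ask
      "set_repo_silently"  call devloop_set_storage_mode with mode="repo"
                           and reason="inferred_local_tests_exist"
      "ask_dev"            surface `message`, get yes/no, then persist
    """
    if resolve_result.get("source") == "persisted":
        return "proceed"
    if local_tests_exist(app_dir):
        return "set_repo_silently"
    return "ask_dev"
```

The filesystem check lives on the agent side because, as stated above, the MCP server has no filesystem access.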

APP RESOLUTION — the dev should NEVER have to type an app_id. Pass EITHER:
  * app_id (UUID) — exact, fast path. Use this once you've resolved it earlier in the conversation.
  * app_name_hint — a case-insensitive substring of the app name (typically the cwd basename). The tool calls listApps(q=hint) and resolves to a unique match.

If neither is set, the tool errors with the candidate list so you can ask the dev. If app_name_hint matches multiple apps, the error names them and asks you to disambiguate. If no app matches, you propose creating one (call createApp) BEFORE re-running this tool. Endpoint: https://api.keploy.io/client/v1/mcp
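
As an illustration of the matching rules just described (case-insensitive substring, unique match required), here is a small sketch. The real resolution happens inside the tool via listApps; the `resolve_app` name, the `{"id", "name"}` app shape, and the error wording are assumptions made for this example.

```python
def resolve_app(apps, app_id=None, app_name_hint=None):
    """Resolve to one app id, or raise LookupError saying what to do next.

    apps is a list of {"id": ..., "name": ...} dicts (assumed shape).
    """
    if app_id:
        return app_id  # exact, fast path
    if not app_name_hint:
        names = [a["name"] for a in apps]
        raise LookupError(f"pass app_id or app_name_hint; candidates: {names}")
    hint = app_name_hint.lower()
    matches = [a for a in apps if hint in a["name"].lower()]
    if len(matches) == 1:
        return matches[0]["id"]
    if not matches:
        # The description says to propose createApp before retrying.
        raise LookupError(f"no app matches {app_name_hint!r}; propose createApp")
    ambiguous = [a["name"] for a in matches]
    raise LookupError(f"{app_name_hint!r} is ambiguous: {ambiguous}")
```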
- devloop_scaffold_ci - Scaffold the GitHub Actions workflow that runs the V1 API tests on every PR. Returns the exact YAML content to write to .github/workflows/keploy.yml + the Bash command to set the KEPLOY_API_KEY secret. The AI walks the playbook with its Write tool + the `gh` CLI.

PRECONDITIONS — CHECK BEFORE CALLING. Calling this tool out of order is a DEVLOOP violation; the doc-stated user-flow ordering is generate → run → mutation-prove (opt-in) → expand (opt-in) → CI (opt-in). Specifically you must have:

  1. Generated at least one test via devloop_generate_resource_flow AND watched it pass via "keploy test-gen run --ci".
  2. SURFACED the mutation-prove opt-in to the dev verbatim: "Want me to prove the test catches bugs by applying 3 small mutations to your handler and reverting?" — and the dev answered (yes-walked through devloop_mutation_demo, or explicit no/skip/later). Doing the test runs is NOT the same as offering mutation-prove; the offer is a separate dev-facing question.
  3. ASKED the dev "want me to wire this into CI?" — explicit yes from the dev.

If ANY of those three are missing, STOP and back up. The mutation-prove gate is what builds the dev's trust before they commit Keploy to CI; skipping it ships shallow tests into a workflow the dev hasn't validated.

What this tool does NOT do (intentionally — the dev keeps custody):
  * Mint the CI API key server-side. The dev provisions it themselves in the Keploy dashboard (Step 2 of the returned playbook walks them through it). The AI never sees the kep_* value — it transits dashboard clipboard → terminal stdin → gh CLI's encrypted POST. This is a security property, not a limitation.
  * Post structured PR comments from api-server. V1 relies on GitHub Actions' native status-check rendering; the structured comment renderer is a V1.5 lift.

The emitted workflow runs on pull_request (default base branch) and reads app_id / test-dir / context-dir from keploy/api-tests/keploy-test-gen.yaml — the dev never has to thread flags through the workflow.

TIME-FREEZING — DEFAULT ON, ALMOST ALWAYS NEEDED FOR BACKEND APPS.

Almost every backend app has authentication (login → JWT/session/OAuth). The dev's recorded tests carry those tokens in headers. Between record time and the first PR's CI run, real wall-clock time moves past the tokens' exp claims, so CI returns 401 on every authenticated step and the dev blames Keploy. Keploy's time-freezing rewinds the app's clock to the record moment so the recorded tokens validate.

Default policy: time_freezing=true. The AI MUST inspect the dev's test suites BEFORE calling this tool:
  - <app_dir>/keploy/api-tests/<resource>/test.yaml      (V1 sources)
  - <app_dir>/keploy/<SuiteName>/tests/*.yaml            (captured sandbox tests)
Look for: Authorization Bearer headers; steps hitting /login /auth /signin /token /oauth; response bodies containing jwt / token / access_token / refresh_token / expires_in / iat / exp. If any of those signals appear (or you're unsure), keep time_freezing=true. Only pass time_freezing=false when you've audited every suite and confirmed zero time-sensitive tokens (rare for a real backend).

When time_freezing=true, this tool also requires app_language (go / node / python / java / ruby / other) and app_service (docker-compose service name). Output then includes:
  - Modified workflow YAML (pre-populates keploy-sockets-vol; uses -f docker-compose.yml -f docker-compose.keploy.yml; passes --freezeTime)
  - docker-compose.keploy.yml override (volume mount + LD_PRELOAD for non-Go, or Dockerfile.keploy build for Go)
  - Dockerfile.keploy (Go ONLY — vDSO bypasses LD_PRELOAD, requires -tags=faketime rebuild)

The dev's plain "docker compose up" is unaffected. Time-freezing only activates when CI (or the dev locally) explicitly passes both compose files.

TIME-FREEZING IS REPLAY-ONLY — STRICT INVARIANT.
The Dockerfile.keploy / docker-compose.keploy.yml / --freezeTime flag this tool emits exist purely to make recorded JWTs validate at REPLAY time. They MUST NEVER apply when recording. Concretely:
  - Record uses the dev's PROD Dockerfile + plain "docker compose up" (no override file).
  - Replay uses Dockerfile.keploy + "docker compose -f docker-compose.yml -f docker-compose.keploy.yml up" + the --freezeTime flag on the CLI.
If a recording is captured against a faketime-built binary, every timestamp in the captured mocks is wrong and the whole capture is corrupt — there is no recovery short of re-recording from scratch with the prod binary. The CI YAML this tool emits in ci_mode=sandbox-replay is a REPLAY workflow; it boots via the compose override on purpose. The dev's separate record flow (devloop_record_sandbox) must NOT touch the override.

TIME-FREEZING IS FORCED ON FOR ci_mode=sandbox-replay — NON-NEGOTIABLE.
Any explicit time_freezing=false passed alongside ci_mode=sandbox-replay is silently overridden back to true. Rationale: sandbox replay processes the recorded request stream verbatim — any time-sensitive token in any captured request (JWT exp, OAuth iat, session cookie) goes stale the moment wall-clock passes the recorded moment, and silently fails replay. Whether the dev's suite happens to carry such a token is not auditable at scaffold time, and the failure is silent (401 on the first auth-gated step in CI). The cost of force-ON for a hypothetical zero-token app is one dormant volume mount + a no-op CLI flag; the cost of force-OFF for a token-bearing app is every PR failing. Asymmetric — force-ON wins. For ci_mode=api-tests, the workflow runs against live deps with current wall-clock so recorded tokens never enter the picture; time_freezing defaults to false and is overridable by the AI if they want the artifacts pre-staged for a later sandbox switch. Endpoint: https://api.keploy.io/client/v1/mcp
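
The time-freezing policy and the pre-call audit described above can be sketched as two small helpers. This is an assumption-laden illustration: the function names are hypothetical, the regular expression only approximates the listed signals, and only the two `ci_mode` values named in the text are modelled.

```python
import re

# Signals listed in the description above. Word boundaries keep "exp" from
# matching words such as "expected".
_SIGNALS = re.compile(
    r"authorization\W{1,4}bearer"
    r"|/(?:login|auth|signin|token|oauth)\b"
    r"|\b(?:jwt|token|access_token|refresh_token|expires_in|iat|exp)\b",
    re.IGNORECASE,
)


def has_time_sensitive_signals(suite_text: str) -> bool:
    """True when the suite text shows any auth or token signal."""
    return bool(_SIGNALS.search(suite_text))


def effective_time_freezing(ci_mode: str, requested=None) -> bool:
    """Apply the stated policy to a requested time_freezing value.

    sandbox-replay: forced on, an explicit False is overridden.
    api-tests: defaults to False, an explicit value is honoured.
    """
    if ci_mode == "sandbox-replay":
        return True
    if ci_mode == "api-tests":
        return False if requested is None else bool(requested)
    raise ValueError(f"ci_mode not covered by this sketch: {ci_mode!r}")
```

The scanner leans toward false positives on purpose, which matches the guidance to keep time_freezing=true whenever there is doubt.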
- devloop_schema_drift_report - Report the gap between the app's OpenAPI spec and its V1 API-tests coverage. Reuses the existing schema-coverage endpoint that get_coverage_gaps consumes, but frames the result for V1: which endpoints are uncovered, the recommended order to generate tests for them next (highest-fan-in first), and the count gap.

V1 covers the "spec-vs-tests" half of drift. The deeper "spec-vs-actual-handler-behavior" half (run the live app, diff responses against the OpenAPI spec) is a V2 lift — defer until V1 telemetry shows it would unblock real dev journeys. Endpoint: https://api.keploy.io/client/v1/mcp
- devloop_self_heal - Diagnose a failing V1 API test and propose an action. Triggered when:
  * keploy test-gen run exited non-zero in a fresh agent session (the dev's code changed since the test was authored).
  * A PR Keploy comment surfaced a failure and the dev says "fix it".

Action classes (mutually exclusive — the CLI picks one):
  * auto_update — diff is an unambiguous rename or refactor (same value moves to a different field name, or a step name changed without semantic effect). The AI can apply the proposed patch directly.
  * ask_user — the diff is ambiguous (a field disappeared, a value changed in a way that could be intentional). The AI surfaces the proposed alternatives and waits for the dev.
  * revert_suggestion — the diff looks like a real regression (status code drifted, expected response shape is incompatible). The AI suggests the dev revert the offending code; does NOT silently update the test.

HARD RULE: never rewrite an assertion to make a red test green. The CLI enforces this by refusing to emit an auto_update for any change that weakens an assertion. The AI must also enforce this — if the CLI's proposal includes "change expected_string from X to Y" where X was the correct contract, REJECT the proposal and ask the dev. Endpoint: https://api.keploy.io/client/v1/mcp
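
One way to picture how an agent might layer the hard rule on top of the three action classes is the sketch below. The `handle_proposal` name, the `weakens_assertion` flag (a judgement the agent makes after reading the proposed patch), and the returned labels are all hypothetical.

```python
_NEXT_MOVE = {
    "auto_update": "apply_patch",           # unambiguous rename or refactor
    "ask_user": "ask_dev",                  # ambiguous diff, dev decides
    "revert_suggestion": "suggest_revert",  # looks like a real regression
}


def handle_proposal(action: str, weakens_assertion: bool) -> str:
    """Pick the agent's next move for a self-heal proposal."""
    if action not in _NEXT_MOVE:
        raise ValueError(f"unknown action class: {action!r}")
    if weakens_assertion:
        # Hard rule: never turn a red test green by loosening the contract.
        return "reject_and_ask_dev"
    return _NEXT_MOVE[action]
```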
- devloop_set_storage_mode - Persist the V1 storage mode on an app. Call this AFTER devloop_resolve_storage returned source="asked" (the dev gave you a yes/no) or source="inferred" (you confirmed keploy/api-tests/ exists on disk).

Switching repo→cloud when local tests exist warns the dev that the existing local tests will be orphaned (Keploy will start sourcing tests from MongoDB; the keploy/api-tests/ files are no longer the source of truth). Surface the warning before calling this tool with mode="cloud" against a repo that has local tests. Endpoint: https://api.keploy.io/client/v1/mcp
- devloop_setup_instructions - Return the V1 "user maintains the flow" install guide — covers Path A (clipboard one-liner), Path B (agent-walked OAuth), and the keploy/api-tests/ on-disk layout V1 creates. Safe to call without authentication; call this when the dev asks 'how do I add Keploy to this repo?' or when a devloop_* tool returns missing-API-key. Endpoint: https://api.keploy.io/client/v1/mcp
- devloop_switch_to_sandbox - Switch the app's V1 CI from "boot the real app + deps" mode to sandbox mode (mocks fetched by content-hash from the cloud canonical pool). The doc-stated trigger: ~1 week after CI is wired, when the dev has felt the slow runs / flakes and you can pitch "your CI takes 90s and flaked twice this week — rerecord mocks and CI drops to ~8s."

What flips:
  * The CI workflow YAML gets a --sandbox flag on `keploy test-gen run` and the docker-compose-up step removed. This tool returns the updated YAML; you re-PR it.

Pre-condition: every resource you want in CI must have recorded mocks (config.yaml.mockRegistry.mock populated). Resources without mocks will fail in sandbox mode because there's nothing to serve. Run devloop_record_sandbox per resource first; verify via devloop_schema_drift_report-style checks before proposing the switch. Endpoint: https://api.keploy.io/client/v1/mcp
- download_recording - Download a recording — a session of captured API traffic (request/response pairs + outbound mocks) stored as a test_set. Recordings are INPUT artifacts captured by `keploy record`: they're raw traffic that AI generation (generate_and_wait) and manual create_test_suite flows turn into test suites. Use this to inspect what was captured before deciding how to turn it into suites.

NOT a sandbox-test export. A "sandbox test" (the suite + its captured mocks, produced by record_sandbox_test) lives behind a suite's `test_set_id` link — to inspect a suite, use getTestSuite for the step shape or replay_sandbox_test to see behavior. For the suite's mock bundle, do this two-step: (1) call getTestSuite to read the suite's test_set_id; (2) call listMocks({app_id, test_set_id}) (use ?include_specs=true to also fetch parsed mock YAML). Or for raw mock files on disk, point the dev at the artifact directory printed by record_sandbox_test (data.artifact_dir on the phase=done event). Endpoint: https://api.keploy.io/client/v1/mcp
- editMockMapping - POST /apps/{appId}/recordings/{testSetId}/test-cases/{testCaseId}/mock-mapping — Link or unlink a mock from a test case — Targeted mutation of the test case's entry in the mapping doc.
Add appends a mock entry if not already present; remove drops
the entry by name. Both idempotent — safe to retry on a network
blip.

The MCP layer exposes this as TWO tools (`link_mock` /
`unlink_mock`) — they both call this endpoint with the
appropriate `action`. Splitting at the MCP layer keeps each
tool's description tighter and avoids the LLM having to
remember the enum spelling.

Requires scope: `write`.
 Endpoint: https://api.keploy.io/client/v1/mcp
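
The add/remove semantics described for editMockMapping can be modelled locally as below. This is only a sketch of the stated behaviour on a plain list of mock names; the `"add"` and `"remove"` spellings of `action` are assumed from the prose, and the function name is hypothetical.

```python
def edit_mock_mapping(linked: list, action: str, mock_name: str) -> list:
    """Return the new list of linked mock names after one edit."""
    if action == "add":
        # Append only when missing, so a retried call changes nothing.
        return list(linked) if mock_name in linked else [*linked, mock_name]
    if action == "remove":
        # Drop by name; removing an absent name is a no-op.
        return [name for name in linked if name != mock_name]
    raise ValueError(f"unsupported action: {action!r}")
```

Applying the same edit twice yields the same list, which is what makes a retry after a network blip safe.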
- exportRecording - GET /apps/{appId}/recordings/{testSetId}/export — Export a recording bundle — Export a complete recording bundle: test set metadata, all test cases, mocks, and test-to-mock mappings as a single JSON response. Use ?include_mocks=false to exclude mocks. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- generateTestSuites - POST /apps/{appId}/test-suites/generate — Generate test suites via AI — Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- generate_and_wait - Generate test suites from an OpenAPI spec and wait for completion. Endpoint: https://api.keploy.io/client/v1/mcp
- getApp - GET /apps/{appId} — Get an app — Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getCurrentUser - GET /users/me — Get current user — Requires scope: `read`. Returns the user associated with the API key. Endpoint: https://api.keploy.io/client/v1/mcp
- getGeneratedSchema - GET /apps/{appId}/generated-schema — Get auto-generated OpenAPI schema — Returns the OpenAPI schema auto-generated from recorded traffic. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getGenerationHistoryDetails - GET /apps/{appId}/generation-history/{jobId} — Get generation history details — Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getJob - GET /jobs/{jobId} — Get a job — Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getLoadTestReport - GET /apps/{appId}/load-tests/{runId} — Get a load test report — Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getMock - GET /apps/{appId}/recordings/{testSetId}/mocks/{mockId} — Read one mock's canonical YAML — Returns the canonical mock YAML doc (version/kind/name/spec)
for the named mock in the given test set. Branch-aware: when
`branch_id` is supplied, a branch-only upsert or tombstone
takes precedence over main.

Authoring workflow for AI agents: call this BEFORE updateMock
to fetch the existing payload, edit fields locally, then
round-trip the result through `mock_yaml` on updateMock.

Requires scope: `read`.
 Endpoint: https://api.keploy.io/client/v1/mcp
- getMockMapping - GET /apps/{appId}/recordings/{testSetId}/test-cases/{testCaseId}/mock-mapping — Read the mocks currently linked to a test case — Returns the mock entries in the mapping doc for the named
test case. Branch-aware via `branch_id` — when set, the
branch overlay's mapping wins over main's.

Workflow: AI agents should call this BEFORE editMockMapping
to inspect what's linked, then issue targeted add / remove
ops with confidence. Empty result (mocks: []) is normal —
means no mocks linked yet.

Requires scope: `read`.
 Endpoint: https://api.keploy.io/client/v1/mcp
- getRecording - GET /apps/{appId}/recordings/{testSetId} — Get recorded test cases — Returns individual recorded test cases within a test set, including HTTP request/response data. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getSchemaCoverage - GET /apps/{appId}/schema-coverage — Get schema coverage — Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getSubscription - GET /company/subscription — Get company subscription — Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getSuiteReport - GET /apps/{appId}/test-runs/{runId}/suite-reports/{reportId} — Get a suite report — Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getTestCase - GET /apps/{appId}/recordings/{testSetId}/test-cases/{testCaseId} — Get a single test case — Returns a single recorded test case by ID within a recording session. Pass `branch_id` to scope the read to a branch overlay (bundle-uploaded test cases on a branch are invisible to main reads). Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getTestReport - GET /apps/{appId}/test-reports/{reportId} — Get a test run report — Returns a single test run report by ID. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getTestRun - GET /apps/{appId}/test-runs/{runId} — Get a test run — Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getTestSuite - GET /apps/{appId}/test-suites/{suiteId} — Get a test suite — Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getUsage - GET /company/usage — Get company usage — Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- getValidationResult - GET /jobs/{jobId}/validation-result — Get job validation result — Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- get_app_testing_context - Fetch comprehensive context about an app's API plus the canonical test-suite authoring schema. Returns:
  * app — config, auth shape, appLevelCustomVariables (READ THIS for R32 — your extract keys must not collide with these names)
  * coverage — API coverage report (which endpoints have tests, which don't)
  * recordings — summaries of captured traffic sessions
  * test_suites — existing suites (check before authoring to avoid duplicates)
  * generated_schema — AI-extracted OpenAPI for the app
  * step_schema — THE CANONICAL TEST SUITE STEP SCHEMA. Same content as `keploy test-suite-format`, shipped inline so you don't need a separate tool-call hop. Read this BEFORE authoring or curling endpoints — it contains the MANDATORY rule block (R10 / R9 / R2 / R15 / R32) the validator enforces on iter 1, plus the canonical two-step prelude+POST skeleton.
  * authoring_directive — one-line reminder pointing at step_schema.

Call this FIRST when authoring suites. The step_schema field eliminates the most common iter-1 failure (AI authors based on training-data priors before reading the validator's rules). Endpoint: https://api.keploy.io/client/v1/mcp
- get_auth_status - Check whether the current MCP session has a valid-looking Keploy PAT attached. Returns { authenticated, auth_header_present, hint }. Useful as a first step for an agent to decide whether to call other tools or show setup instructions — differentiates 'no header at all' from 'header present but wrong prefix / unsupported format'. Endpoint: https://api.keploy.io/client/v1/mcp
- get_coverage_gaps - Get API coverage and prioritized suggestions for uncovered endpoints. Endpoint: https://api.keploy.io/client/v1/mcp
- get_session_report - Fetch the report for a completed run. ONE tool, THREE report kinds — the response's top-level `kind` field discriminates which kind it is (rerecord / sandbox_run / test_suite_run) and which question the report answers (see core glossary's "three reports"). Read `kind` first, then pick the matching reading rules below; do NOT assume the kind from how you got here.

Call this as the final step of the playbook, AFTER you read the terminal NDJSON event (phase=done) and confirmed data.ok=true. Pass app_id and test_run_id — extract test_run_id from data.test_run_id on the phase=done line of the progress_file returned by record_sandbox_test or replay_sandbox_test (for replay_test_suite, the CLI prints test_run_id to stdout instead).
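
Reading the terminal event out of the progress file might look like this sketch. It assumes each NDJSON line is an object with top-level `phase` and `data` keys, as the field paths above suggest; the helper names are hypothetical.

```python
import json


def find_done_event(lines):
    """Return the last phase=done event from NDJSON lines, or None."""
    done = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # a partially written trailing line while tailing
        if isinstance(event, dict) and event.get("phase") == "done":
            done = event
    return done


def extract_test_run_id(done_event) -> str:
    """Return data.test_run_id once the run finished with data.ok=true."""
    if done_event is None:
        raise RuntimeError("no phase=done event yet; keep tailing")
    data = done_event.get("data") or {}
    if data.get("ok") is not True:
        raise RuntimeError("run finished with data.ok != true")
    return data["test_run_id"]
```

With the progress file open as `fh`, `extract_test_run_id(find_done_event(fh))` yields the value to pass as test_run_id.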

===== OUTPUT SHAPE =====
(Conditional verbosity so the dev isn't drowned in noise on a green run.)
  * Always includes totals at the SUITE level only (total_suites / passed_suites / failed_suites) and a per_suite array where each entry carries suite_id, suite_name, total_steps, passed_steps, failed_steps. Aggregate step counts across suites are intentionally omitted — they hide where damage actually is.
  * PER-KIND READING of passed_steps / failed_steps — same column names, different meaning per kind:
      - RERECORD (kind=rerecord): passed_steps = steps whose auto-replay byte-comparison matched the live capture. failed_steps = steps that diverged on auto-replay. EVEN IF every suite shows passed_steps == total_steps, the rerecord is only successful when every suite is also linked=true (a sandbox test got produced). Always check `linked`; the step counts alone do not indicate "did the rerecord work".
      - SANDBOX_RUN (kind=sandbox_run): passed_steps = steps whose assertions held under captured-mock replay. failed_steps = assertion failures or response diffs against the captured baseline.
      - TEST_SUITE_RUN (kind=test_suite_run): passed_steps = steps whose assertions held against the live app. failed_steps = same against live, no mocks involved. No linkage to report.
  * Top-level `kind` discriminates the report: `"rerecord"` for record_sandbox_test runs (rerecord report — answers "did the sandbox test get created and linked?"), `"sandbox_run"` for replay_sandbox_test runs (sandbox run report — answers "does the suite still hold up against its captured baseline?"), `"test_suite_run"` for replay_test_suite runs (test suite report — live execution, no mocks; answers "does the suite hold up against the actual current system?"). Use kind to pick the right reading; do NOT mix them in one response.
  * RERECORD runs (kind="rerecord") carry a `linked` bool + `test_set_id` string on every per_suite[] entry. linked=true means the rerecord produced a sandbox test for the suite (replay-ready). linked=false means rerecord did NOT produce a sandbox test for the suite — it cannot be replayed until rerecord succeeds. ALWAYS surface this on rerecord output — even when every step's capture passed at the wire level, a suite without a sandbox test is a real failure. For the per-suite table, add a "Linked" column (yes/no from per_suite[].linked). For the one-line all-green reply, report "N/N suites passed, L/N have a sandbox test (test_run_id=<id>)".
  * When any suite has failures (or verbose=true), also includes failed_steps[] with per-step diagnostics (suite, step name, method+url, diff excerpt, error, mock_mismatches, assertion_failures, mock_mismatch_failure, authored_assertions, authored_response_body) PLUS mock_mismatch_failed_steps (count) and mock_mismatch_dominant (bool — true when the majority of failed steps have unconsumed recorded mocks, which points at a keploy-side egress-hook issue rather than dev app breakage). On RERECORD, failed_steps[] also carries `linked` (whether the owning suite has a sandbox test after this rerecord) and the mock_mismatch_* fields are suppressed (irrelevant in rerecord context).
  * authored_assertions / authored_response_body — the SUITE's authored contract for the failing step (the assert array and response.body as defined when the suite was created/updated). Surfaced inline so route B vs route C can be decided without a second getTestSuite round-trip. KEY DECISION POINT: if any authored_assertions entry is pinned to the value the diff shows as "expected" (e.g. assert {path: "$.order.status", expected: "created order"} and the diff says "expected 'created order', got 'created'"), route C is MANDATORY — re-record alone leaves that assertion stuck on the old contract and the next rerecord/replay will gate-1-fail on the same step. If authored_assertions is empty/absent (suite asserts nothing structural on that field), route B or route-C-without-assertion-edit may suffice.
  * When everything passes and verbose is false, failed_steps is omitted.

===== HOW TO RESPOND TO THE DEV =====
  * status == "all_passed" AND kind == "sandbox_run" → ONE-LINER: "<passed_suites>/<total_suites> suites passed (test_run_id=<id>)". Do not dump the JSON, do not list per-suite rows unless asked.
  * status == "all_passed" AND kind == "test_suite_run" → ONE-LINER: "<passed_suites>/<total_suites> suites passed live (test_run_id=<id>)". No mocks involved, no linkage to report.
  * status == "all_passed" AND kind == "rerecord" → ONE-LINER including linkage: "<passed_suites>/<total_suites> suites passed, <linked>/<total> linked (test_run_id=<id>)" where <linked> = count of per_suite[] entries with linked=true. If linked < total, ALSO list the unlinked suite names so the dev knows which ones are silently broken (skip sandbox replay on them, or investigate the linking failure). Never drop linkage reporting on rerecord even when it's all green.
  * status == "has_failures" → response MUST contain (in order, no collapsing rows even when failures look homogeneous — the dev needs the full inventory):
      1. per-suite table — one row per suite in per_suite (passing suites included), columns = Suite name | passed/total steps.
      2. failed-steps table — ONE ROW per entry in failed_steps[], columns = Suite | Step name | Method + URL | Expected → Actual status | mock_mismatch y/n.
      3. Diagnosis + Recommendation (rules below).
    Do NOT print aggregate step totals across suites.
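
The three all-green one-liners above could be produced as in the sketch below. It assumes `status`, `kind`, `passed_suites`, `total_suites`, and `per_suite` sit at the top level of the report (the exact nesting is not stated here), and it takes `test_run_id` as a separate argument; the function name and the "; unlinked:" suffix format are hypothetical.

```python
def all_passed_reply(report: dict, test_run_id: str) -> str:
    """Build the one-line reply for a fully green run, by report kind."""
    if report["status"] != "all_passed":
        raise ValueError("has_failures needs the full three-part response")
    passed, total = report["passed_suites"], report["total_suites"]
    kind = report["kind"]
    if kind == "sandbox_run":
        return f"{passed}/{total} suites passed (test_run_id={test_run_id})"
    if kind == "test_suite_run":
        return f"{passed}/{total} suites passed live (test_run_id={test_run_id})"
    if kind == "rerecord":
        suites = report["per_suite"]
        unlinked = [s["suite_name"] for s in suites if not s.get("linked")]
        linked = len(suites) - len(unlinked)
        line = (
            f"{passed}/{total} suites passed, {linked}/{len(suites)} linked "
            f"(test_run_id={test_run_id})"
        )
        if unlinked:
            # Linkage is always reported on rerecord, even when green.
            line += "; unlinked: " + ", ".join(unlinked)
        return line
    raise ValueError(f"unknown report kind: {kind!r}")
```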

Frame the diagnosis from the glossary: a mock mismatch IS the signal that the sandbox test has drifted from current app behavior. The three routes below (SKIP / FIX-CODE / FIX-TEST-RERECORD) are not separate buckets — they're three possible SOURCES of that drift:
  * keploy proxy didn't replay correctly → drift is artificial, no real change → route A (SKIP).
  * app regressed → drift is unintended, fix the code → route B.
  * contract changed on purpose → drift is intentional, refresh the sandbox test → route C.
Your repo inspection picks which source applies; the routes are the prescription for that source.

DIAGNOSE WITH THE REPO, NOT THE DEV. Before recommending anything on a failing run, inspect the source tree yourself (git log / git diff against the last green run or main, read the failing handler + its downstream call sites). DO NOT ask the dev "did you change X since the last green run" — you have the repo, find the answer. Only come back with a concrete conclusion.

  * mock_mismatch_dominant == true → failure signature is "keploy didn't intercept the app's egress traffic". Use git to check whether the failing endpoints or their dependency wiring have been modified recently:
      (a) NO relevant changes → tell the dev this is almost certainly a KEPLOY-SIDE issue and ask them to file a keploy issue with test_run_id. Do NOT ask them to re-record.
      (b) Relevant changes EXIST → name them (file:line or commit hash), explain how each plausibly caused the failure, say whether the change looks intended or accidental, and tell the dev exactly what to fix.
  * status == "has_failures" AND mock_mismatch_dominant == false → same discipline: identify the commit(s) / diff hunks that most likely caused each failure, state whether they look intended, and prescribe a fix (rerecord, revert, patch the handler). Don't hand the investigation back to the dev.

===== HANDLING "FIX IT" FOLLOW-UPS =====
(After the dev has seen the analysis and asks you to fix.)

═══════════════════════════════════════════════════════════════════
DO NOT JUMP TO RECORD — diagnose FIRST.
═══════════════════════════════════════════════════════════════════

A sandbox-replay failure is NOT a signal to rerecord. Re-recording without diagnosis silently captures the broken behavior as the new "expected" — masking a real app regression and erasing the evidence the dev needs.

When sandbox replay fails, your FIRST move is ALWAYS the diagnosis below (B vs C vs SKIP). You only call record_sandbox_test as part of route C, AND only AFTER update_test_suite has updated the suite to match the new intentional contract. If the contract hasn't changed (route B), DO NOT record — the captured mocks are still valid; only the app needs fixing.

If you find yourself thinking "let me just rerecord to fix this", STOP. Read failed_steps, inspect the repo for what changed, decide which route applies. Re-recording is a tool for capturing a NEW intentional contract, not a remedy for a failed run.

You have exactly THREE options for each failing step. Pick one per step based on your repo inspection; do not ask the dev which branch to take, decide:

  A. SKIP — do nothing code-side. Pick this when mock_mismatch_dominant=true AND your repo inspection found no relevant changes in the failing handler or its dependencies. Rationale: this is a keploy egress-hook / proxy issue; editing the app or the test won't help. Tell the dev "flagged for keploy support, no app or test change needed" and move on to the next step (if any) or close.

  B. FIX THE CODE — edit the handler / dependency wiring. Pick this when your repo inspection shows a recent change that broke the endpoint's contract AND the ORIGINAL test intent still matches what the endpoint SHOULD do (the test is correct, the code regressed). Make the minimal edit to restore expected behavior, tell the dev exactly which file:line you changed and why, then re-run: call replay_sandbox_test for the suite(s) whose steps you just un-broke. DO NOT record — the captured mocks are still valid if the contract hasn't changed intentionally.

  C. UPDATE-FIRST, THEN RECORD — order matters: (1) update_test_suite first, (2) record_sandbox_test second, (3) replay_sandbox_test to verify. Calling record before update means you'd capture mocks against the OLD suite shape — defeats the purpose. Pick this when the endpoint's contract LEGITIMATELY changed (a deliberate new field, renamed response key, different status code, new required header) AND your repo inspection confirms the change is intended (commit message, surrounding diff, or obvious product direction). The update_test_suite call should edit the step's body / expected response / assertions / extract to match the new contract. Tell the dev which assertions you updated and why the contract change is considered intentional.

  ╔═══ ROUTE C — DECISION + RECOMMENDATION TEMPLATE (use verbatim) ═══╗
  Decision input: read failed_steps[].authored_assertions and authored_response_body INLINE in this report. Do NOT call getTestSuite again unless those fields are absent (older runs).
    * If an authored assertion's expected value matches the diff's "expected" side → route C is MANDATORY. The suite's contract pins the old value; you MUST update_test_suite before record_sandbox_test, otherwise the next rerecord gate-1-fails on the same assertion and the suite comes back unlinked.
    * If authored_response_body has the old value but no assert is pinned to it → route C is still recommended (the captured response baseline drifts), but record_sandbox_test alone CAN succeed; choosing update_test_suite first keeps the suite source-of-truth aligned with the new contract.
    * If neither pins the diverging value → route C without assertion edits is sufficient (or route B if the change is unintentional).

  Mandatory recommendation phrasing for the dev (one bullet per failing step that routes to C):
    "(1) update_test_suite for suite '<suite_name>' (id=<suite_id>) — change step '<step_name>' (id=<step_id>): set <field_path> from '<old>' to '<new>' and update assertion <assert_index> on the same path; (2) record_sandbox_test on that suite to refresh the captured baseline; (3) replay_sandbox_test to verify."

  BANNED wording — never write any of these on a route-C recommendation:
    × "re-record the sandbox tests so the baseline picks up the new value"
    × "just rerecord to refresh the captured response"
    × "re-record and the new value will become the expected"
    × "re-record OR update assertions" (or any phrasing that joins update_test_suite and record_sandbox_test with "or" / "either … or" / "one of these two")
    × "you can either update the assertions or re-record"
    × "options: (a) update assertions, (b) re-record the suite"
  All six drop step (1) or present the two steps as interchangeable. They are NOT alternatives; they are sequential steps in a single route-C flow: (1) update_test_suite, (2) record_sandbox_test, (3) replay_sandbox_test. Skipping (1) leaves the suite's authored assertion pinned on the old value; the next replay gate-1-fails on the same diff. If you catch yourself reaching for "or" between these two tools on a route-C recommendation, restate using the mandatory template.
  ╚════════════════════════════════════════════════════════════════════╝
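
The three decision-input cases above can be sketched as a small classifier. This is an illustration only: the function name `route_c_strength` is hypothetical, the assertion shape (`path` / `expected`) is taken from the example used elsewhere in these instructions, and the flat path-to-value form of `authored_response_body` is an assumption made to keep the sketch short.

```python
def route_c_strength(authored_assertions, authored_response_body, path, old_value):
    """Classify how strongly route C applies to one failing step.

    authored_assertions: list of {"path": ..., "expected": ...} dicts (assumed shape).
    authored_response_body: dict mapping a field path to its authored value
        (assumed flat shape, for illustration only).
    path / old_value: the diverging field and the "expected" side of the diff.
    """
    pinned = any(
        a.get("path") == path and a.get("expected") == old_value
        for a in authored_assertions
    )
    if pinned:
        # update_test_suite must precede record_sandbox_test
        return "mandatory"
    if authored_response_body.get(path) == old_value:
        # record alone can succeed; updating first keeps the suite aligned
        return "recommended"
    # nothing pins the diverging value, so no assertion edit is needed
    return "no_assertion_edit"
```

A step that returns `"mandatory"` is the one that needs the template phrasing with all three numbered steps.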

Multiple failing steps can land in DIFFERENT branches — e.g. one step is a real app regression (B), another is a contract change (C). In that case, explain the split up-front, apply each fix, and run sandbox replay once at the end covering every affected suite.

After any B or C branch completes, the final message uses the same 3-subsection format (per-suite table → failed-steps table → diagnosis + recommendation) on the follow-up sandbox replay, PLUS a short "Fix applied" preamble naming the file:line edits (for B) or update_test_suite calls (for C). For A-only responses (all failures route to keploy), no follow-up run is needed — just restate the keploy-issue recommendation.

===== REPLAY / "EXPLAIN MY LATEST SANDBOX REPORT" =====

When the dev asks "explain my latest sandbox report" / "analyse the last run" / "why did it fail" — call this tool again with the SAME app_id + test_run_id and verbose=true so the full diagnostics come back even if nothing failed. Use that detail to answer their question. If you don't have the test_run_id to hand, list the app's most recent runs OF THE RIGHT KIND via /client/v1/apps/{app_id}/test-runs?kind=<rerecord|sandbox_run|test_suite_run> and pick the top one. NEVER list /test-runs without the kind filter and pick the latest blindly — different kinds are co-mingled in that collection, and an unfiltered list will surface a rerecord run when the dev asked for the latest sandbox replay (or vice versa). Match the kind to what the dev asked: "explain my latest record" → kind=rerecord; "explain my latest sandbox replay" / "integration test report" → kind=sandbox_run; "explain my latest live run" → kind=test_suite_run. If the dev's verb is ambiguous, ASK which kind first (per the verb-routing's explain-branch rule). Endpoint: https://api.keploy.io/client/v1/mcp
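
The kind-matching rule above might be sketched as follows. The request labels in `KIND_BY_REQUEST` and the helper name `runs_path_for` are hypothetical; only the path and the three `kind` values come from the text.

```python
from urllib.parse import urlencode

# Hypothetical labels for what the dev asked about, mapped to the run kind.
KIND_BY_REQUEST = {
    "record": "rerecord",
    "sandbox replay": "sandbox_run",
    "live run": "test_suite_run",
}


def runs_path_for(app_id, request_kind):
    """Build the filtered test-runs path; refuse to list without a kind."""
    kind = KIND_BY_REQUEST.get(request_kind)
    if kind is None:
        # Ambiguous verb: the instructions say to ask the dev first.
        raise ValueError("ambiguous request: ask the dev which kind they mean")
    return f"/client/v1/apps/{app_id}/test-runs?{urlencode({'kind': kind})}"
```

Raising on an unknown label makes the unfiltered list impossible to request by accident.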
- get_setup_instructions - Return step-by-step instructions for connecting this MCP server to Cursor, Windsurf, VS Code, Claude Desktop, Claude Code, or Trae. Safe to call without authentication — use this when the user asks 'how do I set up Keploy MCP?' or when a tool call fails with an auth error. Endpoint: https://api.keploy.io/client/v1/mcp
- importRecording - POST /apps/{appId}/recordings/{testSetId}/import — Import test case changes into a recording — Bulk import test case changes: update existing test cases (by ID), insert new ones (without ID), and delete specified test cases. Pass `branch_id` (query or body) to scope the import to a branch overlay. Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- link_mock - Attach a mock to a test case's mock-mapping doc. Use this when the dev has authored a mock (via create_mock or capture) and wants the runtime to actually consume it when replaying a specific test case.

Idempotent — re-linking an already-linked mock is a no-op (no duplicate entry). `branch_id` is REQUIRED — direct writes to main via MCP are blocked.

Mirror tool: unlink_mock. Endpoint: https://api.keploy.io/client/v1/mcp
- listAPIKeys - GET /api-keys — List API keys — Requires scope: `admin`. Endpoint: https://api.keploy.io/client/v1/mcp
- listApps - GET /apps — List apps — Returns the tenant's apps. Use the optional `q` query parameter to name-filter (case-insensitive substring, e.g. `?q=orderflow` → apps whose name contains 'orderflow'); without it the full paginated list is returned. Callers that know the app's folder / repo name should pass it as `q` to avoid paginating through hundreds of apps. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- listAppsByCluster - GET /apps/by-cluster/{clusterId} — List apps in a cluster — Returns apps belonging to a specific cluster. More efficient than iterating all apps. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- listAppsWithRecordings - GET /apps/with-recordings — List proxy apps with network recordings — Returns all k8s-proxy apps (origin.type=PROXY). These apps are auto-created by the Keploy k8s-proxy agent on first recording and contain network recordings of ingress HTTP traffic (as Keploy test cases) and egress dependency calls — database queries, external API calls, message queues — captured as Keploy mocks. Use listRecordings and getRecording to access the recorded request/response pairs and dependency mocks from live environments. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- listBranches - GET /apps/{appId}/branches — List Keploy branches on an app — Returns every Keploy branch on an app, optionally filtered by status. Used by MCP write tools and CI integrations to enumerate branches without shelling out to the CLI. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- listClusters - GET /clusters — List clusters — Returns all clusters for the authenticated company. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- listGenerationHistory - GET /apps/{appId}/generation-history — List generation history — Requires scope: `read`. Returns all entries (no pagination). Endpoint: https://api.keploy.io/client/v1/mcp
- listJobs - GET /jobs — List jobs — Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- listLoadTestRuns - GET /apps/{appId}/load-tests — List load test runs — Requires scope: `read`. Returns all runs (no pagination). Endpoint: https://api.keploy.io/client/v1/mcp
- listMocks - GET /apps/{appId}/recordings/{testSetId}/mocks — List mocks for a recording — Returns mock reference metadata and optionally parsed mock specs for a test set. Use ?include_specs=true to download and parse the actual mock YAML from object storage. Pass `branch_id` to scope the read to a branch overlay (bundle-uploaded recordings on a branch are invisible to main reads). Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- listRecordings - GET /apps/{appId}/recordings — List recording sessions — Returns test sets (recording sessions) for an app. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- listSuiteReports - GET /apps/{appId}/test-runs/{runId}/suite-reports — List suite reports for a test run — Requires scope: `read`. Supports cursor-based pagination. Endpoint: https://api.keploy.io/client/v1/mcp
- listTestCaseReports - GET /apps/{appId}/test-reports/{reportId}/test-set-reports/{testSetReportId}/test-cases — List test case reports — Returns individual test case results with expected/actual diffs within a test set report. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- listTestReports - GET /apps/{appId}/test-reports — List test run reports — Returns test run report summaries for an app with pass/fail counts. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- listTestRuns - GET /apps/{appId}/test-runs — List test runs — List test runs for an app. Optional `kind` query param filters by run kind: `rerecord` (record_sandbox_test runs), `sandbox_run` (replay_sandbox_test runs), or `test_suite_run` (replay_test_suite live runs). Omit to return runs of every kind. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- listTestSetReports - GET /apps/{appId}/test-reports/{reportId}/test-set-reports — List test set reports within a run — Returns per-test-set results within a test run report. Requires scope: `read`. Endpoint: https://api.keploy.io/client/v1/mcp
- listTestSuites - GET /apps/{appId}/test-suites — List test suites — List test suites for an app. Optional `has_sandbox_test` query param filters by sandbox-test linkage: `true` returns only suites that have a sandbox test (linked=true / test_set_id populated); `false` returns only suites without one. Omit to return every suite. Requires scope: `read`. Supports cursor-based pagination. Endpoint: https://api.keploy.io/client/v1/mcp
- list_branches - List Keploy branches on an app. Use this BEFORE any write tool (create_test_suite, update_test_suite, sandbox flows) to see if a branch already exists for the dev's current work. If one matches the dev's git branch, pass its branch_id (or branch_name) to the write tool. Otherwise call create_branch.

Status is one of: open, review_requested, approved, changes_requested, merged, closed, conflict.

Optional status filter narrows the list (e.g., "open" hides merged/closed branches).

Output: {"branches": [{id, name, status, created_at, updated_at}, ...]}. Synchronous — no playbook to walk. Endpoint: https://api.keploy.io/client/v1/mcp
- normalizeSuiteReport - POST /apps/{appId}/test-runs/{runId}/suite-reports/{reportId}/normalize — Normalize a suite report — Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- normalizeTestRun - POST /apps/{appId}/test-runs/{runId}/normalize — Normalize a test run — Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- record_sandbox_test - Record (or refresh) the sandbox test for one or more existing test suites — captures the request/response per step plus the outbound mocks (DB, downstream HTTP, etc.) against the dev's locally-running app, then links the captures onto the suite. Use this when the dev says "record", "rerecord", "re-record", "refresh the recordings", "capture mocks", or as the RECORD step in FROM-SCRATCH (after create_test_suite).

This tool resolves the app (if only a hint is given), resolves ONE OR MORE suites to record (by exact ids OR case-insensitive name substring match), and delegates to a headless playbook. Output produces a RERECORD REPORT — it answers "did the sandbox test get created and linked successfully?".

╔═══ PRE-CHECK — DID YOU ARRIVE HERE FROM A FAILED REPLAY? ═══╗
This tool refreshes the CAPTURED BASELINE (mocks + recorded request/response per step). It does NOT modify the suite's authored assert array or response.body — those are the contract as defined when the suite was created/updated. If the contract changed and you re-record without updating the suite first, the new rerecord fires the suite's stale assertions against the live app, gate-1-fails on the same diff, and the suite comes back unlinked.

Before calling THIS tool in response to a failed replay_sandbox_test or replay_test_suite, walk these checks:

  1. Read failed_steps[].authored_assertions and authored_response_body in the most recent get_session_report (kind=sandbox_run / test_suite_run). The fields are inlined — no second tool call needed unless the report predates the inlined fields.
  2. For each failing step: does any authored assertion pin the diverging value? (e.g. assert {path: "$.order.status", expected: "created order"} where the diff says "expected 'created order', got 'created'".)
       * YES → call update_test_suite FIRST to update that assertion + the response.body field, THEN call this tool.
       * NO → safe to call this tool directly; the captured baseline drifts but no authored assertion blocks the rerecord.
  3. If you can't find authored_assertions in the report (older format) AND don't already know the suite's shape, call getTestSuite({app_id, suite_id, branch_id}) to inspect the assert array before deciding. Don't guess.

REFUSE-RULE: if the dev confirms a contract change is intentional and the failing step has a pinned authored assertion on the diverging value, you MUST run update_test_suite before this tool. Calling record_sandbox_test FIRST in that case is the bug this pre-check exists to prevent — don't justify it as "let's just refresh the baseline first". The order is update → record → replay; never record → update.
╚═══════════════════════════════════════════════════════════════╝
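
A minimal sketch of the pre-check as an ordering function. The name `precheck_order` and the `fetch_assertions` callback (standing in for a getTestSuite lookup on older reports) are assumptions for illustration; the step keys mirror the report fields named above.

```python
def precheck_order(step, diff_path, diff_expected, fetch_assertions=None):
    """Return the tool-call order for one failing step after a failed replay.

    step: dict from failed_steps[]; may lack 'authored_assertions' on older reports.
    fetch_assertions: optional zero-argument callable that returns the suite's
        assert array (hypothetical stand-in for inspecting the suite).
    """
    assertions = step.get("authored_assertions")
    if assertions is None:
        if fetch_assertions is None:
            # Check 3: do not guess when the report has no inlined assertions.
            raise LookupError("no authored_assertions; inspect the suite before deciding")
        assertions = fetch_assertions()
    pinned = any(
        a.get("path") == diff_path and a.get("expected") == diff_expected
        for a in assertions
    )
    order = ["record_sandbox_test", "replay_sandbox_test"]
    # A pinned assertion forces update -> record -> replay.
    return ["update_test_suite"] + order if pinned else order
```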

===== BEFORE CALLING — one-time setup =====

(a) APP_ID RESOLUTION (skip if app_id is already known):
    * Derive a likely app name from the cwd's basename (e.g. cwd=/home/dev/orderflow → "orderflow"). Lowercase it.
    * Call listApps({q: "<cwd-basename>"}) — the server does a case-insensitive server-side substring match, so you don't paginate the full tenant list (can be hundreds of apps on shared accounts).
    * Exactly one match → use its id. Multiple → list them and ASK the dev which one (a wrong app_id silently routes traffic + suite creates into the wrong app). Exception: if the compose file / repo layout unambiguously pins one candidate (e.g. compose has service "producer" and one candidate is "<folder>.producer" while others are unrelated siblings), you may pick it AND tell the dev up-front so they can correct.
    * Zero matches → ASK permission to create a new Keploy app with the derived name; on yes, call createApp({name, endpoint}) and use the returned id.
    * Alternatively pass app_name_hint to THIS tool and the server resolves it (same rules; multiple/zero → typed error).
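
The match-count rules in (a) can be sketched like this. `derive_app_name` and `resolve_app` are hypothetical helper names, and `matches` stands for whatever a listApps query on the basename returned; the compose-file exception is left out of the sketch.

```python
import posixpath


def derive_app_name(cwd):
    """Lowercased basename of the working directory, e.g. 'orderflow'."""
    return posixpath.basename(cwd.rstrip("/")).lower()


def resolve_app(matches):
    """Map the listApps result to the next action.

    matches: list of {'id': ..., 'name': ...} dicts (assumed shape).
    Returns (action, payload).
    """
    if len(matches) == 1:
        return ("use", matches[0]["id"])
    if not matches:
        # Zero matches: ask permission before creating a new app.
        return ("ask_to_create", None)
    # Several matches: list them and ask the dev which one.
    return ("ask_which", [m["name"] for m in matches])
```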

(b) KEPLOY BINARY VERIFICATION:
    * Bash: "keploy --version" (or "~/.keploy/bin/keploy --version"). If it exits non-zero the binary is missing.
    * If missing OR older than this MCP server was built against, install/upgrade: curl --silent -O -L https://keploy.io/ent/install.sh && source install.sh
    * Re-verify with "keploy --version"; fail loudly if still absent (tell the dev where keploy put the binary so they can add it to PATH).
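
As an illustration of the presence check in (b), the sketch below looks the binary up on PATH and runs `--version`. The helper name `keploy_available` is hypothetical, and the version comparison against the MCP server build is not modeled because the text does not say how that version is exposed.

```python
import shutil
import subprocess


def keploy_available(binary="keploy"):
    """True when the binary is on PATH and '--version' exits zero."""
    path = shutil.which(binary)
    if path is None:
        return False
    try:
        result = subprocess.run([path, "--version"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```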

===== DOCKER-COMPOSE NETWORK RULE (absolute) =====

Use the SAME compose file + service that was used in the validate-curl phase. Do NOT point keploy at a second "keploy-only" compose file — docker-compose isolates each file into its own project + network, so the app container spawned by keploy cannot reach the DB/Kafka containers that validate brought up (and the network-name collision blocks keploy from starting). Correct flow:
  (i)   Validate phase: "docker compose up -d" (brings up app + deps on network <project>_default).
  (ii)  Before calling record_sandbox_test, Bash: "docker compose stop <app_service> && docker compose rm -f <app_service>" — stop ONLY the app service; leave deps running so keploy's new app container can reach them on the existing network.
  (iii) Pass app_command = "docker compose up <app_service>" (same compose file, same project → same network). container_name = the actual name set by compose (e.g. "orderflow-producer", not "producer").
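
Steps (ii) and (iii) could be assembled as below. This is a sketch under assumptions: `compose_record_args` and the `pre_command` key are hypothetical names, while `app_command` and `container_name` are the parameters named in the text.

```python
import shlex


def compose_record_args(app_service, container_name):
    """Build the pre-record shell command and the tool arguments.

    Only the app service is stopped and removed; dependencies keep running
    on the existing compose network.
    """
    svc = shlex.quote(app_service)
    return {
        "pre_command": f"docker compose stop {svc} && docker compose rm -f {svc}",
        "app_command": f"docker compose up {svc}",
        "container_name": container_name,
    }
```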

===== RESOLUTION RULES (server-side, no guessing) =====
  1. App: caller provides app_id OR app_name_hint. With a hint, the server does listApps({q: hint}). Zero matches → typed error; multiple → typed error listing them so Claude asks the dev.
  2. Suites: DEFAULT IS "ALL LINKED". When the dev says "record my sandbox tests" / "rerecord everything" / "refresh my recordings" with no specific suite named, LEAVE BOTH suite_ids AND suite_name_hint UNSET. Do NOT list suites first and pass a comma-joined UUID list back — the CLI resolves "every linked suite for the app" itself, cleaner and less brittle. Only pass a narrower selector when the dev explicitly names suites:
       - suite_ids (comma-separated, exact) — when you already have the IDs.
       - suite_name_hint (case-insensitive substring match) — when the dev names suites by human phrasing like "the auth suite" or "deterministic". Every suite whose name contains the substring is recorded.
     If the dev asks to record suites that don't exist yet (zero match) → typed error. Any ≥1 match is fine.
     DO NOT prompt the dev for which suites to record — default to all linked if they didn't name any.
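
The suite-selection default can be expressed as a tiny argument builder. The function name `suite_selector` is hypothetical; the two keys are the selector parameters described above.

```python
def suite_selector(suite_ids=None, name_hint=None):
    """Return only the selector arguments the dev's request justifies.

    With nothing named, both selectors stay unset so the CLI resolves
    every linked suite itself.
    """
    if suite_ids:
        return {"suite_ids": ",".join(suite_ids)}
    if name_hint:
        return {"suite_name_hint": name_hint}
    return {}
```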

===== DISCOVERY — when the dev hands you a bare suite_id with no app_id / branch_id =====

Suites live on a (app_id, branch_id) tuple. A bare suite_id has NO on-disk hint about which app or branch holds it; you have to RESOLVE both before calling this tool. Walk these steps in order — STOP as soon as getTestSuite returns 200:

  1. Detect the dev's git branch:  Bash `git rev-parse --abbrev-ref HEAD` in app_dir.
     If exit non-zero / output is "HEAD" → not a git repo / detached HEAD; ASK the dev for the Keploy branch name.

  2. Resolve candidate apps via the cwd basename:  Bash `basename "$(pwd)"` → call listApps with q=<basename>. Usually 1–2 candidates. If 0 → ASK; if >1 → walk every candidate through steps 3–4.

  3. For each candidate app, call list_branches({app_id}) and find the branch whose `name` matches the git branch from step 1. That gives you {branch_id}. If no match → not this app, try next.

  4. Verify with getTestSuite({app_id, suite_id, branch_id=<from step 3>}). 200 → resolved; 404 → wrong app/branch, try next.

  5. If steps 2–4 exhaust without a hit, walk every OPEN branch on each candidate app via list_branches → getTestSuite. Then try main (branch_id omitted). If still nothing → ASK the dev for the {app_id, branch_id} pair.

The standard pattern when "search the suite by id" returns nothing is NOT "give up and ask the dev which app" — it's "the suite exists on a BRANCH, walk discovery". Suites created via create_test_suite + rerecord on a Keploy branch are INVISIBLE to a main-view listTestSuites; you have to scope each call to a branch.

After resolving once in a session, REUSE the {app_id, branch_id} for any subsequent suite-targeted call (replay_sandbox_test, update_test_suite, replay_test_suite); don't re-walk discovery for every action.
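
The discovery walk can be sketched with the API calls injected as plain callables. Everything here is illustrative: `discover` is a hypothetical name, `list_branches(app_id)` is assumed to return dicts with `id`, `name`, and `status`, and `get_test_suite(app_id, suite_id, branch_id)` is assumed to return True on a 200 and False on a 404.

```python
def discover(suite_id, git_branch, app_ids, list_branches, get_test_suite):
    """Return (app_id, branch_id) for a bare suite_id, or None to ask the dev.

    branch_id is None when the suite resolves on main.
    """
    # Steps 3-4: the Keploy branch named after the dev's git branch.
    for app_id in app_ids:
        for branch in list_branches(app_id):
            if branch["name"] == git_branch and get_test_suite(app_id, suite_id, branch["id"]):
                return app_id, branch["id"]
    # Step 5a: every open branch on each candidate app.
    for app_id in app_ids:
        for branch in list_branches(app_id):
            if branch["status"] == "open" and get_test_suite(app_id, suite_id, branch["id"]):
                return app_id, branch["id"]
    # Step 5b: main (branch_id omitted).
    for app_id in app_ids:
        if get_test_suite(app_id, suite_id, None):
            return app_id, None
    return None
```

Caching the returned pair for the rest of the session matches the reuse rule above.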

===== PREREQUISITES =====
  - app_command: shell command that starts the dev's app (e.g. "docker compose up producer").
  - app_url: base URL the app listens on, e.g. http://localhost:8080.
  - app_dir: absolute path to repo root.
  - container_name if app_command is docker-compose.
  - keploy binary on PATH. If `which keploy` returns nothing, install it before calling this tool with: `curl --silent -O -L https://keploy.io/install.sh && source install.sh`.

===== AFTER CALLING — walk the playbook =====

The response includes a "playbook" array; execute its steps in order. The flow is HEADLESS — one background process, NDJSON progress events on a local file, no separate HTTP surface to bind. THERE IS NO SEPARATE CLEANUP STEP — the CLI exits on its own once phase=done is written.
  1. Spawn the `keploy record sandbox --cloud-app-id …` process via Bash (run_in_background). Capture its PID into $KEPLOY_PID.
  2. Poll progress by repeatedly calling Bash with `tail -n 1 $PROGRESS_FILE`. Each call returns instantly; the MCP round-trip between calls paces the loop. DO NOT wrap in a sleep loop — Claude Code's Bash rejects standalone `sleep N` and chained-sleep patterns. Read .phase off each line; stop when phase=done. The wait_for_done step's built-in `kill -0 $KEPLOY_PID` check is the safety-net for silent early-exit (CLI died before writing the terminal event) — it lets the loop exit instead of spinning forever on a dead process.
  3. Read the terminal event (last line of $PROGRESS_FILE). It carries data.ok, data.error (on failure), data.test_run_id (on success).
  4. On data.ok=true: call get_session_report(app_id, test_run_id) with verbose=true to surface the rerecord report. On data.ok=false: show data.error to the dev directly (optionally tail the log_file for stderr context) and SKIP get_session_report (there's no run to fetch).

Auto-replay + linkTestSetToSuite run INSIDE the CLI process before it writes phase=done — if the terminal event says ok=true, linkage already happened. You do NOT need to wait for a separate post-success window; the CLI doesn't exit until it's fully done.

INTERRUPTED FLOWS: if your conversation dies between step 1 and step 2 (Claude crashes, connection drops, dev cancels), the CLI keeps running in the background. It's not orphaned — it'll finish its run and write phase=done. To abort early, the dev can `pkill -f "keploy.*sandbox"` manually; otherwise just let it complete and resume by re-reading the progress file on the next turn.
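
One poll tick of the playbook might look like the sketch below: read the last progress line, then fall back to a PID-alive check. The helper names are hypothetical, and `os.kill(pid, 0)` as an existence probe is a POSIX assumption (on Windows the call behaves differently).

```python
import json
import os


def last_event(text):
    """Parse the last non-empty NDJSON line, or None if absent or half-written."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    try:
        return json.loads(lines[-1])
    except json.JSONDecodeError:
        return None


def pid_alive(pid):
    """POSIX-only probe: signal 0 checks existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def poll_once(progress_file, pid):
    """Return ('done' | 'running' | 'dead', last_event_or_None)."""
    try:
        with open(progress_file, encoding="utf-8") as fh:
            event = last_event(fh.read())
    except FileNotFoundError:
        event = None
    if event is not None and event.get("phase") == "done":
        return "done", event
    return ("running" if pid_alive(pid) else "dead"), event
```

A `'dead'` result corresponds to the silent early-exit case, where the log file is the next thing to read.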

===== NDJSON SCHEMA — the contract =====

Every line in the progress_file is one JSON object with this envelope:

  {
    "ts":      "<RFC3339-nano>",
    "command": "record" | "test",
    "phase":   "<phase-name>",
    "message": "<optional human-readable>",
    "data":    { ... phase-specific ... }   // optional
  }

The phase vocabulary is intentionally extensible: new lifecycle phases get added over time as the CLI grows (started, agent_up, app_starting, suites_running_start, record_done, auto_replay_skipped, upload_done, linking_done, etc.). There are only TWO cases the AI must handle programmatically; everything else is informational and you should NOT switch on phase names you don't recognize:

  * phase != "done"  → keep polling. Optional: surface message/data to the dev as ambient progress ("agent is starting...", "suites uploading..."), but never branch on a specific intermediate phase name.

  * phase == "done"  → terminal event. Stop polling. The data envelope carries:
      - data.ok            bool      true on success, false on failure
      - data.error         string    (only on ok=false) one-line failure summary
      - data.test_run_id   string    (only on ok=true) pass to get_session_report
      - data.app_id        string    echo of the app_id passed to the tool
      - data.artifact_dir  string    local path to captured/replayed artifacts
      - data.dashboard_url string    UI link to drill into the run

If you observe a phase you don't recognize, IGNORE it and keep polling. If "done" itself is renamed by a future CLI version, the wait_for_done step's PID-alive guard is your safety net (the poll loop exits when the CLI dies); surface log_file contents to the dev.
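
A minimal sketch of the two-case handling, assuming the envelope fields listed above. The `Terminal` dataclass and `parse_line` are hypothetical names; unknown phases fall through to "keep polling" without being inspected.

```python
import json
from dataclasses import dataclass


@dataclass
class Terminal:
    ok: bool
    error: str
    test_run_id: str
    dashboard_url: str


def parse_line(line):
    """Return a Terminal for phase == 'done', else None (keep polling)."""
    event = json.loads(line)
    if event.get("phase") != "done":
        return None  # informational phase, recognized or not
    data = event.get("data") or {}
    return Terminal(
        ok=bool(data.get("ok")),
        error=data.get("error", ""),
        test_run_id=data.get("test_run_id", ""),
        dashboard_url=data.get("dashboard_url", ""),
    )
```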

===== "ALL SUITES FAILED CAPTURE" — special signal =====

If you see a `phase: "auto_replay_skipped"` event with `message: "all suites failed during rerecord; skipping replay + linking"` ahead of the terminal `done` event, every suite failed at the CAPTURE phase (before auto-replay even ran). The CLI fails closed in this case — auto-replay and suite linking are SKIPPED, so every per_suite entry comes back linked=false.

Watch for this trap: the terminal event reports `data.ok=true` because the CLI itself completed cleanly (it didn't crash; it just had nothing to record successfully). DO NOT read data.ok=true as "rerecord succeeded"; read `<linked>/<total>` instead. If linked == 0, this is a HARD failure that needs diagnosis, not a partial-linkage case.

ALWAYS surface the dashboard URL on this case. The terminal `done` event still carries `data.dashboard_url` and `data.test_run_id` (atg's TestSuiteRun was created during the capture phase); emit them verbatim so the dev can drill into per-step failures in the UI:
    "0/N suites have a sandbox test — every suite failed during the capture phase, so auto-replay and linking were skipped. Dashboard: <data.dashboard_url> (test_run_id=<id>)"

EDGE CASE: if `data.test_run_id` is empty, atg never inserted a TestSuiteRun (typically a pre-flight validation failure — branch-id rejection, app unreachable, etc.). The dashboard URL won't resolve. Skip the URL, surface the log_file contents instead so the dev can read the early-stage failure.

Recovery is the same as WHEN linked=false below — read failed_steps for each suite and pick route B (fix code) / C (update suite + record again) / SKIP. Don't infra-retry; capture-phase failures across every suite usually mean the app is broken, the suite shapes are stale, or the dev's local app isn't reachable.
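
The "ok=true but nothing linked" trap can be made explicit in code. The function name and the outcome labels below are hypothetical; the inputs are the terminal event's `data.ok` and `data.test_run_id` plus the linked and total suite counts from the report.

```python
def linkage_outcome(ok, linked, total, test_run_id):
    """Classify a finished record run by linkage, not by data.ok alone."""
    if not ok:
        return "cli_failed"
    if linked == 0:
        # Hard failure: every suite failed capture. Without a test_run_id the
        # dashboard URL will not resolve, so the log file is surfaced instead.
        return "hard_failure_show_dashboard" if test_run_id else "hard_failure_show_log"
    return "all_linked" if linked == total else "partial_linkage"
```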

===== LINKAGE VERIFICATION =====

After get_session_report returns, for EVERY suite that went into this record, call getTestSuite({suite_id}) and check whether the suite has a sandbox test (linked=true / non-empty test_set_id). A suite without a sandbox test cannot be replayed — replay_sandbox_test will 400 on it with "no sandboxed tests" until a successful record produces one.

===== WHEN linked=false — recovery rules =====

A suite with linked=false after record_sandbox_test means the record process couldn't produce a sandbox test for that suite. The SUITE ITSELF still exists; it just has no sandbox test. Diagnose WHY by reading the rerecord report's failed_steps for that suite:

  * No failed_steps OR pure infra error (link-commit / upload failed, no step diverged) → call record_sandbox_test AGAIN scoped to just the unlinked suite_ids. The tool is idempotent on the suite; safe to re-run.

  * failed_steps with assertion diffs (response shape, body fields, status code shifted from what the suite expected) → the suite is stale relative to current app behavior. The CONTRACT changed:
      - Change is INTENTIONAL (new field, renamed key, different status code is the new normal) → call update_test_suite to update the affected step's response / assertions to match the new contract, THEN call record_sandbox_test on the updated suite.
      - Change is UNINTENTIONAL (app regressed) → fix the app code first, then call record_sandbox_test. No suite update needed; the original test was correct.

  * failed_steps with 500s / handler crashes / connection refused → the app is broken at the wire level. Fix the app, then call record_sandbox_test. Don't update_test_suite to absorb a real failure.

NEVER:
  * Call create_test_suite to "redo" the suite. It already exists; re-creating it authors a duplicate (see BEFORE CREATING in create_test_suite).
  * Blindly loop record_sandbox_test without diagnosing failed_steps first. If the cause is a suite-vs-app mismatch, retries won't help.
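
The recovery rules can be sketched as a lookup from failure type to an ordered plan. The per-step `kind` key and its two values are hypothetical labels the caller would assign after reading failed_steps, and `fix_app` is a placeholder for a manual code fix rather than a tool.

```python
def recovery_plan(failed_steps, intentional):
    """Ordered actions for one unlinked suite.

    failed_steps: list of dicts with a hypothetical 'kind' key, either
        'assertion_diff' or 'app_error' (500s, crashes, connection refused).
    intentional: whether the contract change was confirmed as deliberate;
        only consulted for assertion diffs.
    """
    kinds = {step["kind"] for step in failed_steps}
    if not kinds:
        # No diverging step: treat as infra and re-run on the same suite.
        return ["record_sandbox_test"]
    if "app_error" in kinds:
        # Never absorb a real failure into the suite.
        return ["fix_app", "record_sandbox_test"]
    if intentional:
        return ["update_test_suite", "record_sandbox_test"]
    return ["fix_app", "record_sandbox_test"]
```

No branch returns create_test_suite, which matches the rule against re-creating an existing suite.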

===== MANDATORY OUTPUT — Phase 2 section =====

Your final message to the dev MUST contain a section with this exact heading (do NOT collapse into a single pass/fail table with the rerecord report; do NOT merge with Phase 1 or Phase 3):

    ### Phase 2 — Sandbox-test linkage
    **<linked>/<total> suites have a sandbox test**

    _Suites with a sandbox test_
    | Suite name | suite_id | test_set_id | Capture pass/total |
    | --- | --- | --- | --- |
    | <name> | <suite_id> | <test_set_id> | <p>/<t> |
    (emit even if zero — one row per linked suite, or "_(none)_" in place of rows)

    _Suites without a sandbox test_  (omit ONLY if every suite linked)
    | Suite name | suite_id | Likely cause |
    | --- | --- | --- |
    | <name> | <suite_id> | gate1 / gate2 / infra |
Likely-cause decoding: assertion diffs → gate 1 upstream-replay failure; upstream-passing + mock-replay-diff → gate 2 mock-determinism mismatch; zero failures + still unlinked → infra link-commit issue.

Then proceed to replay_sandbox_test ONLY for the suites that DID link; the unlinked ones will 400 on replay.
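
The body of the Phase 2 section (everything under the heading) could be rendered as below. `linkage_tables` and the input dict keys are hypothetical; the caller is assumed to emit the exact heading line separately.

```python
def linkage_tables(suites):
    """Render the count line and the two tables for the Phase 2 section.

    suites: list of dicts with name, suite_id, test_set_id ('' if unlinked),
        passed, total, and cause (gate1 / gate2 / infra, for unlinked suites).
    """
    linked = [s for s in suites if s.get("test_set_id")]
    unlinked = [s for s in suites if not s.get("test_set_id")]
    lines = [
        f"**{len(linked)}/{len(suites)} suites have a sandbox test**",
        "",
        "_Suites with a sandbox test_",
        "| Suite name | suite_id | test_set_id | Capture pass/total |",
        "| --- | --- | --- | --- |",
    ]
    if linked:
        lines += [
            f"| {s['name']} | {s['suite_id']} | {s['test_set_id']} | {s['passed']}/{s['total']} |"
            for s in linked
        ]
    else:
        lines.append("_(none)_")
    if unlinked:  # omitted only when every suite linked
        lines += [
            "",
            "_Suites without a sandbox test_",
            "| Suite name | suite_id | Likely cause |",
            "| --- | --- | --- |",
        ]
        lines += [f"| {s['name']} | {s['suite_id']} | {s['cause']} |" for s in unlinked]
    return "\n".join(lines)
```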

===== DO NOT =====

  * DO NOT fall back to raw keploy CLI (`keploy rerecord -t …`) if the MCP tool drops mid-flow — the CLI subcommand runs test-sets directly and does NOT update the suite's test_set_id. See MCP DISCONNECT RECOVERY in the top-level instructions. Endpoint: https://api.keploy.io/client/v1/mcp
- replay_sandbox_test - Replay the sandbox test for one or more suites against captured mocks — re-runs the suite's steps against the dev's locally-running app while keploy serves outbound calls (DB, downstream HTTP, etc.) from the captured mocks. Use this when the dev says "replay", "run my sandbox tests", "integration-test", "check if mocks still match" — keywords "sandbox" / "replay" / "mocks" / "integration-test" all map here. Also the REPLAY STEP of FROM-SCRATCH: call this LAST (after create_test_suite + record_sandbox_test) to give the dev the whole-app regression picture against the freshly captured mocks. Output produces a SANDBOX RUN REPORT — it answers "does the suite still hold up against its captured baseline?".

═══════════════════════════════════════════════════════════════════
DISAMBIGUATION — pick this tool vs. replay_test_suite:
═══════════════════════════════════════════════════════════════════

USE replay_sandbox_test (THIS TOOL) when the dev says:
  * "run my sandbox tests" / "replay my sandbox tests"
  * "integration-test my app" / "run the integration tests"
  * "check if my mocks still match" / "replay against the captured mocks"
  * "rerun my sandbox suite" (with the word "sandbox")
  Trigger keyword: an explicit "sandbox" / "replay" / "mocks" / "integration-test". Any of these signals that the dev wants captured-mock replay, NOT live-app execution.

USE replay_test_suite INSTEAD when the dev says:
  * "run the test suite" / "run my test suites" (bare — no "sandbox")
  * "execute test suite X" / "run suite 810d3ebe…"
  * "test the suite again" / "smoke test against the live app"
  Bare verbs ("run / test / execute") applied to "the suite" without the word "sandbox" mean LIVE-APP execution, NOT captured-mock replay. replay_test_suite hits the dev's running localhost app directly via HTTP — no docker spin-up, no mocks.

After a record_sandbox_test run, the natural next step is THIS tool (replay against the just-captured mocks). After create_test_suite / update_test_suite, the natural next step is replay_test_suite (validate against the live app). When the dev's verb is bare and the prior turn doesn't make the intent obvious, ASK rather than picking sandbox-replay silently — code-change regressions can hide under "mock didn't match" failures.
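
A rough sketch of the disambiguation as keyword matching. This is a heuristic illustration, not how the server routes requests: `route_replay`, the keyword tuples, and the `"ask"` fallback label are assumptions, and prior-turn context is not modeled.

```python
# Keywords that signal captured-mock replay (substring match, lowercase).
SANDBOX_KEYWORDS = ("sandbox", "replay", "mock", "integration-test", "integration test")
# Phrases that signal live-app execution when no sandbox keyword is present.
LIVE_KEYWORDS = ("suite", "live app")


def route_replay(utterance):
    """Return the tool name for a run/replay request, or 'ask' when unclear."""
    text = utterance.lower()
    if any(k in text for k in SANDBOX_KEYWORDS):
        return "replay_sandbox_test"
    if any(k in text for k in LIVE_KEYWORDS):
        return "replay_test_suite"
    return "ask"
```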

═══════════════════════════════════════════════════════════════════
DISCOVERY — when the dev hands you a bare suite_id with no app_id / branch_id:
═══════════════════════════════════════════════════════════════════

Suites live on a (app_id, branch_id) tuple. A bare suite_id has NO on-disk hint about which app or branch holds it; you have to RESOLVE both before calling this tool. Walk these steps in order — STOP as soon as getTestSuite returns 200:

  1. Detect the dev's git branch:  Bash `git rev-parse --abbrev-ref HEAD` in app_dir.
     If exit non-zero / output is "HEAD" → not a git repo / detached HEAD; ASK the dev for the Keploy branch name.

  2. Resolve candidate apps via the cwd basename:  Bash `basename "$(pwd)"` → call listApps with q=<basename>. Usually 1–2 candidates. If 0 → ASK; if >1 → walk every candidate through steps 3–4.

  3. For each candidate app, call list_branches({app_id}) and find the branch whose `name` matches the git branch from step 1. That gives you {branch_id}. If no match → not this app, try next.

  4. Verify with getTestSuite({app_id, suite_id, branch_id=<from step 3>}). 200 → resolved; 404 → wrong app/branch, try next.

  5. If steps 2–4 exhaust, walk every OPEN branch on each candidate app via list_branches → getTestSuite. Then try main (branch_id omitted). If still nothing → ASK the dev for the {app_id, branch_id} pair.

After resolving once in a session, REUSE the {app_id, branch_id} for subsequent suite-targeted calls; don't re-walk discovery for every action.

SCOPE — whole-app vs single-suite:
  * Default: LEAVE suite_ids UNSET → the tool resolves "every suite for the app that has a sandbox test (test_set_id populated)" and replays them all. Use this for "run my sandbox tests" / "check if my tests still pass" — whole-app regression. New suites auto-pick up.
  * Single / subset: PASS suite_ids when the dev names specific suites — "replay sandbox test for suite 810d3ebe-…", "replay only the auth suite", "run suite X and Y". The tool validates each requested id is actually a suite with a sandbox test (has test_set_id); an unlinked id gets a precise "record first" error instead of an opaque downstream CLI failure.

This tool resolves the app, picks the suite set per the rule above, and returns a single playbook that drives the replay for them. It does NOT record.

WHAT THIS TOOL DOES INTERNALLY (so you don't have to):
  1. Resolves app_id — use the explicit app_id if the caller has one; otherwise pass app_name_hint (usually the cwd basename) and the server does listApps with a substring match. Multiple matches → error listing them; zero matches → error suggesting the dev generate a suite first.
  2. Lists test suites for the app, keeps only those with a non-empty test_set_id. Zero linked → typed "no linked sandbox tests" error.
  3. If suite_ids was passed, validates every requested id is in the linked-suites set; unlinked ids → typed error pointing to record_sandbox_test.
  4. Returns the headless playbook — walk it exactly: spawn CLI in background, tail the progress file (PID-alive guard built in), read the terminal event, fetch the report. No separate cleanup step — the CLI exits on its own.
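
Steps 2 and 3 above amount to a filter plus a membership check. A minimal sketch, assuming each suite is a dict with `id` and `test_set_id` keys (the key names are taken from the prose; the dict shape and the exception classes are hypothetical, and the server's real error types are not shown in this document):

```python
class NoLinkedSandboxTests(Exception):
    """Hypothetical error: the app has no suite with a test_set_id."""


class UnlinkedSuites(Exception):
    """Hypothetical error: some requested suite ids have no sandbox test."""


def select_suites(suites, suite_ids=None):
    """Pick the suite ids to replay, following the SCOPE rule."""
    # Keep only suites that have a sandbox test (non-empty test_set_id).
    linked = [s for s in suites if s.get("test_set_id")]
    if not linked:
        raise NoLinkedSandboxTests("no linked sandbox tests")
    # suite_ids unset -> whole-app regression over every linked suite.
    if not suite_ids:
        return [s["id"] for s in linked]
    # suite_ids set -> every requested id must be a linked suite.
    linked_ids = {s["id"] for s in linked}
    unlinked = [sid for sid in suite_ids if sid not in linked_ids]
    if unlinked:
        raise UnlinkedSuites(f"record first with record_sandbox_test: {unlinked}")
    return list(suite_ids)
```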

===== PREREQUISITES =====
(Same as record_sandbox_test — if you just recorded, you already have them. Same docker-compose network rule applies: use the same compose file + service, stop the app service before calling, leave deps running.)
  - app_command: shell command that starts the dev's app (e.g. "docker compose up producer").
  - app_url: base URL the app listens on, e.g. http://localhost:8080.
  - app_dir: absolute path to repo root.
  - container_name: required when app_command is a docker compose command.
  - keploy binary on PATH. If `which keploy` returns nothing, install it before calling this tool with: `curl --silent -O -L https://keploy.io/ent/install.sh && source install.sh` (the enterprise installer; the OSS build has no sandbox subcommand).

===== AFTER CALLING — walk the playbook =====

Same headless playbook shape as record_sandbox_test:
  1. Spawn `keploy test sandbox --cloud-app-id …` in the background via Bash.
  2. Poll `tail -n 1 $PROGRESS_FILE` repeatedly. No sleep loops; the wait_for_done step has a built-in `kill -0 $KEPLOY_PID` guard, so the loop exits if the CLI dies silently.
  3. Read the terminal NDJSON event (phase=done, data.ok, data.test_run_id).
  4. If ok=true, call get_session_report(app_id, test_run_id) with verbose=true at the end.

No separate cleanup step is needed; the CLI exits cleanly once phase=done is written.
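
Reading the terminal event can be sketched as below. The field names `phase`, `data.ok`, and `data.test_run_id` come from the description above; everything else (the function name, treating an unparsable last line as "still running", and the non-terminal phase value used in the test data) is an assumption.

```python
import json


def read_terminal_event(progress_text: str):
    """Return (ok, test_run_id) once the last NDJSON line has phase=done, else None."""
    lines = [ln for ln in progress_text.splitlines() if ln.strip()]
    if not lines:
        return None
    try:
        event = json.loads(lines[-1])
    except json.JSONDecodeError:
        return None  # assumed: a partially written line means "poll again"
    if event.get("phase") != "done":
        return None
    data = event.get("data", {})
    return bool(data.get("ok")), data.get("test_run_id")
```

When this returns `(True, run_id)`, the next step in the playbook is the get_session_report call with that run id.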

===== MANDATORY OUTPUT — Phase 3 section =====

Your final message to the dev MUST contain a section with this exact heading (do NOT merge with Phase 2; do NOT compress the failed-steps table even when failures are homogeneous):

    ### Phase 3 — Sandbox run report

Under it, emit the uniform three-subsection format owned by get_session_report:
  (i)  per-suite table — one row per suite in per_suite, passing suites included, columns = Suite name | passed/total steps.
  (ii) failed-steps table — ONE ROW per entry in failed_steps[], columns = Suite | Step name | Method + URL | Expected → Actual status | mock_mismatch y/n. Never collapse rows.
  (iii) Diagnosis + Recommendation (see get_session_report description for case-specific rules around mock_mismatch_dominant, repo-diff inspection, and the SKIP / FIX-CODE / FIX-TEST branching for fix-it follow-ups).

Do NOT print aggregate step totals across suites — they mix unrelated suites and hide where damage actually is.
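
As an illustration of subsections (i) and (ii), the sketch below renders both tables from a report dict. Only the `per_suite` and `failed_steps` keys are named in this document; the inner keys (`suite_name`, `passed`, `total`, `suite`, `step`, `method`, `url`, `expected_status`, `actual_status`, `mock_mismatch`) are hypothetical placeholders for whatever get_session_report actually returns. Subsection (iii) is prose and is left out.

```python
def render_phase3_tables(report: dict) -> str:
    """Render the per-suite and failed-steps tables, one row per entry."""
    out = ["### Phase 3 — Sandbox run report", ""]
    # (i) per-suite table: passing suites are included, no cross-suite totals.
    out += ["| Suite name | passed/total steps |", "| --- | --- |"]
    for s in report["per_suite"]:
        out.append(f"| {s['suite_name']} | {s['passed']}/{s['total']} |")
    # (ii) failed-steps table: exactly one row per failed step, never collapsed.
    out += [
        "",
        "| Suite | Step name | Method + URL | Expected → Actual status | mock_mismatch |",
        "| --- | --- | --- | --- | --- |",
    ]
    for f in report["failed_steps"]:
        mismatch = "y" if f["mock_mismatch"] else "n"
        out.append(
            f"| {f['suite']} | {f['step']} | {f['method']} {f['url']} "
            f"| {f['expected_status']} → {f['actual_status']} | {mismatch} |"
        )
    return "\n".join(out)
```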

===== ROLLUP LINE =====

Close the message with a final one-line rollup paragraph (no heading), in addition to the three phase sections. Mention the TOTAL number of suites replayed (which may exceed the count created in this session, because replay_sandbox_test covers every linked suite the app has). Example:
  "_Rollup: inserted 4 suites, 4/4 with sandbox tests after record, 3/4 suites passed sandbox replay across the app's 6 linked suites — 1 failure is likely keploy egress-hook, file an issue with the IDs above._"

===== DO NOT =====

  * DO NOT call update_test_suite or record_sandbox_test after this. The dev said RUN, not REFRESH.
  * DO NOT fall back to raw keploy CLI (`keploy test …`) if the MCP tool drops mid-flow — CLI runs test-sets directly and does NOT write results back to the MCP-visible TestSuiteRun. See MCP DISCONNECT RECOVERY in the top-level instructions. Endpoint: https://api.keploy.io/client/v1/mcp
- replay_test_suite - Replay an existing test suite live against the dev's LOCAL APP (no mocks, no docker spin-up). Returns a playbook that delegates to the enterprise CLI `keploy test-suite`, which walks each suite's steps, fires HTTP requests at base_path, evaluates assertions, and uploads per-suite results to api-server. The CLI prints a final pass/fail summary table plus a "Report:" URL to stdout. The run produces a TEST SUITE REPORT that answers "does the suite hold up against the actual current system?".

═══════════════════════════════════════════════════════════════════
DISAMBIGUATION — pick this tool vs. replay_sandbox_test:
═══════════════════════════════════════════════════════════════════

USE replay_test_suite (THIS TOOL) when the dev says:
  * "run the test suite" / "run my test suites"
  * "execute test suite X" / "run suite 810d3ebe…"
  * "test the suite again" / "rerun the suite"
  * "validate the suite changes" (after editing a suite)
  * "smoke test against the live app"
  Default reading: bare verbs "run" / "execute" / "test" applied to "the suite" mean LIVE-APP execution, NOT replay against captured mocks.

USE replay_sandbox_test INSTEAD when the dev says:
  * "run my sandbox tests" / "replay my sandbox tests"
  * "integration-test my app" / "check if my mocks still match"
  * "replay the captured tests" / "run against the recorded mocks"
  Trigger keyword: "sandbox" / "replay" / "mocks" / "integration-test" — explicit signal that the dev wants captured-mock replay, not live-app.

After a record_sandbox_test run, the natural next step is replay_sandbox_test (replay against the freshly captured mocks). After create_test_suite / update_test_suite, the natural next step is replay_test_suite (validate the new/edited suite against the live app). When the dev's verb is bare ("run the suite") and the prior turn was create/update, prefer THIS tool. When the prior turn was record, ASK the dev if unsure — the verbs overlap and silently picking sandbox-replay can mask code-change failures with mock-replay noise.
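
The disambiguation rule can be approximated with a keyword check. This is a rough sketch, not the server's logic: the keyword list is copied from the "Trigger keyword" line above, while the function name, the `prior_tool` parameter, and the `"ASK"` sentinel are assumptions made for illustration.

```python
SANDBOX_KEYWORDS = ("sandbox", "replay", "mocks", "integration-test")


def pick_replay_tool(utterance: str, prior_tool: str = "") -> str:
    """Return the tool name to call, or "ASK" when the intent is ambiguous."""
    text = utterance.lower()
    # An explicit trigger keyword means captured-mock replay.
    if any(keyword in text for keyword in SANDBOX_KEYWORDS):
        return "replay_sandbox_test"
    # Bare verb right after a record run: the verbs overlap, so ask the dev.
    if prior_tool == "record_sandbox_test":
        return "ASK"
    # Bare verb otherwise (including after create/update): live-app execution.
    return "replay_test_suite"
```

Substring matching is deliberately crude here; it only mirrors the keyword rule and would need the judgment described above for phrasing the list does not cover.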

USE THIS for: re-running previously-created suites against a running local app — verifying a regression after a code change, smoke-testing a branch, re-validating after editing a suite.

DO NOT USE this for: validating a NEW suite that hasn't been inserted yet (use create_test_suite — it runs the suite twice as part of validation), or for running suites against the captured-mock copy of the app (use replay_sandbox_test — captured-mock replay flow).

═══════════════════════════════════════════════════════════════════
DISCOVERY — when the dev hands you a bare suite_id with no app_id / branch_id:
═══════════════════════════════════════════════════════════════════

Suites live on a (app_id, branch_id) tuple. A bare suite_id has no on-disk hint about which app or branch holds it; you have to RESOLVE both before calling this tool. Walk these steps in order — STOP as soon as getTestSuite returns 200:

  1. Detect the dev's git branch:  Bash `git rev-parse --abbrev-ref HEAD` in app_dir.
     If exit non-zero / output is "HEAD" → not a git repo / detached HEAD; ASK the dev for the Keploy branch name (don't invent one).

  2. Resolve candidate apps via the cwd basename:  Bash `basename "$(pwd)"` → call listApps with q=<basename> (case-insensitive substring match). Usually 1–2 candidates (e.g. "orderflow" → matches "orderflow" and "orderflow.producer"). If 0 → ASK the dev for the app_id; if >1 → walk every candidate through steps 3–4.

  3. For each candidate app, call list_branches({app_id}) and find the branch whose `name` matches the git branch from step 1. That gives you {branch_id, status}. If no match → that app's not the owner; try the next candidate. If status is closed/merged → ask the dev whether to use this branch anyway.

  4. Verify with getTestSuite({app_id, suite_id, branch_id=<from step 3>}). 200 → resolved; 404 → wrong app, try next candidate.

  5. If steps 2–4 exhaust without a hit, the suite is on a branch whose name doesn't match the git branch (the dev created it with a custom name, or it's on main). Then: call list_branches on each candidate app and try every OPEN branch's branch_id with getTestSuite, then try main (branch_id omitted). If still nothing → ASK the dev for the {app_id, branch_id} pair.

The reverse "look up suite_id globally" path doesn't exist — auditing is branch-scoped, so resolution starts from a branch context. After resolving once in a session, REUSE the {app_id, branch_id} for any subsequent suite-targeting call (delete_test_suite / update_test_suite / replay_test_suite); don't re-walk discovery for every action.

═══════════════════════════════════════════════════════════════════
INPUTS
═══════════════════════════════════════════════════════════════════

  * app_id (required) — Keploy app ID. Same value used for create_test_suite / list_branches.
  * branch_id (required) — Keploy branch UUID. Resolve via the explicit two-step flow BEFORE calling: (1) Bash `git rev-parse --abbrev-ref HEAD` in app_dir; (2) call create_branch tool with {app_id, name: <git branch>} — find-or-create returns {branch_id, ...}; pass it here. Direct main writes are blocked.
  * base_path (required) — base URL of the dev's local app, e.g. http://localhost:8080. Each suite step's relative path is appended to this.
  * suite_ids (optional) — list of suite IDs to run. Omit / empty = run every suite registered for app_id on the branch.
  * header (optional) — single header to inject into every request, e.g. "Cookie: session=…". Same shape as the CLI's -H flag.
  * app_dir (optional) — absolute path to the dev's repo root (where the app is running). Defaults to '.' (cwd). The CLI invocation cd's here.
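
Step (1) of the branch_id flow, detecting the git branch, could look like the sketch below. The git command is the one quoted above; the function names are hypothetical, and returning `None` is simply this sketch's way of signalling "ASK the dev".

```python
import subprocess
from typing import Optional


def parse_git_branch(returncode: int, stdout: str) -> Optional[str]:
    """Interpret `git rev-parse --abbrev-ref HEAD` output; None means ask the dev."""
    name = stdout.strip()
    # Non-zero exit: not a git repo. "HEAD": detached HEAD.
    if returncode != 0 or not name or name == "HEAD":
        return None
    return name


def detect_git_branch(app_dir: str = ".") -> Optional[str]:
    """Run git in app_dir and return the current branch name, if any."""
    try:
        proc = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            cwd=app_dir,
            capture_output=True,
            text=True,
        )
    except OSError:
        return None  # git missing or app_dir unusable
    return parse_git_branch(proc.returncode, proc.stdout)
```

The returned name is what step (2) would pass as `name` to create_branch.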

═══════════════════════════════════════════════════════════════════
HOW THIS TOOL WORKS
═══════════════════════════════════════════════════════════════════

This tool DOES NOT execute the suite itself. It returns a "playbook" — a small array of shell steps for you (Claude) to walk via Bash. The playbook spawns the enterprise CLI `keploy test-suite` in the foreground; the CLI:
  1. Validates the branch exists + is writable (fails fast with a clear message if not).
  2. Loads suites from api-server (filtered by --suite-id when supplied; otherwise every suite on the branch).
  3. For each suite: fires step requests at base_path, evaluates assertions, records per-step results.
  4. Uploads a TestSuiteRun + TestSuiteReport entry to api-server (?branch_id=<uuid>).
  5. Prints a summary table to stdout, exits 0 on all-pass / 1 on any failure.

Walk the playbook in order. Surface the CLI's stdout to the dev — the table shows which suites passed / failed / were "buggy" (suite-level verdict separate from individual step failures).

PREREQUISITES the playbook assumes:
  * The dev's app is up and reachable at base_path.
  * `keploy` binary is on PATH. If missing, install the enterprise build before calling this tool: `curl --silent -O -L https://keploy.io/ent/install.sh && source install.sh`.
  * Either ~/.keploy/cred.yaml exists (API key) or KEPLOY_API_KEY is exported. Endpoint: https://api.keploy.io/client/v1/mcp
- revokeAPIKey - DELETE /api-keys/{keyId} — Revoke an API key — Requires scope: `admin`. Endpoint: https://api.keploy.io/client/v1/mcp
- runTestSuites - POST /apps/{appId}/test-suites/run — Run test suites — Run test suites against a PUBLIC target URL. DO NOT use for local-app / localhost runs — base_url must be reachable from the SaaS backend (rejects loopback / private IPs as 400 'invalid baseURL'). For localhost runs use the MCP tool record_sandbox_test (keploy agent). Optional sandbox_mode field: ""|"rerecord"|"integration_test" — the sandbox modes are primarily used through MCP's record_sandbox_test / replay_sandbox_test tools. Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- run_and_report - Run test suites and return results with failures and coverage.

!! DO NOT USE for local-app "tests for my changes" flows !!
This tool sends the run to the SaaS backend which REJECTS private/localhost URLs ("IPv6 address is private / reserved"). It only works when base_url points at a PUBLIC, non-loopback address (a staging/prod deployment).

For local-app testing, use record_sandbox_test / replay_sandbox_test instead — they drive the keploy local agent which happily records against http://localhost. Endpoint: https://api.keploy.io/client/v1/mcp
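
A client-side pre-check can catch obviously non-public targets before a run is sent to the SaaS backend. The sketch below uses Python's standard `ipaddress` module; it is an approximation only, since the backend's actual validation rules are not documented here, and hostnames are not resolved through DNS.

```python
import ipaddress
from urllib.parse import urlsplit


def is_probably_public(base_url: str) -> bool:
    """Best-effort check that base_url is not loopback or private."""
    host = urlsplit(base_url).hostname
    if not host:
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: assumed public here because this sketch does not resolve it.
        return True
    return ip.is_global
```

When this returns `False`, the local tools named above (record_sandbox_test / replay_sandbox_test) are the ones to use.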
- scaffold_pipeline_workflow - Generate the exact CI workflow YAML to add keploy sandbox tests to a pull-request pipeline, and tell you where to write it. Use this when the dev asks to "add keploy sandbox tests to my pipeline" / "wire keploy into CI" / "run keploy on PR" / "add a CI job for keploy" — the server emits the file contents verbatim so you don't have to compose the flag list yourself.

===== GOAL =====

Write a CI workflow file that runs `keploy test sandbox --cloud-app-id <uuid> --app-url <url>` on pull requests and gates the PR on the result. NEVER kick off an actual test run in this flow — it is pure file authoring, ends with the file on disk. DO NOT fire replay_sandbox_test, record_sandbox_test, replay_test_suite, or any other run-starting MCP tool here.

===== HOW (absolute) =====

Call this tool. It returns { file_path, content, summary }. Write the "content" to "file_path" VERBATIM via your Write tool — NO flag renames, NO flag removals, NO step reordering, NO synthesis. The server owns the YAML template; your job is only to (1) resolve the inputs from the repo and api-server and (2) Write the returned content. Do NOT compose the YAML yourself from general knowledge — flag drift (missing --cloud-app-id, inventing --app) is the most common bug when Claude improvises.

DO NOT ASK the dev for confirmation before writing. Resolve everything from the repo + api-server, pick the GitHub Actions default, call this tool, Write the file. The dev's prompt is already the go-ahead.

===== STEPS =====

1. DETECT THE CI SYSTEM:
   * Default = GitHub Actions (biggest share). File = .github/workflows/keploy-sandbox.yml.
   * If .gitlab-ci.yml exists → GitLab (not yet supported by this tool; tell the dev and stop).
   * If .circleci/config.yml exists → Circle (not yet supported; tell the dev and stop).
   * Otherwise → GitHub Actions.

2. RESOLVE VALUES by calling MCP tools + reading the repo:
   * app_id: call listApps({q: "<cwd basename>"}). Exactly one → use its id. Multiple → pick the one whose name most specifically matches the repo's primary service (e.g. "orderflow.producer" wins over "orderflow" when there's a ./producer directory); mention which you picked in the final message. Zero → stop and tell the dev to create the app + rerecord first.
   * suite_ids: DO NOT pass this arg by default. An empty suite_ids means the CLI resolves "every linked sandbox suite for the app" at CI run time — which is what you want (new suites auto-pick up without workflow edits). The tool still verifies there's ≥1 linked suite at scaffold time so the first PR run doesn't fail empty-handed. Only pass suite_ids when the dev explicitly narrows ("run only the auth suite in CI"); don't pin "all current suites" — that's staleness waiting to happen.
   * compose_file: READ THE REPO. Default is docker-compose.yml. AVOID passing a docker-compose-keploy.yaml variant that has `networks: default: external: true` — those variants only work locally, where another compose run has already created the external network. In CI the runner starts clean and `external: true` fails with "network not found". If the primary docker-compose.yml brings up the full app (deps + app service), use it end-to-end.
   * app_service, container_name, app_port: read from the SAME compose_file you picked above. app_service = the service key (e.g. "producer"); container_name = that service's container_name: field in that same compose file (e.g. "orderflow-producer" if compose_file=docker-compose.yml, but "producer" if compose_file=docker-compose-keploy.yaml — THESE DIFFER, pick consistently); app_port = the host-side of its ports: mapping.
   * app_url = http://localhost:<app_port>. The tool derives this; you don't pass it separately.

3. CALL THIS TOOL with app_id, app_service, container_name, app_port, compose_file (and suite_ids only if the dev explicitly narrowed scope). It returns { file_path, content, summary }. Write the "content" to the "file_path" VERBATIM.
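
Step 1 above is a pair of file-existence checks. A minimal sketch follows, with a hypothetical function name and return shape (a CI label plus the workflow path, or `None` for the unsupported systems):

```python
from pathlib import Path


def detect_ci_system(repo_root: str):
    """Return (ci_system, workflow_path); the path is None when unsupported."""
    root = Path(repo_root)
    if (root / ".gitlab-ci.yml").exists():
        return "gitlab", None  # not yet supported: tell the dev and stop
    if (root / ".circleci" / "config.yml").exists():
        return "circleci", None  # not yet supported: tell the dev and stop
    # Default and fallback are both GitHub Actions.
    return "github-actions", ".github/workflows/keploy-sandbox.yml"
```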

===== FLAG NAME RULES (absolute, do not drift when reviewing the output) =====

  * `--cloud-app-id`  ← NOT `--app-id`. The OSS config has an `appId` uint64 field that viper maps `--app-id` into; passing a UUID there fails with "invalid syntax" before RunE runs.
  * `keploy test sandbox --cloud-app-id <uuid> --app-url <url>`  ← the CI form. NOT `keploy test --cloud-app-id` (must be `test sandbox` — the headless flags live on the sandbox subcommand only), NOT `keploy test-suite run` (that command doesn't exist). There is NO `--pipeline` flag.
  * Install URL = `https://keploy.io/ent/install.sh` ← NOT `https://keploy.io/install.sh` (OSS; no sandbox subcommand at all), NOT a github.com/keploy/keploy release tarball.

If the server-emitted content ever disagrees with these rules, trust the server output and file a bug — don't edit the YAML.
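
For reviewing the emitted workflow against the flag rules, a read-only checker is enough, since the YAML itself must not be edited. The sketch below only looks for the strings named in the rules above; the function name and the message wording are hypothetical.

```python
def check_flag_rules(content: str) -> list:
    """Return a list of rule violations found in workflow content (empty if none)."""
    problems = []
    if "--app-id" in content:
        problems.append("uses --app-id; the rule is --cloud-app-id")
    if "keploy test sandbox" not in content:
        problems.append("missing the `keploy test sandbox` form")
    if "keploy test-suite run" in content:
        problems.append("uses `keploy test-suite run`, which does not exist")
    if "--pipeline" in content:
        problems.append("uses --pipeline, which is not a flag")
    if "https://keploy.io/install.sh" in content:
        problems.append("uses the OSS install URL; the rule is https://keploy.io/ent/install.sh")
    return problems
```

A non-empty result is a reason to file a bug, not to rewrite the file.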

===== RESOLUTION ARGS =====
  * Pass either app_id (explicit UUID) or app_name_hint (substring; server does listApps and requires exactly one match).
  * Pass app_service (docker-compose service name), container_name (from compose container_name: field read from the SAME compose_file arg), and app_port (HTTP port the service exposes).
  * compose_file is optional, defaults to "docker-compose.yml". If the repo has a -keploy.yaml variant with `external: true` networks, do NOT point compose_file at it — it won't work in CI.
  * suite_ids is optional and should be LEFT BLANK by default — the CLI resolves every linked suite at run time. Only pin an explicit list when the dev narrows scope.
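
Reading app_service, container_name, and app_port from one compose file can be sketched as follows. The sketch assumes the compose file has already been parsed into a dict (for example with a YAML library, which is not imported here) and handles only the common `ports` forms; port ranges and IPv6 host bindings are not covered. The function names are hypothetical.

```python
def host_port(entry) -> int:
    """Extract the host-side port from one compose `ports` entry."""
    if isinstance(entry, dict):  # long syntax: {"target": 80, "published": 8080}
        return int(entry["published"])
    # Short syntax: "[ip:]host:container[/protocol]"
    parts = str(entry).split("/")[0].split(":")
    if len(parts) < 2:
        raise ValueError(f"no host-side port in {entry!r}")
    return int(parts[-2])


def resolve_compose_values(compose: dict, app_service: str) -> dict:
    """Collect the scaffold inputs for app_service from a parsed compose file."""
    default_net = (compose.get("networks") or {}).get("default") or {}
    if default_net.get("external"):
        # External default network: only works where another run created it.
        raise ValueError("compose file uses an external default network; unusable in CI")
    service = compose["services"][app_service]
    port = host_port(service["ports"][0])
    return {
        "app_service": app_service,
        "container_name": service.get("container_name"),
        "app_port": port,
        "app_url": f"http://localhost:{port}",
    }
```

Taking every value from the same parsed dict is what keeps container_name consistent with the chosen compose_file.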

===== FINAL RESPONSE — three short sections, no questions =====

    ### Created
    | File | Lines |
    | --- | --- |
    | .github/workflows/keploy-sandbox.yml | N |
    ### Summary
    - App: <name> (<app_id>), <N> linked suites replayed on every PR
    - Trigger: pull_request → main, + manual workflow_dispatch
    - Failure on any suite gates the PR (non-zero exit from the CLI)
    ### Before the first run, add this GitHub secret
    - `KEPLOY_API_KEY` — at https://github.com/<owner>/<repo>/settings/secrets/actions/new
      (self-hosted users — point at your own api-server by building the
      enterprise binary with -X main.api_server_uri=<url>; there is no
      runtime env override on the released binary.)

This tool does NOT run anything. It only generates file contents. Endpoint: https://api.keploy.io/client/v1/mcp
- startLoadTest - POST /apps/{appId}/load-tests — Start a load test — Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- stopJob - POST /jobs/{jobId}/stop — Stop a running job — Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- stopLoadTest - POST /apps/{appId}/load-tests/{runId}/stop — Stop a running load test — Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- streamJobEvents - GET /jobs/{jobId}/events — Stream job events (SSE) — Requires scope: `read`. Returns a `text/event-stream`. Endpoint: https://api.keploy.io/client/v1/mcp
- streamLoadTestEvents - GET /apps/{appId}/load-tests/{runId}/events — Stream load test events (SSE) — Requires scope: `read`. Returns a `text/event-stream`. Endpoint: https://api.keploy.io/client/v1/mcp
- unlink_mock - Detach a mock from a test case's mock-mapping doc. Idempotent — unlinking an already-unlinked mock is a no-op.

`branch_id` is REQUIRED — direct writes to main via MCP are blocked.

Mirror tool: link_mock. Endpoint: https://api.keploy.io/client/v1/mcp
- updateApp - PUT /apps/{appId} — Update an app — Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- updateMock - PUT /apps/{appId}/recordings/{testSetId}/mocks/{mockId} — Replace one mock's stored data — Updates the mock identified by `{mockId}` in the path. The body
carries the full replacement. Two shapes — pick ONE:

- **`mock_yaml`** (PREFERRED) — canonical mock YAML envelope.
  See createMock for the field-loss rationale.
- **`mock`** — typed OSS Mock JSON. Brittle for lowercase keys.

When both are sent, `mock_yaml` wins.

Branch-aware via the optional `branch_id` body field — same
semantics as createMock.

Accepts both the mock's UUID `_id` and its human-readable Name
as `mockId` — the server resolves names within the test set, so
UI callers (which don't have access to the mock's `_id`) can
pass the Name directly.

Requires scope: `write`.
Endpoint: https://api.keploy.io/client/v1/mcp
- updateTestCase - PUT /apps/{appId}/recordings/{testSetId}/test-cases/{testCaseId} — Update a test case — Update mutable fields of a recorded test case (name, http_req, http_resp). Pass `branch_id` to scope the edit to a branch overlay (bundle-uploaded test cases on a branch are invisible to main writes). Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- updateTestSuite - PUT /apps/{appId}/test-suites/{suiteId} — Update a test suite — Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp
- update_mock - Replace one mock's stored data. Accepts either the mock's UUID id or its human-readable Name as `mock_id` — the server resolves names within the test set, so MCP callers that don't have the UUID can pass the Name directly.

Body must include the full replacement Mock. Same dual authoring shape as create_mock: `mock_yaml` (preferred) or `mock` (typed). `branch_id` is REQUIRED — direct writes to main via MCP are blocked. Endpoint: https://api.keploy.io/client/v1/mcp
- update_test_suite - Edit an existing test suite — change one or more step bodies, assertions, headers, or remove/add steps. Returns a playbook that delegates to `keploy update-test-suite`, which validates the new state (static structural checks + 2 live runs for idempotency + GET-coupling check) and snapshot-replaces the suite via api-server.

POST-EDIT BEHAVIOUR: any structural change here (step method/url/body/headers/extract/assert, or add/delete steps) AUTOMATICALLY clears the suite's sandbox test server-side — the suite comes back as linked=false. Call record_sandbox_test on the updated suite before any sandbox replay; otherwise replay_sandbox_test will 400 with "no sandboxed tests". Cosmetic-only edits (name, description, labels) preserve the sandbox test.

═══════════════════════════════════════════════════════════════════
FETCH-FIRST RULE — required for the edit to be accepted:
═══════════════════════════════════════════════════════════════════

The api-server's replace handler rejects updates that preserve ZERO step IDs from the existing suite ("full rewrite, not an edit"). To make a real edit:

  1. Call getTestSuite first (or use download_recording / get_app_testing_context if you already have the suite). Capture each existing step's "id" field.
  2. Compose your new steps_json INCLUDING the existing "id" on every step you want to KEEP or EDIT. Omit "id" only on steps you're ADDING. Drop a step entirely from steps_json to DELETE it.
  3. Call this tool with that merged steps_json.

If you author a fresh JSON without the existing step IDs, the server rejects it with "preserves no steps from the existing suite". When that happens, your two options are: (a) re-author with IDs preserved (preferred — keeps history), or (b) call delete_test_suite then create_test_suite (loses history, fresh suite_id).
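
The fetch-first rule can be illustrated by building steps_json from the fetched suite instead of from scratch. This is a sketch under assumptions: steps are plain dicts whose only known field is `id`, and the helper name and its `edits` / `delete_ids` / `additions` parameters are hypothetical. The local check mirrors the server's rejection message but does not replace the server's own validation.

```python
import json


def build_steps_json(existing, edits=None, delete_ids=(), additions=()):
    """Merge edits into fetched steps, keeping ids on kept and edited steps."""
    edits = edits or {}
    merged = []
    for step in existing:
        if step["id"] in delete_ids:
            continue  # dropping a step from the list deletes it
        patch = {k: v for k, v in edits.get(step["id"], {}).items() if k != "id"}
        merged.append({**step, **patch})  # existing id is preserved
    kept = len(merged)
    # Added steps carry no id.
    merged.extend({k: v for k, v in s.items() if k != "id"} for s in additions)
    if existing and kept == 0:
        raise ValueError("preserves no steps from the existing suite")
    return json.dumps(merged)
```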

═══════════════════════════════════════════════════════════════════
DISCOVERY — when the dev hands you a bare suite_id with no app_id / branch_id:
═══════════════════════════════════════════════════════════════════

Suites live on a (app_id, branch_id) tuple. A bare suite_id has no on-disk hint about which app or branch holds it; you have to RESOLVE both before calling this tool. Walk these steps in order — STOP as soon as getTestSuite returns 200:

  1. Detect the dev's git branch:  Bash `git rev-parse --abbrev-ref HEAD` in app_dir.
     If exit non-zero / output is "HEAD" → not a git repo / detached HEAD; ASK the dev for the Keploy branch name.

  2. Resolve candidate apps via the cwd basename:  Bash `basename "$(pwd)"` → call listApps with q=<basename>. Usually 1–2 candidates. If 0 → ASK; if >1 → walk every candidate through steps 3–4.

  3. For each candidate app, call list_branches({app_id}) and find the branch whose `name` matches the git branch from step 1. That gives you {branch_id}. If no match → not this app, try next.

  4. Verify with getTestSuite({app_id, suite_id, branch_id=<from step 3>}). 200 → resolved; 404 → wrong app/branch, try next.

  5. If steps 2–4 exhaust, walk every OPEN branch on each candidate app, then try main (branch_id omitted). If still nothing → ASK the dev for the {app_id, branch_id} pair.

The getTestSuite call in step 4 is the one whose response you also use to capture every step's existing "id" for the FETCH-FIRST RULE above — so step 4 is actually a 2-for-1: discovery AND fetch-first happen on the same call.

After resolving once in a session, REUSE the {app_id, branch_id} for subsequent suite-targeted calls; don't re-walk discovery for every action.

═══════════════════════════════════════════════════════════════════
INPUTS
═══════════════════════════════════════════════════════════════════

  * app_id (required) — Keploy app id
  * suite_id (required) — UUID of the suite to update
  * branch_id (required) — Keploy branch UUID (resolve via the two-step flow before calling)
  * steps_json (required) — JSON array of the FULL desired step list. Each kept step MUST carry the existing "id". Same step shape as create_test_suite (response, extract, assert, etc — all static structural checks apply).
  * name / description / labels (optional) — overrides for top-level suite metadata
  * app_url (required) — base URL of the dev's running local app, e.g. http://localhost:8080. The CLI fires the new state TWICE against this for the idempotency check + GET-coupling check.
  * app_dir (optional) — repo root the CLI cd's into; defaults to "."

═══════════════════════════════════════════════════════════════════
HOW THIS TOOL WORKS
═══════════════════════════════════════════════════════════════════

This tool DOES NOT call api-server itself. It returns a 3-step playbook for you (Claude) to walk via Bash — same shape as create_test_suite:
  1. Write merged JSON to a temp file.
  2. Run `keploy update-test-suite --suite-id <id> --file <path> --branch-id <uuid> --base-url <url>` — runs every static structural check, fires the new state twice locally, applies the GET-coupling check, then POSTs the snapshot-replace.
  3. Clean up the temp file.

Walk the playbook in order. If step 2 exits non-zero, surface stdout to the dev — it has the rule violation / failure detail.

OUTCOMES the AI should recognize:
  * Exit 0 + stdout has "✓ suite updated:" + "View:" line → success. Surface the View URL to the dev.
  * Exit 1 + "preserves no steps from the existing suite" → fetch-first rule was missed. Re-author with step IDs preserved (or call delete_test_suite + create_test_suite as the documented escape hatch).
  * Exit 1 + structural-check violations → fix the suite per the violation messages, then REWRITE the suite file via Bash and RE-RUN this CLI command directly. DO NOT call update_test_suite again to retry — the playbook + file path are already valid; only the JSON content needs revision. The validator output includes a canonical step skeleton on structural failures.
  * Exit 2 + "couldn't reach the dev's app" → ensure the app is up at app_url and retry.
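
The outcomes above map onto a small classifier over the CLI's exit code and stdout. The matched strings are the ones quoted in the list; the function name and the returned labels are hypothetical, and any exit 1 without the fetch-first message is assumed to be a structural-check violation.

```python
def classify_update_outcome(returncode: int, stdout: str) -> str:
    """Map `keploy update-test-suite` exit code and stdout to an outcome label."""
    if returncode == 0 and "✓ suite updated:" in stdout and "View:" in stdout:
        return "success"  # surface the View URL to the dev
    if returncode == 1 and "preserves no steps from the existing suite" in stdout:
        return "fetch_first_missed"  # re-author with step ids preserved
    if returncode == 1:
        return "structural_violation"  # fix the JSON file, re-run the CLI directly
    if returncode == 2:
        return "app_unreachable"  # make sure the app is up at app_url, then retry
    return "unknown"
```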

PREREQUISITES the playbook assumes:
  * The dev's app is up and reachable at app_url.
  * `keploy` binary is on PATH. If missing, install before calling this tool: `curl --silent -O -L https://keploy.io/install.sh && source install.sh`.
  * Either ~/.keploy/cred.yaml exists or KEPLOY_API_KEY is exported. Endpoint: https://api.keploy.io/client/v1/mcp
- uploadRecordingBundle - POST /apps/{appId}/recordings/bundle — Atomic test set + cases + mocks + mappings ingest — Bundle ingest — creates the test set, every test case, every
mock, and the mapping doc in a single call. Each step is its
own DB write; partial failure leaves earlier rows in place, so
callers can replay safely.

Branch-aware via optional `branch_id` — when set, every row
lands on the overlay until merge.

Use this when authoring a recording from scratch (LLM
workflows, CLI imports). For incremental edits, prefer the
per-resource endpoints (`createMock`, `createTestCase`, etc.).

Requires scope: `write`.
Endpoint: https://api.keploy.io/client/v1/mcp
- upload_recording - Atomic test set + cases + mocks + mappings ingest. Creates the test set row, every test case, every mock, and the mapping doc in one call.

PREFER THE CLI FOR ON-DISK RECORDINGS. When the dev has a recorded test-set on disk (e.g. `./keploy/test-set-0/` produced by `keploy record`), invoke the CLI via Bash instead of calling this tool; it streams bytes from disk to server in one HTTP round-trip:

```
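# --app takes <namespace.deployment>; or use --cloud-app-id <uuid>
# --branch is optional: a UUID or a name (find-or-create on name)
# --test-set is a path or a name, e.g. keploy/test-set-0
# --name is optional and renames on the server
keploy upload test-set \
  --app <namespace.deployment> \
  --branch <uuid|name> \
  --test-set <path|name> \
  [--name <override>]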

The CLI path runs in ~3 seconds for a typical recording; calling this MCP tool directly with the same bundle inlined as args takes minutes because Claude has to serialize ~10K+ tokens of YAML/JSON through tool_use. Reserve this MCP tool for cases where the data is already in conversation context (e.g. you just generated test cases programmatically and don't want to round-trip to disk).

Each step is its own DB write; partial failure leaves earlier rows in place — callers can replay safely. `branch_id` is REQUIRED — direct writes to main via MCP are blocked. Every row lands on the branch overlay until merge.

`test_cases[].mock_names` lists the mocks each case consumes; the server folds these into the mapping doc on upload.

Returns { test_set, test_case_ids, mock_ids }. Endpoint: https://api.keploy.io/client/v1/mcp
- validateTestSuite - POST /apps/{appId}/test-suites/{suiteId}/validate — Validate a test suite — Run the suite against a public, non-loopback base URL to capture responses and run assertions. DO NOT use for local-app / localhost validation — the SaaS backend rejects private IPs with 500. For local apps, curl endpoints yourself (Bash) and pass the captured responses into create_test_suite directly. Requires scope: `write`. Endpoint: https://api.keploy.io/client/v1/mcp

## Resources
Not captured

## Prompts
Not captured

## Metadata
- Owner: io.github.keploy
- Version: 1.0.0
- Runtime: Streamable HTTP
- Transports: HTTP
- License: Not captured
- Language: Not captured
- Stars: Not captured
- Updated: Apr 17, 2026
- Source: https://registry.modelcontextprotocol.io
