Say you're building an AI agent capable of handling this request:

Investigate elevated 5XXs for tenant Acme in prod. Check metrics, search logs, correlate with recent deployments, summarize the likely cause, and if it is safe restart the affected instance.

I bet you'd start with the prompt, right? Adding instructions like "Make sure you find the actual root cause", "Read the logs carefully", "Verify your findings before taking destructive actions". Or even better, "Do not make any changes to AWS resources other than restarting affected instances."

You're thinking about the right problem: Let's prevent this agent from doing something bad. We're not dealing with a chatbot, which might just say something incorrect. The failure mode for this agent is that it might generate a perfectly valid tool call with the wrong tenant, the wrong environment, or the wrong parameter, and everything still looks syntactically correct on the way down. Looks right, kabooms the wrong thing.

A prompt is not a permission model. It is a hopeful string, which your LLM will strictly obey until it doesn't.

In both a chatbot that outputs something incorrect and an operations agent that kills the wrong server, we're looking at the same root cause: LLMs hallucinate. The difference between an incorrect answer and 💥 is in what the agent can do. The key point here is that once an agent can run tools, it effectively acts as a principal, and its execution backplane has to be secured the way you would secure an internal platform API. In this article I'll walk you through a solution for that.

This article is sponsored by Depot

At Depot we use AWS for our CI builds. To provide the fastest experience we create a standby pool of machines that are warmed and ready to take jobs.

The problem: How do you right-size standby pools when too few means latency spikes and too many means paying for idle instances?

Instead of guessing, we built a simulator using real customer data. We fed real job data in and compared results against our real-world latency and cost. We used hyperparameter optimization to test several thousand simulated parameter combinations in ~10 minutes.

Results:

  • Faster Builds: We decreased p99 latency by two seconds through simulation-driven scaling.

  • Lower Costs: Standby pool expenses dropped by 2%, proving that "faster" doesn't have to mean "more expensive."

This is the kind of engineering that goes into every part of Depot. We simulate, measure, and optimize so your builds are faster and cheaper without you having to think about it.

Read more about our simulation here. And if you want a CI platform built by a team that sweats the details, give Depot a try.

Autonomy is a Ladder, Not a Switch

One of the easiest ways to get agent design wrong is to talk about autonomy as if there were only two modes: human-in-the-loop or fully autonomous. That framing is too coarse to be useful. Here's a better model: a five-level ladder with explicit promotion criteria between levels, where every step up increases the blast radius and changes what you need from evals, observability, and permissions. If you cannot prove correctness, traceability, and containment at one level, promoting the system to the next one is an operational liability.

Level 0: Suggest

At Level 0, the agent does not execute anything. It proposes actions, plans, or outputs, but it does not call tools and it does not mutate state. An example would be a chatbot that just outputs migration plans, remediation suggestions, or cost-saving recommendations.

The main questions here are whether the output is any good, whether it follows the expected format, whether it hallucinates, and whether retrieval is correct when grounding or RAG is involved. Observability is still important even at this level: prompt version, model version, retrieval traces, output logging with redaction, latency, cost per request, and user feedback. Permissions should remain effectively read-only, apart from things like logging and protective controls such as PII redaction or secrets filtering.

Level 1: Execute read-only actions

This is where most production operations agents should start. At Level 1, the agent can call tools, but only tools that are read-only and non-mutating. Actions like querying logs, fetching metrics, reading tickets, inspecting configuration, summarizing incident timelines, analyzing invoices, or enumerating resources.

On paper this sounds exactly like RAG, but in practice the difference is that the agent is now selecting tools and constructing parameters, and that changes your evals. You now care about whether the agent chose the right tool, whether it passed the right filters and scopes, whether summaries remain faithful to source material, and whether the system avoids unintended state changes or accidental use of mutation-capable tools. Observability also gets stricter: you want traces of tool calls, correlation IDs tying model calls to tool calls and outputs, and latency and cost per step, not just per request. Permissions should still be tightly scoped read-only credentials, with least privilege, tenant boundaries, environment separation, and time-bounded access for longer-running work.
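At Level 1, the guard against mutation-capable tools should live in code, not in the prompt. Here's a minimal sketch of a tool-layer allowlist, assuming the tool names from our example agent (the `guard_tool_call` helper is my own illustration, not an SDK function):

```python
# Hypothetical Level 1 guard: the tool layer, not the prompt, decides
# which tools are callable. Tool names match the example agent's tools.
READ_ONLY_TOOLS = {
    "CloudOps__read_metrics",
    "CloudOps__search_logs",
    "CloudOps__list_recent_deployments",
}


def guard_tool_call(tool_name: str) -> None:
    """Reject any tool that is not on the read-only allowlist."""
    if tool_name not in READ_ONLY_TOOLS:
        raise PermissionError(f"Tool {tool_name!r} is not allowed at Level 1")
```

The point is that a denial here is deterministic: no amount of prompt injection changes the contents of the set.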

Level 2: Execute reversible writes

This is where agents start making changes, but only changes with naturally limited downstream impact or with explicit reversibility. Things like creating tickets, posting Slack updates, tagging resources, updating dashboards, opening PRs but not merging them, writing to append-only logs, or initiating workflows with explicit rollback.

This is a very different category from read-only operations, even if the business impact still looks modest. At this level, it is no longer enough to know that the agent produced a sensible plan. You need to know that it took the correct action for the scenario, that it behaves idempotently, that it does not duplicate actions, that it respects its boundaries, and that it behaves predictably when tools or actions fail. Observability needs to become mutation-aware: audit logs of intended changes, rationale for the change, diffs where applicable, and visibility into retries. Permissions should be limited write access, scoped to reversible endpoints, with request tokens or similar controls enforced by the tool layer. And it's where we start caring about explainability: By looking at the observability data we collect, can we explain why the agent took the action that it took?
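The request-token idea can be sketched in a few lines. This in-memory version (the `create_ticket` helper is hypothetical; a real implementation would persist tokens) only illustrates the contract: replaying the same token must return the prior result instead of duplicating the action:

```python
# Sketch of request-token idempotency for reversible writes (Level 2).
_seen_tokens: dict[str, dict] = {}


def create_ticket(request_token: str, title: str) -> dict:
    """Create a ticket at most once per request token."""
    if request_token in _seen_tokens:
        # Replay (e.g. an agent retry): return the prior result, no duplicate.
        return _seen_tokens[request_token]
    ticket = {"id": f"TICKET-{len(_seen_tokens) + 1}", "title": title}
    _seen_tokens[request_token] = ticket
    return ticket
```

This is exactly the behavior the evals at this level should exercise: call the same action twice with the same token and assert that only one ticket exists.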

Level 3: Execute bounded writes

At this level the write actions can have a more meaningful (or severe I guess) impact, and the controls have to become more explicit. The agent can now perform higher-impact actions, but only inside hard constraints: scoped permissions, quotas, explicit allowlists, deny-by-default rules, and human approvals for specific action classes. This includes applying configuration changes in bounded domains, scaling resources within limits, modifying policies within constraints, merging PRs with approvals, or initiating controlled migrations.

This is also the level where just giving an agent a role is no longer enough. A role is great for identity, but that alone is too coarse for permissions. The authorization question is whether the agent can perform this action, on this resource, in this environment, under these constraints. Evals need to cover policy compliance, quota limits, approval requirements, environment boundaries, and near-miss scenarios designed to tempt the agent into unsafe shortcuts, such as prompt injection, ambiguous intent, or partial context. Observability has to capture the full action path from intent to plan to approval to execution to outcome, with live signals for anomalies like spend spikes, permission denials, or repeated failures. Permissions at this level should be narrow, deny-by-default, and enforced with explicit escalation paths and platform-level quota controls.

Level 4: Unsupervised execution with continuous monitoring and rollback guarantees

Level 4 is a true “autonomous agent”. Note how far we are beyond a clever prompt and a few tools: we are talking about a controlled operating environment. At this level, the agent executes end-to-end workflows without human approval in most cases, but only under continuous monitoring, automated containment, and rapid rollback or compensating-action guarantees. It's closed-loop execution for well-defined workflows such as auto-remediation, routine cost optimization, automated incident response under strict playbooks, and continuous governance tasks.

The eval bar needs to rise accordingly: we need high-confidence coverage of critical paths, ongoing monitoring, failure injection, adversarial inputs, and economic guardrails. Observability needs complete replayability of decisions, tool calls, retrieved sources, and state transitions, plus anomaly detection and workflow SLOs such as success rate, time-to-completion, and rollback time. Permissions at this level should have containment as part of the design: session isolation, sandboxing, automated circuit breakers that can revoke credentials and halt execution, and explicit rollback or compensating actions for every mutation-capable tool.

Using the Autonomy Ladder

The higher you go in this ladder, the less acceptable it becomes to manage permissions with just prompts. Level 0 can tolerate relatively loose boundaries because nothing executes. Level 1 already needs real tool selection controls and parameter discipline. By Levels 2 through 4, permissions have to move from “the agent has a role” to “each tool call is authorized”, with strong authentication, Role-Based Access Control (RBAC) for coarse access, Fine-Grained Access Control (FGAC) for tool and parameter constraints, dual-identity delegation, and short-lived credentials. In other words, the more autonomy you want, the more authorization has to become infrastructure instead of text in a prompt.
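One way to keep those promotion criteria honest is to encode them. Here's a minimal sketch (the names `AutonomyLevel` and `can_promote` are my own, not from any SDK) of the ladder and its gate:

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    SUGGEST = 0            # propose only, no tool calls
    READ_ONLY = 1          # read-only tool calls
    REVERSIBLE_WRITES = 2  # reversible or append-only mutations
    BOUNDED_WRITES = 3     # higher-impact writes inside hard constraints
    UNSUPERVISED = 4       # closed-loop execution with rollback guarantees


def can_promote(current: AutonomyLevel, *, correctness: bool,
                traceability: bool, containment: bool) -> bool:
    """Promotion requires proving all three criteria at the current level."""
    if current is AutonomyLevel.UNSUPERVISED:
        return False  # there is nothing above Level 4
    return correctness and traceability and containment
```

A system that can't prove containment stays where it is; that's the whole point of making the ladder explicit.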

Our Example Agent Explained

As I mentioned, we're going to build an agent that can answer to this query:

Investigate elevated 5XXs for tenant Acme in prod. Check metrics, search logs, correlate with recent deployments, summarize the likely cause, and if it is safe restart the affected instance.

The first part, checking metrics and searching logs, is level 1 of autonomy: read-only actions. Once we add the ability to restart instances, we take it all the way to level 4: unsupervised execution.

Our agent's tools look roughly like this:

  • CloudOps__read_metrics

  • CloudOps__search_logs

  • CloudOps__list_recent_deployments

  • CloudOps__create_incident_ticket

  • CloudOps__restart_instance

Our solution is going to involve the following AWS services:

  • Amazon Bedrock AgentCore (Runtime, Gateway, Identity, and Policy)

  • Amazon Verified Permissions

  • Amazon Cognito (or another OIDC provider)

I'm not going to explain those services from scratch, but I will do my best to give you context on what we're doing with these services. If you want to dive deeper, I recommend this list of resources collected by my friend and AWS Developer Advocate Elizabeth Fuentes Leone.

AgentCore operations agent with layered authorization

This is what should happen when the user asks the agent to investigate the Acme outage:

  1. The user authenticates with Cognito or another OIDC provider and gets a JWT.

  2. The client invokes the agent running in AgentCore Runtime with that bearer token.

  3. Runtime validates the JWT using the configured discovery URL and client/audience constraints.

  4. Runtime starts the Strands-based operations agent.

  5. Runtime passes the validated Authorization header into the agent only because we explicitly allowlist it.

  6. The Strands agent talks to AgentCore Gateway over MCP.

  7. Gateway validates inbound auth again for the tool boundary.

  8. AgentCore Policy evaluates the actual tool call: principal tags from JWT claims plus context.input from tool arguments.

  9. If Gateway allows the tool call, the target API or Lambda behind the tool performs its own Verified Permissions check against real business entities like tenant, team ownership, and environment.

  10. The downstream tool executes only if both boundaries agree.

  11. Runtime, Gateway, Policy, and the tool layer emit enough telemetry that you can reconstruct the path later.

The important thing to notice is that there are three different permission decisions here:

  • Who may invoke the runtime or gateway at all

  • Whether the agent is allowed to invoke a certain tool

  • Whether the action executed by the tool is authorized on the resource

Let's explore and implement each of those separately.

Who Can Invoke the Agent

AgentCore Runtime supports IAM SigV4 or JWT bearer tokens as inbound auth modes. A given runtime version uses one or the other, not both. Gateway works the same: you pick an authorizer type when you create it, and if you use CUSTOM_JWT, you configure a discovery URL plus claim restrictions such as allowedClients, allowedAudience, allowedScopes, and optional custom claim validations. For heterogeneous clients, JWT is usually the cleaner choice.

AgentCore also supports resource-based policies on Runtime, Runtime endpoints, and Gateway resources. These policies control who can invoke and manage those resources, and they are evaluated together with identity-based policies. If you are invoking a Runtime endpoint, AWS evaluates both the Runtime resource and the endpoint resource. If either resource denies or lacks the required allow, the request fails.

Important note: If your Runtime or Gateway is configured for OAuth/JWT auth, the resource-based policy must use a wildcard principal. You should not list end users in the Principal field the way you would with SigV4. AWS validates the OAuth token before policy evaluation, and then the resource policy can further restrict access with condition keys like aws:SourceVpc or aws:SourceVpce. Also, the Resource field must contain the exact ARN of the attached resource; * is not valid.

Here is a Runtime resource policy for an OAuth-authenticated Runtime restricted to a specific VPC:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOAuthFromVPC",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "bedrock-agentcore:InvokeAgentRuntime",
      "Resource": "arn:aws:bedrock-agentcore:us-west-2:111122223333:runtime/AGENTID",
      "Condition": {
        "StringEquals": {
          "aws:SourceVpc": "vpc-1a2b3c4d"
        }
      }
    }
  ]
}

The same pattern applies to Gateway resource policies, except the action should be bedrock-agentcore:InvokeGateway and the resource should be the Gateway ARN. If you are using Runtime endpoints, remember that both the Runtime and the endpoint resource policies are in play.
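Assuming the same account and VPC as the Runtime example, a Gateway resource policy would look roughly like this (the Gateway ID is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOAuthFromVPC",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "bedrock-agentcore:InvokeGateway",
      "Resource": "arn:aws:bedrock-agentcore:us-west-2:111122223333:gateway/GATEWAYID",
      "Condition": {
        "StringEquals": {
          "aws:SourceVpc": "vpc-1a2b3c4d"
        }
      }
    }
  ]
}
```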

Here is a minimal client-side invoke path:

import json
import os
import urllib.parse
import uuid

import requests

REGION = "us-east-1"
RUNTIME_ARN = os.environ["AGENT_RUNTIME_ARN"]
TOKEN = os.environ["TOKEN"]

escaped_runtime_arn = urllib.parse.quote(RUNTIME_ARN, safe="")
url = (
    f"https://bedrock-agentcore.{REGION}.amazonaws.com/"
    f"runtimes/{escaped_runtime_arn}/invocations?qualifier=DEFAULT"
)

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
    "X-Amzn-Bedrock-AgentCore-Runtime-Session-Id": str(uuid.uuid4()),
}

payload = {
    "prompt": (
        "Investigate elevated 5XXs for tenant acme in prod. "
        "Check metrics, search logs, correlate with recent deployments, "
        "summarize the likely cause, and if it is safe restart the affected instance."
    )
}

response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.status_code)
print(response.text)

Notice how I'm including the bearer token? Without it, we wouldn't be able to call the agent. Or rather, we won't be able to, once we complete the next step.

Strands Agent in AgentCore Runtime

The first Runtime configuration step is to enable JWT auth and explicitly allowlist the Authorization header so Runtime can pass it into agent code. Runtime can read allowlisted request headers via context.request_headers. Since Runtime already validated the token, your code does not need to revalidate the signature just to pass the token onward for policy context.

Here is the Runtime configuration you should use:

agentcore configure --entrypoint src/main.py \
  --name ops-agent \
  --execution-role arn:aws:iam::123456789012:role/OpsAgentRuntimeRole \
  --requirements-file requirements.txt \
  --authorizer-config "{\"customJWTAuthorizer\":{\"discoveryUrl\":\"$DISCOVERY_URL\",\"allowedClients\":[\"$CLIENT_ID\"]}}" \
  --request-header-allowlist "Authorization"

agentcore launch

And here is the code for the Strands agent that the above command deploys (this would be src/main.py):

import os

from bedrock_agentcore.runtime import BedrockAgentCoreApp
from mcp.client.streamable_http import streamablehttp_client
from strands import Agent
from strands.models import BedrockModel
from strands.tools.mcp.mcp_client import MCPClient

app = BedrockAgentCoreApp()


@app.entrypoint
def agent_invocation(payload, context):
    prompt = payload["prompt"]
    auth_header = context.request_headers.get("Authorization")

    if not auth_header:
        return {"error": "Missing Authorization header"}

    gateway_url = os.environ["AGENTCORE_GATEWAY_URL"]

    mcp_client = MCPClient(
        lambda: streamablehttp_client(
            gateway_url,
            headers={"Authorization": auth_header},
        )
    )

    with mcp_client:
        tools = mcp_client.list_tools_sync()

        agent = Agent(
            model=BedrockModel(
                inference_profile_id=os.environ["BEDROCK_INFERENCE_PROFILE_ID"],
                temperature=0.0,
                streaming=True,
            ),
            tools=tools,
            system_prompt=(
                "You are a production operations agent. "
                "Start with read-only diagnostics. "
                "Never claim a tool succeeded unless you received a successful tool result. "
                "If a tool call is denied, explain that the platform blocked it."
            ),
        )

        response = agent(prompt)
        return {"result": response.message}


if __name__ == "__main__":
    app.run()

Note that the agent only gets tools through MCP, and MCP only comes from Gateway. If our agent could bypass Gateway and call SDK clients directly, our Gateway policies would be just theater.

Give the Agent its Own Identity

When you execute actions yourself, it's fine to use your own user identity. With an agent like this, however, it's not you who decides and executes the actions; your action is calling the agent. The agent's actions are its own, so it needs its own identity. This matters for both permissions management and traceability: we need to define what the agent is allowed to do (which may not be the same set of actions you're allowed to do), and we need to know that it was the agent that executed them.

We'll forward the user JWT from Runtime to AgentCore Gateway because Gateway Policy uses OAuth claims as tags on AgentCore::OAuthUser. But the actual downstream execution path won't use the user's identity. Gateway uses its service role or configured outbound auth to reach targets. AgentCore Identity also supports both user-delegated and machine-to-machine auth patterns for outbound access, and Runtime automatically creates a workload identity. For first-party AgentCore services, Runtime can exchange the inbound JWT for a Workload Access Token and deliver it into your agent execution.

And things look like this:

  • User JWT: who is asking, what tenant/role/scope context applies

  • Runtime / Gateway / workload identities: what infrastructure identity actually executes

  • Policy engines: where the decision is made

AgentCore Gateway as the Tool Surface

Our agent only has a handful of tools. However, once you have dozens or hundreds of tools, the failure mode of agents shifts away from “the model is weak” and toward drift: auth changes, schema changes, rate limits, ownership confusion, and tool access becoming unmanageable.

The solution is a platform layer with tool governance, version control, owners, SLAs, contract tests, and FGAC at the gateway or policy layer. AgentCore Gateway is built to do exactly that. It gives you one MCP endpoint for tool discovery and invocation, explicit inbound and outbound auth, and a unified place to attach Policy. Gateway supports IAM-based outbound auth using the gateway service role (ideal for AWS targets), and OAuth or API keys (ideal for external targets). You can also use “no auth”, but please don't.

Policy Engine and AgentCore Gateway

First you create a Policy Engine:

import boto3

control = boto3.client("bedrock-agentcore-control")

policy_engine = control.create_policy_engine(
    name="ops-policy-engine",
    description="FGAC for operations-agent tool calls"
)

print(policy_engine["policyEngineId"])
print(policy_engine["policyEngineArn"])

Then you create the Gateway with JWT inbound auth and attach the Policy Engine in ENFORCE mode:

import boto3

control = boto3.client("bedrock-agentcore-control")

gateway = control.create_gateway(
    name="ops-gateway",
    roleArn="arn:aws:iam::123456789012:role/OpsGatewayExecutionRole",
    protocolType="MCP",
    authorizerType="CUSTOM_JWT",
    authorizerConfiguration={
        "customJWTAuthorizer": {
            "discoveryUrl": "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_Example/.well-known/openid-configuration",
            "allowedClients": ["exampleclientid"],
        }
    },
    policyEngineConfiguration={
        "arn": policy_engine["policyEngineArn"],
        "mode": "ENFORCE",
    },
    exceptionLevel="DEBUG",
)

print(gateway["gatewayUrl"])

AgentCore Gateway Execution Role

If you use a custom Gateway execution role, that role needs the permissions required for AgentCore Policy to work:

  • bedrock-agentcore:AuthorizeAction

  • bedrock-agentcore:PartiallyAuthorizeActions

  • bedrock-agentcore:GetPolicyEngine

The trust policy for the execution role should look like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowBedrockAgentCoreAssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": "bedrock-agentcore.amazonaws.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "123456789012"
        },
        "ArnLike": {
          "aws:SourceArn": "arn:aws:bedrock-agentcore:us-east-1:123456789012:*"
        }
      }
    }
  ]
}

And the permission policy excerpt for Policy integration looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PolicyEngineConfiguration",
      "Effect": "Allow",
      "Action": [
        "bedrock-agentcore:GetPolicyEngine"
      ],
      "Resource": [
        "arn:aws:bedrock-agentcore:us-east-1:123456789012:policy-engine/<policy-engine-id>"
      ]
    },
    {
      "Sid": "PolicyEngineAuthorization",
      "Effect": "Allow",
      "Action": [
        "bedrock-agentcore:AuthorizeAction",
        "bedrock-agentcore:PartiallyAuthorizeActions"
      ],
      "Resource": [
        "arn:aws:bedrock-agentcore:us-east-1:123456789012:policy-engine/<policy-engine-id>",
        "arn:aws:bedrock-agentcore:us-east-1:123456789012:gateway/<gateway-id>"
      ]
    }
  ]
}

You still need the target-specific permissions too, because Gateway also has to invoke your downstream Lambda, API Gateway stage, OpenAPI target, Smithy target, or MCP server. Those vary by integration type, so I'll leave them out.

By the way, if you create the Gateway with the AgentCore starter toolkit, it may auto-create an execution role with wide bedrock-agentcore:* permissions. That's great for demos, but in production please tighten it back down to least privilege.

Tools in AgentCore Gateway

Gateway tool names use the format ${target_name}__${tool_name}. For example, our tools look like CloudOps__restart_instance. The target prefix is the string that becomes the AgentCore::Action in Policy. If you rename a tool, you are changing your authorization surface. Which is intentional.

Permissions at the Tool Level

AgentCore Policy (in preview as of this writing, March 6th, 2026) sits in front of AgentCore Gateway tool execution and evaluates Cedar policies against the actual incoming request. The Gateway builds a Cedar authorization request from two things:

  • the JWT token, which becomes the AgentCore::OAuthUser principal plus tags for claims

  • the MCP tool call, whose arguments become context.input

The resulting Cedar request has the following elements:

  • principal: AgentCore::OAuthUser::"sub"

  • action: AgentCore::Action::"Target__tool"

  • resource: AgentCore::Gateway::"gateway-arn"

  • context: { input: ...tool arguments... }
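To make the mapping concrete, here's an illustrative sketch (plain Python, not an AgentCore API) of how those four elements are assembled from the JWT claims and the MCP tool call:

```python
def build_cedar_request(jwt_claims: dict, target: str, tool: str,
                        arguments: dict, gateway_arn: str) -> dict:
    """Illustrative only: shows how claims and tool arguments map onto
    the Cedar principal/action/resource/context elements."""
    return {
        # The token's sub claim becomes the OAuthUser principal.
        "principal": f'AgentCore::OAuthUser::"{jwt_claims["sub"]}"',
        # The Gateway tool name (target__tool) becomes the action.
        "action": f'AgentCore::Action::"{target}__{tool}"',
        # The Gateway itself is the resource.
        "resource": f'AgentCore::Gateway::"{gateway_arn}"',
        # The raw tool arguments become context.input.
        "context": {"input": arguments},
    }
```

This is why parameter-level policies work at all: whatever arguments the model constructs end up inside context.input, where Cedar conditions can inspect them.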

A Policy for the Operations Agent

We want the following permissions:

  • readers with ops:read scope to use read-only tools

  • only SREs to even attempt restart

  • tenant match between caller claim and tool arguments

  • restart only in staging

  • explicit deny for non-staging restart calls

permit(
  principal is AgentCore::OAuthUser,
  action in [
    AgentCore::Action::"CloudOps__read_metrics",
    AgentCore::Action::"CloudOps__search_logs",
    AgentCore::Action::"CloudOps__list_recent_deployments",
    AgentCore::Action::"CloudOps__create_incident_ticket"
  ],
  resource == AgentCore::Gateway::"arn:aws:bedrock-agentcore:us-east-1:123456789012:gateway/ops-gateway"
)
when {
  principal.hasTag("scope") &&
  principal.getTag("scope") like "*ops:read*"
};

permit(
  principal is AgentCore::OAuthUser,
  action == AgentCore::Action::"CloudOps__restart_instance",
  resource == AgentCore::Gateway::"arn:aws:bedrock-agentcore:us-east-1:123456789012:gateway/ops-gateway"
)
when {
  principal.hasTag("role") &&
  principal.getTag("role") == "sre" &&
  principal.hasTag("tenant_id") &&
  principal.getTag("tenant_id") == context.input.tenantId &&
  context.input.environment == "staging"
};

forbid(
  principal is AgentCore::OAuthUser,
  action == AgentCore::Action::"CloudOps__restart_instance",
  resource == AgentCore::Gateway::"arn:aws:bedrock-agentcore:us-east-1:123456789012:gateway/ops-gateway"
)
unless {
  context.input.environment == "staging"
};

And here is how you add it to the Policy Engine:

import boto3

control = boto3.client("bedrock-agentcore-control")

policy_statement = r'''
permit(
  principal is AgentCore::OAuthUser,
  action in [
    AgentCore::Action::"CloudOps__read_metrics",
    AgentCore::Action::"CloudOps__search_logs",
    AgentCore::Action::"CloudOps__list_recent_deployments",
    AgentCore::Action::"CloudOps__create_incident_ticket"
  ],
  resource == AgentCore::Gateway::"arn:aws:bedrock-agentcore:us-east-1:123456789012:gateway/ops-gateway"
)
when {
  principal.hasTag("scope") &&
  principal.getTag("scope") like "*ops:read*"
};

permit(
  principal is AgentCore::OAuthUser,
  action == AgentCore::Action::"CloudOps__restart_instance",
  resource == AgentCore::Gateway::"arn:aws:bedrock-agentcore:us-east-1:123456789012:gateway/ops-gateway"
)
when {
  principal.hasTag("role") &&
  principal.getTag("role") == "sre" &&
  principal.hasTag("tenant_id") &&
  principal.getTag("tenant_id") == context.input.tenantId &&
  context.input.environment == "staging"
};

forbid(
  principal is AgentCore::OAuthUser,
  action == AgentCore::Action::"CloudOps__restart_instance",
  resource == AgentCore::Gateway::"arn:aws:bedrock-agentcore:us-east-1:123456789012:gateway/ops-gateway"
)
unless {
  context.input.environment == "staging"
};
'''

response = control.create_policy(
    policyEngineId="pe-1234567890abcdef",
    name="ops-tool-fgac",
    validationMode="FAIL_ON_ANY_FINDINGS",
    description="Tool-level and parameter-level authorization for the operations agent",
    definition={
        "cedar": {
            "statement": policy_statement
        }
    }
)

print(response["policyId"])

It's important to note that tools/list is treated as a meta action. When Gateway lists tools, it does not have the full input parameters for a specific invocation yet. So a tool may appear in the list if there exists any circumstance under which the user could call it. However, a tool showing up in tools/list does not guarantee that a later tools/call will be allowed. The real authorization decision happens on the invocation with full context.input.

Verified Permissions for AgentCore Gateway Tools

Gateway Policy is excellent at answering questions like:

  • does this caller have the ops:read scope

  • is this caller an sre

  • does the tenant claim match the requested tenant

  • is environment == "staging"

That is already a massive improvement over just giving the agent a role. It is Fine-Grained Access Control (FGAC) at the tool and parameter level. But Gateway Policy cannot answer questions that depend on application entities outside the AgentCore schema. It cannot reference your own Service, Team, Tenant, or EnvironmentPolicy entity types.

That means Gateway Policy cannot natively answer things like:

  • Does the caller’s team actually own payments-api?

  • Is payments-api even associated with tenant Acme?

  • Is this service currently under an additional freeze or maintenance rule?

  • Does this caller belong to the allowed break-glass group for this exact service?

Those are not tool-boundary questions anymore. They are business-resource questions, and we'll use Amazon Verified Permissions to answer them.

Model the application authorization explicitly

For the internal ops API behind CloudOps__restart_instance, I would model at least:

  • Namespace: Ops

  • Principal types: Ops::User, Ops::Role

  • Resource type: Ops::Service

  • Actions: Ops::Action::readMetrics, Ops::Action::searchLogs, Ops::Action::listRecentDeployments, Ops::Action::createIncidentTicket, Ops::Action::restartInstance

  • Service attributes: ownerTeam, tenantId

  • Context keys: environment, maybe maintenanceWindow, maybe requestSource
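In Cedar's JSON schema format, a minimal sketch of that model, covering only the restart action, could look like this (treat it as a starting point, not the full schema):

```json
{
  "Ops": {
    "entityTypes": {
      "User": {
        "memberOfTypes": ["Role"],
        "shape": {
          "type": "Record",
          "attributes": {
            "team": {"type": "String"},
            "tenantId": {"type": "String"}
          }
        }
      },
      "Role": {
        "shape": {"type": "Record", "attributes": {}}
      },
      "Service": {
        "shape": {
          "type": "Record",
          "attributes": {
            "ownerTeam": {"type": "String"},
            "tenantId": {"type": "String"}
          }
        }
      }
    },
    "actions": {
      "restartInstance": {
        "appliesTo": {
          "principalTypes": ["User"],
          "resourceTypes": ["Service"],
          "context": {
            "type": "Record",
            "attributes": {
              "environment": {"type": "String"}
            }
          }
        }
      }
    }
  }
}
```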

A business policy in Cedar

Here is the kind of policy I actually want for the restart path:

permit(
  principal in Ops::Role::"sre",
  action == Ops::Action::"restartInstance",
  resource
)
when {
  principal.team == resource.ownerTeam &&
  principal.tenantId == resource.tenantId &&
  context.environment == "staging"
};

We're considering more than just the role and Role-Based Access Control (RBAC). We're checking whether the caller is in the sre role, whether the caller's team owns the service, whether the caller's tenant matches the service's tenant, and whether the environment is staging.

Verified Permissions in APIs

Now that we have our Verified Permissions policy, we need to add it to whatever internal API we have behind the tool:

import boto3

vp = boto3.client("verifiedpermissions")


def authorize_restart(
    *,
    policy_store_id: str,
    user_sub: str,
    user_team: str,
    user_tenant_id: str,
    service_id: str,
    service_owner_team: str,
    service_tenant_id: str,
    environment: str,
) -> bool:
    response = vp.is_authorized(
        policyStoreId=policy_store_id,
        principal={
            "entityType": "Ops::User",
            "entityId": user_sub,
        },
        action={
            "actionType": "Ops::Action",
            "actionId": "restartInstance",
        },
        resource={
            "entityType": "Ops::Service",
            "entityId": service_id,
        },
        context={
            "contextMap": {
                "environment": {"string": environment},
            }
        },
        entities={
            "entityList": [
                {
                    "identifier": {
                        "entityType": "Ops::User",
                        "entityId": user_sub,
                    },
                    "attributes": {
                        "team": {"string": user_team},
                        "tenantId": {"string": user_tenant_id},
                    },
                    "parents": [
                        {
                            "entityType": "Ops::Role",
                            "entityId": "sre",
                        }
                    ],
                },
                {
                    "identifier": {
                        "entityType": "Ops::Service",
                        "entityId": service_id,
                    },
                    "attributes": {
                        "ownerTeam": {"string": service_owner_team},
                        "tenantId": {"string": service_tenant_id},
                    },
                },
            ]
        },
    )

    return response["decision"] == "ALLOW"

And inside the tool implementation:

def restart_instance_tool(
    *,
    user_sub: str,
    user_team: str,
    user_tenant_id: str,
    service_id: str,
    environment: str,
) -> dict:
    # Resolve business data from your control plane or service catalog.
    service = lookup_service(service_id)

    allowed = authorize_restart(
        policy_store_id="ps-1234567890abcdef",
        user_sub=user_sub,
        user_team=user_team,
        user_tenant_id=user_tenant_id,
        service_id=service_id,
        service_owner_team=service["ownerTeam"],
        service_tenant_id=service["tenantId"],
        environment=environment,
    )

    if not allowed:
        raise PermissionError(
            f"Restart denied for service={service_id}, env={environment}"
        )

    # Only now do the restart.
    return do_restart(service_id=service_id, environment=environment)

If Gateway Policy allowed the call because it was staging and the JWT tenant matched the request tenant, but the service is actually owned by another team, Verified Permissions still denies it.
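You can unit-test the semantics of this layered check without calling AWS at all. Here is a minimal, hypothetical Python model of the two layers (these helper names are mine, not an AWS API); it mirrors the Cedar policy above: sre role, team ownership, tenant match, staging only.

```python
# Local model of the two authorization layers, for testing policy
# semantics without AWS. Layer 1 is the structural gateway check;
# layer 2 mirrors the Cedar business policy.

def gateway_allows(request: dict, jwt_claims: dict) -> bool:
    """Structural check: staging only, JWT tenant must match the request."""
    return (
        request["environment"] == "staging"
        and jwt_claims["tenantId"] == request["tenantId"]
    )


def avp_allows(principal: dict, resource: dict, context: dict) -> bool:
    """Business check: role, ownership, and tenancy of the actual resource."""
    return (
        "sre" in principal["roles"]
        and principal["team"] == resource["ownerTeam"]
        and principal["tenantId"] == resource["tenantId"]
        and context["environment"] == "staging"
    )


def restart_allowed(
    request: dict, jwt_claims: dict, principal: dict, resource: dict
) -> bool:
    # Both layers must allow the call; either one can veto it.
    context = {"environment": request["environment"]}
    return gateway_allows(request, jwt_claims) and avp_allows(
        principal, resource, context
    )
```

With this sketch you can assert, for example, that a prod request is denied even with a valid tenant, and that a staging request is denied when the caller's team doesn't own the service.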

What This Architecture Prevents

Let’s go back to the original request. The user asks the operations agent to investigate Acme 5XXs in production and “restart the affected instance if it is safe”. Let's consider a few scenarios for our agent.

Scenario 1: Read-only diagnostics

The agent calls:

  • CloudOps__read_metrics(tenantId="acme", environment="prod")

  • CloudOps__search_logs(tenantId="acme", environment="prod")

  • CloudOps__list_recent_deployments(serviceId="payments-api", environment="prod")

Gateway Policy allows these because the caller has ops:read. No mutation happened. This was the easy path, though: level 1.

Scenario 2: The agent proposes a restart in production

Say the agent detects a problem, and it decides the fastest path to mitigate it is to send this action:

{
  "serviceId": "payments-api",
  "tenantId": "acme",
  "environment": "prod"
}

to the tool CloudOps__restart_instance. Watch closely: it says "environment": "prod".

Gateway Policy denies it because the environment is not staging, and the tool call does not execute. Exactly as we intended, done automatically and reliably, with no human intervention, and not relying on a prompt like “Plz don't break prod bro, we'll both get fired”.

Scenario 3: The agent proposes a restart in staging

Now let's say we have the same request as above, but for staging:

{
  "serviceId": "payments-api",
  "tenantId": "acme",
  "environment": "staging"
}

Gateway Policy may allow this, because the request is structurally within the allowed boundary. But the internal ops API still calls Amazon Verified Permissions. If the caller’s team does not own payments-api, or if payments-api is not actually an Acme service, AVP denies the action.

It's not enough that the model can call the tool with the right parameters. We also check characteristics of the actual target resource, like which team owns it and which tenant it belongs to.

Wrapping Up

The model can decide which tool might help. But it's the underlying platform and security configuration that decides whether that tool call is allowed, with those parameters, for those resources, from that user.

Use AgentCore Runtime as the runtime platform. Use resource-based policies there, and if you use endpoints, remember the Runtime and endpoint policies are both part of authorization. Use Strands, but don't give the agent direct access to resources through the AWS SDK. Use AgentCore Gateway as the only tool surface, with explicit inbound and outbound auth. Use AgentCore Policy to evaluate the actual tool call against JWT-derived principal tags and real input parameters before execution. Then use Amazon Verified Permissions inside the tool implementation when the decision depends on actual business resources and state.

And use evals and observability. But I'll leave that for another article.
