Gatekeeper Architecture

Gatekeeper is a unified AI API gateway: a single endpoint for 290+ models across every major provider, with RBAC, budget controls, semantic routing, and full audit logging built in. This document covers the internal design.

Routing Engine

Every request enters the routing engine, which resolves the target provider in three steps:

Incoming Request
                       │
                       ▼
┌─────────────────────────────────────────────────┐
│ 1. Authentication                               │
│    Validate virtual key → resolve user + team   │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────┐
│ 2. Policy Check                                 │
│    RBAC rules · budget check · model allowlist  │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────┐
│ 3. Provider Dispatch                            │
│    model → provider adapter → upstream API      │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
Response (tokens logged, budget decremented)

The routing engine adds under 2 ms of latency at p99. Upstream provider latency dominates, so Gatekeeper does not meaningfully slow down requests.
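The three steps can be sketched as a small pipeline. This is an illustrative, self-contained sketch: the function names, the in-memory key store, and the stubbed dispatch are assumptions for demonstration, not Gatekeeper internals.

```python
# Minimal sketch of the authenticate → policy → dispatch pipeline.
# All names and the in-memory store are illustrative assumptions.
KEYS = {"vk-123": {"team": "search", "models": {"gpt-4o"}}}

def authenticate(virtual_key: str) -> dict:
    """Step 1: validate the virtual key and resolve its identity."""
    try:
        return KEYS[virtual_key]
    except KeyError:
        raise PermissionError("unknown virtual key")

def check_policy(identity: dict, model: str) -> None:
    """Step 2: RBAC / model-allowlist check (budget check elided)."""
    if model not in identity["models"]:
        raise PermissionError(f"model {model!r} not on team allowlist")

def dispatch(model: str, prompt: str) -> dict:
    """Step 3: in reality this calls a provider adapter; stubbed here."""
    return {"model": model, "content": f"echo: {prompt}"}

def route(virtual_key: str, model: str, prompt: str) -> dict:
    identity = authenticate(virtual_key)
    check_policy(identity, model)
    return dispatch(model, prompt)
```

A policy violation surfaces before any upstream call is made, which is what keeps the added latency small: steps 1 and 2 are local lookups.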

Model aliases are supported: you can route gpt-4 to claude-3-5-sonnet-20241022 transparently, allowing teams to migrate providers without changing application code.
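Alias resolution amounts to a lookup before dispatch. The mapping structure and function below are hypothetical, not Gatekeeper's actual config format:

```python
# Hypothetical alias table: teams keep requesting "gpt-4",
# Gatekeeper dispatches to Claude instead.
MODEL_ALIASES = {
    "gpt-4": "claude-3-5-sonnet-20241022",
}

def resolve_model(requested: str) -> str:
    """Return the upstream model an alias maps to (identity if no alias)."""
    return MODEL_ALIASES.get(requested, requested)

print(resolve_model("gpt-4"))        # claude-3-5-sonnet-20241022
print(resolve_model("gpt-4o-mini"))  # gpt-4o-mini (no alias: passes through)
```

Because resolution happens inside the gateway, switching providers is a one-line config change rather than an application deploy.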

RBAC Model

Gatekeeper uses a three-level RBAC hierarchy: Organization → Team → Virtual Key. Permissions flow downward and can only be restricted, never expanded.

Organization

Master keys, billing, global provider credentials, audit log access.

Team

Team-scoped budgets, model allowlists, member management.

Virtual Key

Per-application or per-user limits. Keys can be rotated without touching provider credentials.

RBAC is enforced on every request, not just at key creation time. If a team's model allowlist changes, all existing virtual keys under that team immediately reflect the change — no key rotation needed.

Budget System

Budgets are enforced at three levels: organization, team, and individual virtual key. Budgets can be configured as hard limits (request blocked) or soft limits (alert only).

Token counting happens in real time using the provider's reported usage. Gatekeeper does not estimate tokens — it uses the exact figure from the provider response and decrements the budget atomically in the database.

Budget check pseudocode

1. Pre-request: check budget_remaining > 0
   (reject if hard limit, warn if soft limit)

2. Post-response: decrement budget_remaining
   by provider-reported prompt + completion tokens × price

3. Alert webhooks fire when remaining < 20%

Budget periods are configurable: daily, weekly, monthly, or rolling 30-day window. Unused budget does not carry over; limits reset at period boundaries.
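The three pseudocode steps above can be sketched in Python. The class and function names, the USD accounting, and the printed alerts are illustrative assumptions; only the flow (pre-check, provider-reported decrement, 20% alert) comes from the doc:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    limit_usd: float
    spent_usd: float = 0.0
    hard: bool = True  # hard limit blocks; soft limit only warns

    @property
    def remaining(self) -> float:
        return self.limit_usd - self.spent_usd

def pre_request_check(b: Budget) -> bool:
    """Step 1: reject (hard limit) or warn (soft limit) when exhausted."""
    if b.remaining <= 0:
        if b.hard:
            return False  # request blocked
        print("warning: soft budget limit exceeded")
    return True

def post_response_charge(b: Budget, prompt_toks: int,
                         completion_toks: int, price_per_token: float) -> None:
    """Step 2: decrement by provider-reported usage (no estimation)."""
    b.spent_usd += (prompt_toks + completion_toks) * price_per_token
    # Step 3: alert webhook below 20% remaining (webhook call elided).
    if b.remaining < 0.2 * b.limit_usd:
        print(f"alert: {b.remaining:.2f} USD remaining")
```

Note the check is pre-request but the charge is post-response, so a single in-flight request can overshoot the limit slightly; the next request is then blocked.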

Provider Adapters

Gatekeeper includes adapters for 15+ provider APIs and supports 290+ models. The adapter layer normalizes provider-specific request/response formats to a single OpenAI-compatible schema.

Provider      Models                 Endpoint        Streaming
OpenAI        GPT-4o, o1, o3         Native          Yes
Anthropic     Claude 3.5, 3 family   Native          Yes
Google        Gemini 2.0, 1.5        Vertex AI       Yes
AWS Bedrock   Titan, Nova, Llama     AWS SDK         Yes
Mistral       Mistral Large, Nemo    Native          Yes
Groq          Llama 3.3, Mixtral     OpenAI-compat   Yes
Cohere        Command R+             Native          Yes
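An adapter's job is a two-way translation: unified OpenAI-style request in, provider-native request out, and the reverse for responses. The interface and class names below are assumptions for illustration; the Anthropic field mapping (top-level `system`, `input_tokens`/`output_tokens` usage keys) reflects that provider's public Messages API:

```python
from abc import ABC, abstractmethod

class ProviderAdapter(ABC):
    """Illustrative adapter interface, not Gatekeeper's real classes."""

    @abstractmethod
    def to_provider_request(self, unified: dict) -> dict: ...

    @abstractmethod
    def to_unified_response(self, raw: dict) -> dict: ...

class AnthropicAdapter(ProviderAdapter):
    def to_provider_request(self, unified: dict) -> dict:
        # Anthropic's Messages API takes the system prompt as a
        # top-level field, not as a message-list entry.
        msgs = unified["messages"]
        system = [m["content"] for m in msgs if m["role"] == "system"]
        return {
            "model": unified["model"],
            "system": system[0] if system else None,
            "messages": [m for m in msgs if m["role"] != "system"],
            "max_tokens": unified.get("max_tokens", 1024),
        }

    def to_unified_response(self, raw: dict) -> dict:
        # Map Anthropic content blocks and usage keys onto the
        # OpenAI-compatible response shape.
        return {
            "choices": [{"message": {
                "role": "assistant",
                "content": raw["content"][0]["text"],
            }}],
            "usage": {
                "prompt_tokens": raw["usage"]["input_tokens"],
                "completion_tokens": raw["usage"]["output_tokens"],
            },
        }
```

Normalizing usage keys is what lets the budget system decrement from one schema regardless of which of the 15+ providers served the request.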

Audit Logging

Every request through Gatekeeper is logged with: timestamp, virtual key ID, model, provider, prompt tokens, completion tokens, cost, latency, and HTTP status. Logs are immutable and stored for 90 days (configurable).

Logs are queryable via the dashboard and via the /v1/usage API. You can filter by date range, virtual key, model, or provider to produce cost reports.

For self-hosted deployments, logs can be forwarded to S3, ClickHouse, or any HTTP endpoint. The Gatekeeper dashboard reads from the local database; your data warehouse integration is independent.

Semantic Caching

Gatekeeper optionally caches responses using semantic similarity. Requests with similar embeddings (above a configurable threshold) return the cached response without hitting the provider, reducing both latency and cost.

The cache is backed by Redis with configurable TTL. Cache hit rates of 20–40% are common for customer support and document Q&A workloads. Cache is keyed per-team — teams do not share cached responses.
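A semantic cache hit is a nearest-neighbor search over stored embeddings against a similarity threshold. A minimal sketch, assuming cosine similarity and an in-memory list in place of Redis (the real system stores embeddings per-team with a TTL):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cache_lookup(query_emb, cache, threshold=0.95):
    """Return the cached response most similar to the query if its
    similarity clears the threshold; otherwise None (cache miss)."""
    best, best_sim = None, threshold
    for emb, response in cache:
        sim = cosine(query_emb, emb)
        if sim >= best_sim:
            best, best_sim = response, sim
    return best
```

The threshold trades hit rate against the risk of returning a response to a subtly different question; support and document Q&A tolerate a looser threshold than, say, code generation.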