Gatekeeper Architecture

Gatekeeper is a unified AI API gateway: a single endpoint for 290+ models across every major provider, with RBAC, budget controls, semantic routing, and full audit logging built in. This document covers the internal design.

Routing Engine

Every request enters the routing engine, which resolves the target provider in three steps:

Incoming Request
                       │
                       ▼
┌─────────────────────────────────────────────────┐
│ 1. Authentication                               │
│    Validate virtual key → resolve user + team   │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────┐
│ 2. Policy Check                                 │
│    RBAC rules · budget check · model allowlist  │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────┐
│ 3. Provider Dispatch                            │
│    model → provider adapter → upstream API      │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
Response (tokens logged, budget decremented)

The routing engine adds under 2 ms of latency at p99. Upstream provider latency dominates, so Gatekeeper does not meaningfully slow down requests.
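The three steps can be sketched as a small pipeline. This is an illustrative, self-contained sketch: the function names, the in-memory key store, and the stubbed dispatch are assumptions for demonstration, not Gatekeeper internals.

```python
# Minimal sketch of the authenticate → policy → dispatch pipeline.
# All names and the in-memory store are illustrative assumptions.
KEYS = {"vk-123": {"team": "search", "models": {"gpt-4o"}}}

def authenticate(virtual_key: str) -> dict:
    """Step 1: validate the virtual key and resolve its identity."""
    try:
        return KEYS[virtual_key]
    except KeyError:
        raise PermissionError("unknown virtual key")

def check_policy(identity: dict, model: str) -> None:
    """Step 2: RBAC / model-allowlist check (budget check elided)."""
    if model not in identity["models"]:
        raise PermissionError(f"model {model!r} not on team allowlist")

def dispatch(model: str, prompt: str) -> dict:
    """Step 3: in reality this calls a provider adapter; stubbed here."""
    return {"model": model, "content": f"echo: {prompt}"}

def route(virtual_key: str, model: str, prompt: str) -> dict:
    identity = authenticate(virtual_key)
    check_policy(identity, model)
    return dispatch(model, prompt)
```

A policy violation surfaces before any upstream call is made, which is what keeps the added latency small: steps 1 and 2 are local lookups.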

Model aliases are supported: you can route gpt-4 to claude-3-5-sonnet-20241022 transparently, allowing teams to migrate providers without changing application code.
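Alias resolution amounts to a lookup before dispatch. The mapping structure and function below are hypothetical, not Gatekeeper's actual config format:

```python
# Hypothetical alias table: teams keep requesting "gpt-4",
# Gatekeeper dispatches to Claude instead.
MODEL_ALIASES = {
    "gpt-4": "claude-3-5-sonnet-20241022",
}

def resolve_model(requested: str) -> str:
    """Return the upstream model an alias maps to (identity if no alias)."""
    return MODEL_ALIASES.get(requested, requested)

print(resolve_model("gpt-4"))        # claude-3-5-sonnet-20241022
print(resolve_model("gpt-4o-mini"))  # gpt-4o-mini (no alias: passes through)
```

Because resolution happens inside the gateway, switching providers is a one-line config change rather than an application deploy.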

RBAC Model

Gatekeeper uses a three-level RBAC hierarchy: Organization → Team → Virtual Key. Permissions flow downward and can only be restricted, never expanded.

Organization

Master keys, billing, global provider credentials, audit log access.

Team

Team-scoped budgets, model allowlists, member management.

Virtual Key

Per-application or per-user limits. Keys can be rotated without touching provider credentials.

RBAC is enforced on every request, not just at key creation time. If a team's model allowlist changes, all existing virtual keys under that team immediately reflect the change — no key rotation needed.

Budget System

Budgets are enforced at three levels: organization, team, and individual virtual key. Budgets can be configured as hard limits (request blocked) or soft limits (alert only).

Token counting happens in real time using the provider's reported usage. Gatekeeper does not estimate tokens — it uses the exact figure from the provider response and decrements the budget atomically in the database.

Budget check pseudocode

1. Pre-request: check budget_remaining > 0
   (reject if hard limit, warn if soft limit)

2. Post-response: decrement budget_remaining
   by provider-reported prompt + completion tokens × price

3. Alert webhooks fire when remaining < 20%

Budget periods are configurable: daily, weekly, monthly, or rolling 30-day window. Unused budget does not carry over; limits reset at period boundaries.
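The three pseudocode steps above can be sketched in Python. The class and function names, the USD accounting, and the printed alerts are illustrative assumptions; only the flow (pre-check, provider-reported decrement, 20% alert) comes from the doc:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    limit_usd: float
    spent_usd: float = 0.0
    hard: bool = True  # hard limit blocks; soft limit only warns

    @property
    def remaining(self) -> float:
        return self.limit_usd - self.spent_usd

def pre_request_check(b: Budget) -> bool:
    """Step 1: reject (hard limit) or warn (soft limit) when exhausted."""
    if b.remaining <= 0:
        if b.hard:
            return False  # request blocked
        print("warning: soft budget limit exceeded")
    return True

def post_response_charge(b: Budget, prompt_toks: int,
                         completion_toks: int, price_per_token: float) -> None:
    """Step 2: decrement by provider-reported usage (no estimation)."""
    b.spent_usd += (prompt_toks + completion_toks) * price_per_token
    # Step 3: alert webhook below 20% remaining (webhook call elided).
    if b.remaining < 0.2 * b.limit_usd:
        print(f"alert: {b.remaining:.2f} USD remaining")
```

Note the check is pre-request but the charge is post-response, so a single in-flight request can overshoot the limit slightly; the next request is then blocked.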

Provider Adapters

Gatekeeper includes adapters for 15+ provider APIs and supports 290+ models. The adapter layer normalizes provider-specific request/response formats to a single OpenAI-compatible schema.

Provider      Models                 Endpoint        Streaming
OpenAI        GPT-4o, o1, o3         Native          Yes
Anthropic     Claude 3.5, 3 family   Native          Yes
Google        Gemini 2.0, 1.5        Vertex AI       Yes
AWS Bedrock   Titan, Nova, Llama     AWS SDK         Yes
Mistral       Mistral Large, Nemo    Native          Yes
Groq          Llama 3.3, Mixtral     OpenAI-compat   Yes
Cohere        Command R+             Native          Yes
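An adapter's job is a two-way translation: unified OpenAI-style request in, provider-native request out, and the reverse for responses. The interface and class names below are assumptions for illustration; the Anthropic field mapping (top-level `system`, `input_tokens`/`output_tokens` usage keys) reflects that provider's public Messages API:

```python
from abc import ABC, abstractmethod

class ProviderAdapter(ABC):
    """Illustrative adapter interface, not Gatekeeper's real classes."""

    @abstractmethod
    def to_provider_request(self, unified: dict) -> dict: ...

    @abstractmethod
    def to_unified_response(self, raw: dict) -> dict: ...

class AnthropicAdapter(ProviderAdapter):
    def to_provider_request(self, unified: dict) -> dict:
        # Anthropic's Messages API takes the system prompt as a
        # top-level field, not as a message-list entry.
        msgs = unified["messages"]
        system = [m["content"] for m in msgs if m["role"] == "system"]
        return {
            "model": unified["model"],
            "system": system[0] if system else None,
            "messages": [m for m in msgs if m["role"] != "system"],
            "max_tokens": unified.get("max_tokens", 1024),
        }

    def to_unified_response(self, raw: dict) -> dict:
        # Map Anthropic content blocks and usage keys onto the
        # OpenAI-compatible response shape.
        return {
            "choices": [{"message": {
                "role": "assistant",
                "content": raw["content"][0]["text"],
            }}],
            "usage": {
                "prompt_tokens": raw["usage"]["input_tokens"],
                "completion_tokens": raw["usage"]["output_tokens"],
            },
        }
```

Normalizing usage keys is what lets the budget system decrement from one schema regardless of which of the 15+ providers served the request.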

Audit Logging

Every request through Gatekeeper is logged with: timestamp, virtual key ID, model, provider, prompt tokens, completion tokens, cost, latency, and HTTP status. Logs are immutable and stored for 90 days (configurable).

Logs are queryable via the dashboard and via the /v1/usage API. You can filter by date range, virtual key, model, or provider to produce cost reports.

For self-hosted deployments, logs can be forwarded to S3, ClickHouse, or any HTTP endpoint. The Gatekeeper dashboard reads from the local database; your data warehouse integration is independent.

Semantic Caching

Gatekeeper optionally caches responses using semantic similarity. Requests with similar embeddings (above a configurable threshold) return the cached response without hitting the provider, reducing both latency and cost.

The cache is backed by Redis with configurable TTL. Cache hit rates of 20–40% are common for customer support and document Q&A workloads. Cache is keyed per-team — teams do not share cached responses.
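A semantic cache hit is a nearest-neighbor search over stored embeddings against a similarity threshold. A minimal sketch, assuming cosine similarity and an in-memory list in place of Redis (the real system stores embeddings per-team with a TTL):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cache_lookup(query_emb, cache, threshold=0.95):
    """Return the cached response most similar to the query if its
    similarity clears the threshold; otherwise None (cache miss)."""
    best, best_sim = None, threshold
    for emb, response in cache:
        sim = cosine(query_emb, emb)
        if sim >= best_sim:
            best, best_sim = response, sim
    return best
```

The threshold trades hit rate against the risk of returning a response to a subtly different question; support and document Q&A tolerate a looser threshold than, say, code generation.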