Deployment Options
Choose between a fully managed SaaS and a self-hosted deployment based on your requirements.

Kubiya Hosted (SaaS)
Fully managed infrastructure with zero operational overhead. All components, scaling, security, and maintenance are handled by Kubiya. Benefits: Zero ops • Auto-scaling • Enterprise security • 99.9% uptime SLA • Continuous updates

Self-Hosted
Stack: Docker Compose (local) • Kubernetes/Helm (production) • Temporal Server • PostgreSQL • Redis
Benefits: Full control • On-premises • Custom security • Air-gapped support

High-Level Architecture
Distributed, multi-tenant architecture for AI agent orchestration. Core Components:
- API Layer: FastAPI-based REST and WebSocket interface for multi-tenant access
- Orchestration: Temporal workflows for reliable, distributed task execution
- Data Persistence: PostgreSQL with row-level security for tenant isolation
- Caching & State: Redis for performance optimization and real-time communication
- Security: JWT/API key authentication with OPA-based policy enforcement
- LLM Gateway: LiteLLM for unified access to multiple AI model providers
- Knowledge Layer: Context Graph for organizational memory and cross-agent learning
Control Plane API
The Control Plane API provides a comprehensive REST and WebSocket interface for managing the entire lifecycle of AI agents, teams, and workflows. Architectural Approach:
- Multi-tenant by design: All data access enforced at the database level through row-level security
- Async-first architecture: Built on FastAPI with async/await for high-concurrency workloads
- Strongly typed contracts: Pydantic models ensure request/response validation
- Real-time communication: WebSocket endpoints for streaming execution logs and live status updates
- Structured observability: Correlation IDs and structured logging for distributed tracing
Key Capabilities:
- Agent & Team Management: Create, configure, and orchestrate AI agents and multi-agent teams
- Execution Control: Submit tasks, monitor progress, stream outputs, and handle approvals
- Resource Management: Task queue registration, environment configuration, model registry
- Governance: Policy definition, secret management, integration credentials
- Analytics & Observability: Usage metrics, cost tracking, execution history
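The async-first, strongly typed approach above can be sketched as a minimal handler (dataclasses stand in for the real Pydantic request/response models; every name here is illustrative, not the actual API):

```python
import asyncio
import uuid
from dataclasses import dataclass

# Illustrative request/response contracts; the real API uses Pydantic models.
@dataclass
class CreateAgentRequest:
    name: str
    model: str

@dataclass
class CreateAgentResponse:
    agent_id: str
    correlation_id: str

async def create_agent(req: CreateAgentRequest, org_id: str) -> CreateAgentResponse:
    """Every request is scoped to the caller's organization and tagged with a
    correlation ID that follows it through logs and downstream services."""
    if not req.name:
        raise ValueError("agent name is required")
    correlation_id = str(uuid.uuid4())
    # ... persist the agent under org_id; row-level security enforces the tenant boundary ...
    return CreateAgentResponse(agent_id=str(uuid.uuid4()),
                               correlation_id=correlation_id)

resp = asyncio.run(create_agent(CreateAgentRequest("triage-bot", "openai/gpt-4o"), "org-1"))
```

The handler is an ordinary coroutine, so the framework can interleave many concurrent requests on one event loop.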
Authentication & Authorization
Multi-tenant authentication with organization-level isolation ensures secure access across all Control Plane operations. Authentication Methods:
- JWT Bearer Tokens: User-scoped authentication with role-based access control
- API Keys: Service-to-service authentication for automation and integrations
- Token Caching: Redis-backed cache layer minimizes external validation latency
- Organization Scoping: Every request automatically scoped to authenticated organization
- Database-Level Isolation: PostgreSQL row-level security (RLS) enforces tenant boundaries
- Role-Based Access Control (RBAC): Fine-grained permissions for resource access
- Automatic Token Rotation: Built-in support for credential lifecycle management
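The token-caching flow described above can be sketched roughly as follows (a plain dict with TTLs stands in for Redis; the claim shape and the 300-second TTL are illustrative):

```python
import time

class TokenCache:
    """Redis-backed in the real system; a dict with expiry timestamps stands in here."""
    def __init__(self):
        self._store = {}

    def get(self, token):
        entry = self._store.get(token)
        if entry and entry[1] > time.time():
            return entry[0]          # cache hit, still fresh
        return None

    def set(self, token, claims, ttl=300):
        self._store[token] = (claims, time.time() + ttl)

def authenticate(token, cache, validate_externally):
    """Only call the external validator (e.g. the Kubiya API) on a cache miss."""
    claims = cache.get(token)
    if claims is None:
        claims = validate_externally(token)
        cache.set(token, claims)
    return claims
```

Repeated requests with the same token skip the external round-trip entirely until the TTL expires.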
Policy Enforcement
OPA-based access control and governance. Policy Types: Tool usage control • Resource access restrictions • Execution constraints (limits/quotas) • Approval workflows

Architecture Decision: Why OPA?
- Decoupled policy logic: Policies defined separately from application code enable non-engineers to manage governance
- Declarative approach: Rego policy language allows expressing complex rules as data
- Centralized enforcement: Single policy engine across all agents, teams, and workflows
- Audit transparency: Policy decisions are logged with full context for compliance reporting
- Dynamic evaluation: Policies evaluated at execution time based on current state and context
- Tool Usage Control: Restrict which Skills and MCP servers agents can invoke
- Resource Access: Limit file paths, network destinations, or cloud resources
- Execution Constraints: Enforce quotas on compute time, token usage, or cost
- Approval Workflows: Require human-in-the-loop approval for sensitive operations
Enforcement Points:
- Pre-execution: Validate permissions before workflow submission
- During execution: Monitor and enforce limits in real-time
- Post-execution: Generate audit logs for compliance and security review
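A rough sketch of the pre-execution check, assuming policies are evaluated through OPA's standard REST data API, which wraps the evaluated document under a `result` key (all field names below are illustrative, not the actual policy schema):

```python
def build_policy_input(org_id, agent_id, action, resource, usage):
    """Assemble the input document sent to the policy engine before a
    workflow is submitted. Field names are illustrative."""
    return {"input": {
        "organization_id": org_id,
        "agent_id": agent_id,
        "action": action,          # e.g. "invoke_tool"
        "resource": resource,      # e.g. a Skill or MCP server name
        "usage": usage,            # current quota counters (tokens, cost)
    }}

def parse_decision(opa_response):
    """OPA's REST data API returns {"result": <evaluated document>}."""
    result = opa_response.get("result") or {}
    return bool(result.get("allow", False)), result.get("reasons", [])
```

An absent or empty `result` (no matching policy) parses as a deny, which keeps the default posture fail-closed.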
Worker Registration & Lifecycle
Workers register with the Control Plane to receive and execute tasks, establishing a durable connection to the distributed orchestration system. Registration Flow:
- Initial Connection: Worker authenticates with API key and queue identifier
- Configuration Receipt: Control Plane provides connection credentials for Temporal and LiteLLM
- Queue Binding: Worker connects to organization-specific Temporal task queue
- Polling Activation: Worker begins polling for available workflow tasks
Health Monitoring:
- Lightweight Heartbeats: Frequent status updates with minimal overhead (task count, current state)
- Full Heartbeats: Periodic detailed reports including system metrics and execution logs
- Redis-Based State: Fast lookups for worker status without database queries
- Automatic Staleness Detection: Workers that stop reporting are marked inactive
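The staleness-detection logic can be sketched in a few lines (a plain dict stands in for the Redis-based state, and the 30-second threshold is illustrative, not the actual default):

```python
import time

HEARTBEAT_TTL = 30.0  # seconds without a heartbeat before a worker is considered stale

def record_heartbeat(state, worker_id, task_count, now=None):
    """Lightweight heartbeat: just a task count and a timestamp."""
    state[worker_id] = {
        "task_count": task_count,
        "last_seen": time.time() if now is None else now,
    }

def active_workers(state, now=None):
    """Workers that have reported within the TTL; everything else is marked inactive."""
    now = time.time() if now is None else now
    return sorted(w for w, hb in state.items()
                  if now - hb["last_seen"] <= HEARTBEAT_TTL)
```

With Redis, the same effect falls out of setting each heartbeat key with an expiry (`SET ... EX`), so stale entries simply disappear.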
Task Queue Architecture
Hierarchical task routing and worker management. For a beginner-friendly introduction to workers and task queues, see Workers Overview and Task Queues.
- Tenant Isolation: Organization-level separation ensures complete data and execution isolation
- Environment Segregation: Separate queues for production, staging, and development prevent cross-contamination
- Capacity Management: Task queues provide named capacity pools that can be scaled independently
- Flexible Routing: Multiple routing strategies support different use cases:
  - AUTO: Temporal load-balances across available workers (default)
  - SPECIFIC_QUEUE: Direct targeting for specialized hardware (GPU, high-memory)
  - ENVIRONMENT: Restrict execution to specific deployment contexts
Queue names follow the pattern `{organization_id}.{queue_uuid}`, ensuring global uniqueness while maintaining logical hierarchy. This naming scheme enables:
- Fast organization-level filtering and metrics
- Queue-level access control policies
- Clear audit trails in execution logs
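The naming scheme reduces to two small helpers (sketch; it assumes organization IDs contain no dots):

```python
import uuid

def queue_name(organization_id: str, queue_id: uuid.UUID) -> str:
    """Build the globally unique queue name: {organization_id}.{queue_uuid}."""
    return f"{organization_id}.{queue_id}"

def parse_queue_name(name: str):
    """Split on the first dot, recovering the organization for filtering and
    access-control checks."""
    org_id, sep, queue_id = name.partition(".")
    if not sep or not org_id or not queue_id:
        raise ValueError(f"malformed queue name: {name!r}")
    return org_id, queue_id
```

Because the organization ID is the prefix, organization-level metrics can filter queues with a plain string prefix match.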
Execution Flow
End-to-end orchestration from user request through task completion, with support for long-running workflows and human-in-the-loop interactions. Workflow Types:
- Agent Execution: Single-agent task processing with tool invocation and multi-turn conversations
- Team Execution: Multi-agent coordination with inter-agent communication and shared context
- Scheduled Jobs: Cron-based recurring workflows for automation and monitoring
- WebSocket Streaming: Live execution logs and status updates
- Query Endpoints: Synchronous status checks without interrupting execution
- Signal Handling: External events can influence running workflows (user input, cancellation)
Temporal Workflow Architecture
Architecture Decision: Why Temporal? Temporal addresses several critical challenges in distributed agent orchestration:
- Durability: Workflow state persists across failures, restarts, and infrastructure changes
- Deterministic Replay: Workflows can be reconstructed and replayed for debugging or recovery
- Built-in Retry Logic: Configurable retry policies with exponential backoff for transient failures
- Long-Running Workflows: Support for workflows that run hours or days (HITL, scheduled jobs)
- Distributed Task Queues: Native load balancing across heterogeneous worker pools
- Versioning: Safe deployment of workflow changes without disrupting in-flight executions
- Agent Workflows: Linear execution with dynamic tool invocation and state transitions
- Team Workflows: Parallel and sequential agent coordination with shared context
- Scheduled Workflows: Wrapper workflows that handle cron triggers and error handling
- LLM Inference: Calls to model providers through LiteLLM gateway
- Database Operations: Session storage, execution status updates
- Analytics: Token usage, cost tracking, turn metrics
- External Integration: Calls to Context Graph, policy enforcer, storage services
Execution Guarantees:
- At-least-once execution for all activities (idempotency required)
- Exactly-once workflow decisions (deterministic replay)
- Strong consistency for workflow state
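At-least-once delivery means an activity may fire twice for the same workflow step, so side effects must be idempotent. One common pattern is to key results on an idempotency key, sketched here outside of any Temporal SDK (the store would be a database table in practice):

```python
def run_activity(store, idempotency_key, activity, *args):
    """Execute an activity at most once per idempotency key: a retry of an
    already-completed step returns the recorded result instead of re-running
    the side effect."""
    if idempotency_key in store:
        return store[idempotency_key]
    result = activity(*args)
    store[idempotency_key] = result   # must be written atomically with the side effect
    return result
```

A natural key is `{workflow_id}:{step_name}`, since deterministic replay guarantees the workflow issues the same step identifiers on every run.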
LLM Gateway
Architecture Decision: Why LiteLLM? LiteLLM solves the challenge of managing multiple LLM providers with incompatible APIs:
- Unified Interface: All providers exposed through OpenAI-compatible API
- Zero Code Changes: Switch providers by changing the model identifier alone
- Automatic Retries: Built-in retry logic with exponential backoff for rate limits
- Response Caching: Reduce costs and latency for repeated queries
- Cost Tracking: Unified token counting and cost calculation across providers
- Fallback Chains: Automatic failover to backup models on provider outages
Supported Providers:
- Commercial: OpenAI, Anthropic, Google, Microsoft Azure OpenAI, Mistral, Cohere
- Open Source: Replicate, Together AI, Anyscale, Perplexity
- Self-Hosted: vLLM, Ollama, LM Studio, custom OpenAI-compatible endpoints
Model identifiers follow the pattern `{provider}/{model-name}`:
- `openai/gpt-4o` → OpenAI GPT-4o
- `anthropic/claude-sonnet-4` → Anthropic Claude Sonnet
- `gemini/gemini-2.0-flash` → Google Gemini
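Routing on these identifiers reduces to splitting on the first slash; a minimal parser:

```python
def parse_model_id(model_id: str):
    """Split a '{provider}/{model-name}' identifier into its routing parts."""
    provider, sep, model = model_id.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected '{{provider}}/{{model-name}}', got {model_id!r}")
    return provider, model
```

Splitting only on the first slash keeps model names that themselves contain slashes intact.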
Context Graph Integration
The Context Graph provides persistent organizational knowledge that enhances agent intelligence across all executions. Architectural Role: The Context Graph serves as a central knowledge repository that agents query for:
- Historical Context: Learn from past executions, decisions, and outcomes
- Organizational Knowledge: Access company-specific information, terminology, and processes
- Cross-Agent Learning: Benefit from insights gathered by other agents and teams
- Entity Relationships: Understand connections between projects, resources, and team members
Integration Pattern:
- Authenticated Access: All queries inherit organization context from the authenticated session
- Query Translation: Natural language questions translated to structured graph queries
- Result Enrichment: Graph results augmented with execution-time context
- Caching: Frequently accessed knowledge cached in Redis for performance
- Onboarding: New agents immediately benefit from accumulated organizational knowledge
- Consistency: All agents reference the same authoritative sources
- Evolution: Knowledge base grows automatically from agent interactions and human feedback
- Retrieval-Augmented Generation (RAG): Context-aware responses based on organizational data
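The RAG step can be sketched as simple prompt assembly: facts retrieved from the graph are prepended to the user's question before it reaches the model (the prompt format here is illustrative):

```python
def augment_prompt(question: str, graph_facts: list) -> str:
    """Retrieval-augmented generation: prepend retrieved organizational facts
    to the question so the model answers from authoritative context."""
    context = "\n".join(f"- {fact}" for fact in graph_facts)
    return (
        "Context from the organization's knowledge graph:\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
```

In the real flow the facts come from the translated graph query (with Redis caching in front); here they are passed in directly.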
Cognitive Memory
Persistent session storage enables agents to maintain context across multi-turn conversations and long-running workflows. Memory Architecture: Cognitive memory is stored in PostgreSQL using a structured schema that captures:
- Conversation History: Complete message sequences with timestamps and metadata
- Agent State: Current context variables, active tools, and execution phase
- Session Metadata: User identifiers, environment context, policy constraints
- Turn Analytics: Token usage, latency, and cost per interaction
This persistence allows sessions to:
- Survive Restarts: Sessions persist across worker failures and deployments
- Enable Resumption: Long-running tasks can pause and resume without losing context
- Support HITL: Human-in-the-loop workflows maintain full conversation state while waiting
- Facilitate Debugging: Complete execution history available for post-mortem analysis
Within a session, agents can:
- Reference earlier parts of the conversation
- Build incrementally on previous decisions
- Clarify ambiguous requests through back-and-forth dialogue
- Maintain task context across multiple human inputs
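A rough sketch of the session structures such a schema captures (dataclasses stand in for the actual PostgreSQL tables; field names are illustrative):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str                     # "user", "assistant", or "tool"
    content: str
    prompt_tokens: int = 0
    completion_tokens: int = 0
    timestamp: float = field(default_factory=time.time)

@dataclass
class Session:
    session_id: str
    organization_id: str          # tenant scope, enforced by RLS in practice
    turns: list = field(default_factory=list)

    def append(self, turn: Turn) -> None:
        self.turns.append(turn)

    def total_tokens(self) -> int:
        """Turn analytics roll up into per-session usage."""
        return sum(t.prompt_tokens + t.completion_tokens for t in self.turns)
```

Because every turn is persisted as it happens, a workflow that pauses for human input can reload the full `turns` list on resumption.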
Storage System
Architecture Decision: Multi-Cloud Strategy
Supporting multiple cloud providers solves several challenges:
- Deployment Flexibility: Customers can use their existing cloud infrastructure
- Compliance Requirements: Data residency regulations may dictate storage location
- Cost Optimization: Organizations can leverage existing cloud commitments
- Vendor Independence: Avoid lock-in to single cloud provider
- Hybrid Deployments: Self-hosted control planes can use on-premises storage
Supported Backends:
- AWS S3: Industry-standard object storage with broad regional availability
- Google Cloud Storage: Deep integration with GCP services
- Azure Blob Storage: Native support for Azure-based deployments
- S3-Compatible: MinIO, Wasabi, Backblaze B2, or any S3-compatible service
Storing file metadata in PostgreSQL enables:
- Fast searches without scanning object storage
- Tag-based filtering and organization
- Relationship tracking (which executions used which files)
- Versioning and lifecycle management
File Operations:
- Presigned URLs for direct uploads (bypassing Control Plane for large files)
- Batch operations for efficiency
- Streaming downloads for large files
- Tag-based search and filtering
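Because tags live in the PostgreSQL metadata rather than on the objects themselves, tag-based search is an ordinary filter over metadata rows; a sketch (the row shape is illustrative):

```python
def find_artifacts(metadata_rows, required_tags):
    """Return artifact metadata rows carrying all of the required tags.
    No object-storage scan is needed: only the metadata index is consulted."""
    required = set(required_tags)
    return [row for row in metadata_rows
            if required <= set(row.get("tags", []))]
```

In the real system this is a SQL query over the metadata table; the subset test corresponds to an array-containment predicate.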
Technology Stack & Dependencies
The Control Plane’s architecture is built on proven, production-grade technologies chosen for specific technical requirements.

Core Infrastructure
PostgreSQL - Multi-tenant relational database
- Why chosen: Native row-level security (RLS) for tenant isolation, ACID guarantees, rich query capabilities
- Role: Primary data store for agents, teams, executions, policies, and all configuration
- Key features: JSON columns for flexible schemas, full-text search, complex queries with joins
Redis - In-memory cache and messaging
- Why chosen: Microsecond latency for caching, pub/sub for real-time updates, simple data structures
- Role: Authentication token cache, worker heartbeats, WebSocket session state
- Key features: TTL-based expiration, atomic operations, pub/sub messaging
Temporal - Workflow orchestration engine
- Why chosen: Durable execution, built-in retry logic, distributed task queues, workflow versioning
- Role: Orchestrate all agent and team executions, manage long-running workflows, ensure reliable task distribution
- Key features: Deterministic replay, signal handling, query support, horizontal scalability
LiteLLM - LLM gateway
- Why chosen: Unified API across providers, automatic retries, cost tracking, response caching
- Role: Abstract model provider differences, enable multi-model support, manage API keys and quotas
- Key features: OpenAI-compatible interface, fallback chains, streaming support
External Services
Kubiya API - Authentication service
- Why integrated: Centralized user management, SSO integration, organization provisioning
- Role: Validate JWT tokens, provide organization context, manage API keys
- Integration pattern: External validation with Redis caching for performance
Open Policy Agent (OPA) - Policy engine
- Why chosen: Declarative policy language, decoupled from application code, industry standard for cloud-native authorization
- Role: Enforce access control, usage limits, and governance policies
- Integration pattern: Synchronous policy evaluation before and during execution
Context Graph - Organizational knowledge service
- Why integrated: Organizational memory, cross-agent learning, RAG support
- Role: Provide historical context and domain knowledge to enhance agent intelligence
- Integration pattern: On-demand queries with caching
Cloud Object Storage - Artifact and file storage
- Why multi-cloud: Deployment flexibility, compliance requirements, cost optimization
- Role: Store agent artifacts, execution outputs, uploaded files
- Integration pattern: Provider-agnostic abstraction with metadata in PostgreSQL
Architectural Principles
- Separation of Concerns: Each technology addresses specific requirements (data persistence vs. caching vs. orchestration)
- Best-of-Breed: Choose proven technologies rather than monolithic solutions
- Horizontal Scalability: All components can scale independently
- Cloud-Native: Designed for containerized, distributed deployments
- Operational Maturity: Established technologies with strong community support and tooling
Data Model Overview
The Control Plane’s data model supports multi-tenancy, hierarchical resource organization, and complete execution traceability. Multi-Tenancy:
- Row-Level Security (RLS): PostgreSQL policies automatically filter queries by organization
- Isolation Guarantees: No query can access data from another organization
- Shared Infrastructure: All organizations use same database instance with logical separation
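The RLS pattern can be sketched as the Postgres DDL applied to each tenant table, generated here from Python (the column name `organization_id` and the session setting `app.current_org` are illustrative, not the actual schema):

```python
def tenant_isolation_ddl(table: str) -> str:
    """Generate row-level-security DDL for a tenant-scoped table: once the
    policy is in place, every query on the table is transparently filtered
    to the organization set on the current connection."""
    return (
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;\n"
        f"CREATE POLICY {table}_tenant_isolation ON {table}\n"
        f"  USING (organization_id = current_setting('app.current_org')::uuid);"
    )

print(tenant_isolation_ddl("agents"))
```

The application sets `app.current_org` (e.g. via `SET LOCAL`) at the start of each request's transaction, so no application-level `WHERE` clause can be forgotten.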
Execution Traceability:
- Execution Records: Status, timestamps, initiator, target agent/team
- Session History: Full conversation logs with token usage and costs
- Analytics Data: Aggregated metrics for dashboards and reporting
- Relationship Tracking: Links between executions, jobs, and schedules
Schema Migrations:
- Version Control: All schema changes tracked in migration history
- Zero-Downtime Deployments: Migrations designed for rolling updates
- Rollback Support: Backward-compatible changes enable safe rollbacks
Monitoring & Analytics
Comprehensive observability enables operational insight, troubleshooting, and cost optimization. Structured Logging: JSON-formatted logs with:
- Correlation IDs: Track requests across distributed services
- Contextual Fields: Organization ID, execution ID, user ID automatically included
- Log Levels: Debug, info, warning, error for filtering
- Structured Data: Machine-readable format for log aggregation and analysis
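A minimal sketch of such a JSON formatter using the standard `logging` module (the field names emitted here are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line; contextual fields such as the
    correlation ID arrive on the record via logging's extra= mechanism."""
    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
            "organization_id": getattr(record, "organization_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("control-plane")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Context fields passed via extra= become attributes on the LogRecord.
logger.info("execution started",
            extra={"correlation_id": "req-42", "organization_id": "org-1"})
```

Because every line is machine-readable JSON, log aggregators can index on `correlation_id` to stitch together a request's path across services.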
Execution Metrics:
- Token Usage: Prompt tokens, completion tokens, total tokens per execution
- Cost Tracking: Per-execution costs based on model pricing
- Latency Metrics: Time spent in different workflow phases
- State Transitions: Track how executions move through pending, running, waiting, completed, failed states
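Per-execution cost tracking reduces to multiplying token counts by per-model prices; a sketch (the prices are illustrative placeholders, not real provider pricing):

```python
# Per-million-token prices in USD. Values are illustrative only.
PRICING = {
    "openai/gpt-4o": {"prompt": 2.50, "completion": 10.00},
    "anthropic/claude-sonnet-4": {"prompt": 3.00, "completion": 15.00},
}

def execution_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Compute the dollar cost of one execution from its token counts."""
    p = PRICING[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000
```

Summing this per turn is what feeds the immediate per-execution cost visibility described above.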
System Health:
- Service Status: Component-level health checks (database, Redis, Temporal connectivity)
- Worker Presence: Real-time view of active workers per queue
- Queue Depth: Pending task counts for capacity planning
- Error Rates: Failed execution percentages and error categorization
- Turn-by-turn analytics captured during execution
- Immediate visibility into token usage and costs
- WebSocket streaming for live dashboard updates
- Aggregated metrics in PostgreSQL for dashboards
- Time-series analysis for trend identification
- Exportable data for external BI tools
- Prometheus: Metrics export for alerting and monitoring
- OpenTelemetry: Distributed tracing for request path analysis
- Custom Webhooks: Real-time event notifications for external systems