Deployment Options
Choose between a fully managed SaaS and a self-hosted deployment based on your requirements.

Kubiya Hosted (SaaS)
Fully managed infrastructure with zero operational overhead. All components, scaling, security, and maintenance are handled by Kubiya. Benefits: Zero ops • Auto-scaling • Enterprise security • 99.9% uptime SLA • Continuous updates

Self-Hosted
Stack: Docker Compose (local) • Kubernetes/Helm (production) • Temporal Server • PostgreSQL • Redis
Benefits: Full control • On-premises • Custom security • Air-gapped support

High-Level Architecture
Distributed, multi-tenant architecture for AI agent orchestration. Core Components:
- API Layer: FastAPI-based REST and WebSocket interface for multi-tenant access
- Orchestration: Temporal workflows for reliable, distributed task execution
- Data Persistence: PostgreSQL with row-level security for tenant isolation
- Caching & State: Redis for performance optimization and real-time communication
- Security: JWT/API key authentication with OPA-based policy enforcement
- LLM Gateway: LiteLLM for unified access to multiple AI model providers
- Knowledge Layer: Context Graph for organizational memory and cross-agent learning
Control Plane API
The Control Plane API provides a comprehensive REST and WebSocket interface for managing the entire lifecycle of AI agents, teams, and workflows. Architectural Approach:
- Multi-tenant by design: All data access enforced at the database level through row-level security
- Async-first architecture: Built on FastAPI with async/await for high-concurrency workloads
- Strongly typed contracts: Pydantic models ensure request/response validation
- Real-time communication: WebSocket endpoints for streaming execution logs and live status updates
- Structured observability: Correlation IDs and structured logging for distributed tracing
Key Capabilities:
- Agent & Team Management: Create, configure, and orchestrate AI agents and multi-agent teams
- Execution Control: Submit tasks, monitor progress, stream outputs, and handle approvals
- Resource Management: Task queue registration, environment configuration, model registry
- Governance: Policy definition, secret management, integration credentials
- Analytics & Observability: Usage metrics, cost tracking, execution history
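The async-first, strongly typed approach above can be sketched as a minimal handler (dataclasses stand in for the real Pydantic request/response models; every name here is illustrative, not the actual API):

```python
import asyncio
import uuid
from dataclasses import dataclass

# Illustrative request/response contracts; the real API uses Pydantic models.
@dataclass
class CreateAgentRequest:
    name: str
    model: str

@dataclass
class CreateAgentResponse:
    agent_id: str
    correlation_id: str

async def create_agent(req: CreateAgentRequest, org_id: str) -> CreateAgentResponse:
    """Every request is scoped to the caller's organization and tagged with a
    correlation ID that follows it through logs and downstream services."""
    if not req.name:
        raise ValueError("agent name is required")
    correlation_id = str(uuid.uuid4())
    # ... persist the agent under org_id; row-level security enforces the tenant boundary ...
    return CreateAgentResponse(agent_id=str(uuid.uuid4()),
                               correlation_id=correlation_id)

resp = asyncio.run(create_agent(CreateAgentRequest("triage-bot", "openai/gpt-4o"), "org-1"))
```

The handler is an ordinary coroutine, so the framework can interleave many concurrent requests on one event loop.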
Authentication & Authorization
Multi-tenant authentication with organization-level isolation ensures secure access across all Control Plane operations. Authentication Methods:
- JWT Bearer Tokens: User-scoped authentication with role-based access control
- API Keys: Service-to-service authentication for automation and integrations
- Token Caching: Redis-backed cache layer minimizes external validation latency
- Organization Scoping: Every request automatically scoped to authenticated organization
- Database-Level Isolation: PostgreSQL row-level security (RLS) enforces tenant boundaries
- Role-Based Access Control (RBAC): Fine-grained permissions for resource access
- Automatic Token Rotation: Built-in support for credential lifecycle management
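The token-caching flow described above can be sketched roughly as follows (a plain dict with TTLs stands in for Redis; the claim shape and the 300-second TTL are illustrative):

```python
import time

class TokenCache:
    """Redis-backed in the real system; a dict with expiry timestamps stands in here."""
    def __init__(self):
        self._store = {}

    def get(self, token):
        entry = self._store.get(token)
        if entry and entry[1] > time.time():
            return entry[0]          # cache hit, still fresh
        return None

    def set(self, token, claims, ttl=300):
        self._store[token] = (claims, time.time() + ttl)

def authenticate(token, cache, validate_externally):
    """Only call the external validator (e.g. the Kubiya API) on a cache miss."""
    claims = cache.get(token)
    if claims is None:
        claims = validate_externally(token)
        cache.set(token, claims)
    return claims
```

Repeated requests with the same token skip the external round-trip entirely until the TTL expires.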
Policy Enforcement
OPA-based access control and governance. Policy Types: Tool usage control • Resource access restrictions • Execution constraints (limits/quotas) • Approval workflows

Architecture Decision: Why OPA?
- Decoupled policy logic: Policies defined separately from application code enable non-engineers to manage governance
- Declarative approach: Rego policy language allows expressing complex rules as data
- Centralized enforcement: Single policy engine across all agents, teams, and workflows
- Audit transparency: Policy decisions are logged with full context for compliance reporting
- Dynamic evaluation: Policies evaluated at execution time based on current state and context
- Tool Usage Control: Restrict which Skills and MCP servers agents can invoke
- Resource Access: Limit file paths, network destinations, or cloud resources
- Execution Constraints: Enforce quotas on compute time, token usage, or cost
- Approval Workflows: Require human-in-the-loop approval for sensitive operations
Enforcement Points:
- Pre-execution: Validate permissions before workflow submission
- During execution: Monitor and enforce limits in real-time
- Post-execution: Generate audit logs for compliance and security review
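A rough sketch of the pre-execution check, assuming policies are evaluated through OPA's standard REST data API, which wraps the evaluated document under a `result` key (all field names below are illustrative, not the actual policy schema):

```python
def build_policy_input(org_id, agent_id, action, resource, usage):
    """Assemble the input document sent to the policy engine before a
    workflow is submitted. Field names are illustrative."""
    return {"input": {
        "organization_id": org_id,
        "agent_id": agent_id,
        "action": action,          # e.g. "invoke_tool"
        "resource": resource,      # e.g. a Skill or MCP server name
        "usage": usage,            # current quota counters (tokens, cost)
    }}

def parse_decision(opa_response):
    """OPA's REST data API returns {"result": <evaluated document>}."""
    result = opa_response.get("result") or {}
    return bool(result.get("allow", False)), result.get("reasons", [])
```

An absent or empty `result` (no matching policy) parses as a deny, which keeps the default posture fail-closed.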
Worker Registration & Lifecycle
Workers register with the Control Plane to receive and execute tasks, establishing a durable connection to the distributed orchestration system. Registration Flow:
- Initial Connection: Worker authenticates with API key and queue identifier
- Configuration Receipt: Control Plane provides connection credentials for Temporal and LiteLLM
- Queue Binding: Worker connects to organization-specific Temporal task queue
- Polling Activation: Worker begins polling for available workflow tasks
Health Monitoring:
- Lightweight Heartbeats: Frequent status updates with minimal overhead (task count, current state)
- Full Heartbeats: Periodic detailed reports including system metrics and execution logs
- Redis-Based State: Fast lookups for worker status without database queries
- Automatic Staleness Detection: Workers that stop reporting are marked inactive
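The staleness-detection logic can be sketched in a few lines (a plain dict stands in for the Redis-based state, and the 30-second threshold is illustrative, not the actual default):

```python
import time

HEARTBEAT_TTL = 30.0  # seconds without a heartbeat before a worker is considered stale

def record_heartbeat(state, worker_id, task_count, now=None):
    """Lightweight heartbeat: just a task count and a timestamp."""
    state[worker_id] = {
        "task_count": task_count,
        "last_seen": time.time() if now is None else now,
    }

def active_workers(state, now=None):
    """Workers that have reported within the TTL; everything else is marked inactive."""
    now = time.time() if now is None else now
    return sorted(w for w, hb in state.items()
                  if now - hb["last_seen"] <= HEARTBEAT_TTL)
```

With Redis, the same effect falls out of setting each heartbeat key with an expiry (`SET ... EX`), so stale entries simply disappear.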
Task Queue Architecture
Hierarchical task routing and worker management. For a beginner-friendly introduction to workers and task queues, see Workers Overview and Task Queues.
- Tenant Isolation: Organization-level separation ensures complete data and execution isolation
- Environment Segregation: Separate queues for production, staging, and development prevent cross-contamination
- Capacity Management: Task queues provide named capacity pools that can be scaled independently
- Flexible Routing: Multiple routing strategies support different use cases:
  - AUTO: Temporal load-balances across available workers (default)
  - SPECIFIC_QUEUE: Direct targeting for specialized hardware (GPU, high-memory)
  - ENVIRONMENT: Restrict execution to specific deployment contexts
Queue names follow the pattern `{organization_id}.{queue_uuid}`, ensuring global uniqueness while maintaining logical hierarchy. This naming scheme enables:
- Fast organization-level filtering and metrics
- Queue-level access control policies
- Clear audit trails in execution logs
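The naming scheme reduces to two small helpers (sketch; it assumes organization IDs contain no dots):

```python
import uuid

def queue_name(organization_id: str, queue_id: uuid.UUID) -> str:
    """Build the globally unique queue name: {organization_id}.{queue_uuid}."""
    return f"{organization_id}.{queue_id}"

def parse_queue_name(name: str):
    """Split on the first dot, recovering the organization for filtering and
    access-control checks."""
    org_id, sep, queue_id = name.partition(".")
    if not sep or not org_id or not queue_id:
        raise ValueError(f"malformed queue name: {name!r}")
    return org_id, queue_id
```

Because the organization ID is the prefix, organization-level metrics can filter queues with a plain string prefix match.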
Execution Flow
End-to-end orchestration from user request through task completion, with support for long-running workflows and human-in-the-loop interactions. Workflow Types:
- Agent Execution: Single-agent task processing with tool invocation and multi-turn conversations
- Team Execution: Multi-agent coordination with inter-agent communication and shared context
- Scheduled Jobs: Cron-based recurring workflows for automation and monitoring
- WebSocket Streaming: Live execution logs and status updates
- Query Endpoints: Synchronous status checks without interrupting execution
- Signal Handling: External events can influence running workflows (user input, cancellation)
Temporal Workflow Architecture
Architecture Decision: Why Temporal? Temporal addresses several critical challenges in distributed agent orchestration:
- Durability: Workflow state persists across failures, restarts, and infrastructure changes
- Deterministic Replay: Workflows can be reconstructed and replayed for debugging or recovery
- Built-in Retry Logic: Configurable retry policies with exponential backoff for transient failures
- Long-Running Workflows: Support for workflows that run hours or days (HITL, scheduled jobs)
- Distributed Task Queues: Native load balancing across heterogeneous worker pools
- Versioning: Safe deployment of workflow changes without disrupting in-flight executions
- Agent Workflows: Linear execution with dynamic tool invocation and state transitions
- Team Workflows: Parallel and sequential agent coordination with shared context
- Scheduled Workflows: Wrapper workflows that handle cron triggers and error handling
- LLM Inference: Calls to model providers through LiteLLM gateway
- Database Operations: Session storage, execution status updates
- Analytics: Token usage, cost tracking, turn metrics
- External Integration: Calls to Context Graph, policy enforcer, storage services
Execution Guarantees:
- At-least-once execution for all activities (idempotency required)
- Exactly-once workflow decisions (deterministic replay)
- Strong consistency for workflow state
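At-least-once delivery means an activity may fire twice for the same workflow step, so side effects must be idempotent. One common pattern is to key results on an idempotency key, sketched here outside of any Temporal SDK (the store would be a database table in practice):

```python
def run_activity(store, idempotency_key, activity, *args):
    """Execute an activity at most once per idempotency key: a retry of an
    already-completed step returns the recorded result instead of re-running
    the side effect."""
    if idempotency_key in store:
        return store[idempotency_key]
    result = activity(*args)
    store[idempotency_key] = result   # must be written atomically with the side effect
    return result
```

A natural key is `{workflow_id}:{step_name}`, since deterministic replay guarantees the workflow issues the same step identifiers on every run.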
LLM Gateway
Architecture Decision: Why LiteLLM? LiteLLM solves the challenge of managing multiple LLM providers with incompatible APIs:
- Unified Interface: All providers exposed through OpenAI-compatible API
- Zero Code Changes: Switch providers by changing the model identifier alone
- Automatic Retries: Built-in retry logic with exponential backoff for rate limits
- Response Caching: Reduce costs and latency for repeated queries
- Cost Tracking: Unified token counting and cost calculation across providers
- Fallback Chains: Automatic failover to backup models on provider outages
Supported Providers:
- Commercial: OpenAI, Anthropic, Google, Microsoft Azure OpenAI, Mistral, Cohere
- Open Source: Replicate, Together AI, Anyscale, Perplexity
- Self-Hosted: vLLM, Ollama, LM Studio, custom OpenAI-compatible endpoints
Model identifiers follow the pattern `{provider}/{model-name}`:
- `openai/gpt-4o` → OpenAI GPT-4o
- `anthropic/claude-sonnet-4` → Anthropic Claude Sonnet
- `gemini/gemini-2.0-flash` → Google Gemini
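Routing on these identifiers reduces to splitting on the first slash; a minimal parser:

```python
def parse_model_id(model_id: str):
    """Split a '{provider}/{model-name}' identifier into its routing parts."""
    provider, sep, model = model_id.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected '{{provider}}/{{model-name}}', got {model_id!r}")
    return provider, model
```

Splitting only on the first slash keeps model names that themselves contain slashes intact.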
Context Graph Integration
The Context Graph provides persistent organizational knowledge that enhances agent intelligence across all executions. Architectural Role: The Context Graph serves as a central knowledge repository that agents query for:
- Historical Context: Learn from past executions, decisions, and outcomes
- Organizational Knowledge: Access company-specific information, terminology, and processes
- Cross-Agent Learning: Benefit from insights gathered by other agents and teams
- Entity Relationships: Understand connections between projects, resources, and team members
Integration Pattern:
- Authenticated Access: All queries inherit organization context from the authenticated session
- Query Translation: Natural language questions translated to structured graph queries
- Result Enrichment: Graph results augmented with execution-time context
- Caching: Frequently accessed knowledge cached in Redis for performance
- Onboarding: New agents immediately benefit from accumulated organizational knowledge
- Consistency: All agents reference the same authoritative sources
- Evolution: Knowledge base grows automatically from agent interactions and human feedback
- Retrieval-Augmented Generation (RAG): Context-aware responses based on organizational data
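The RAG step can be sketched as simple prompt assembly: facts retrieved from the graph are prepended to the user's question before it reaches the model (the prompt format here is illustrative):

```python
def augment_prompt(question: str, graph_facts: list) -> str:
    """Retrieval-augmented generation: prepend retrieved organizational facts
    to the question so the model answers from authoritative context."""
    context = "\n".join(f"- {fact}" for fact in graph_facts)
    return (
        "Context from the organization's knowledge graph:\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
```

In the real flow the facts come from the translated graph query (with Redis caching in front); here they are passed in directly.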
Cognitive Memory
Persistent session storage enables agents to maintain context across multi-turn conversations and long-running workflows. Memory Architecture: Cognitive memory is stored in PostgreSQL using a structured schema that captures:
- Conversation History: Complete message sequences with timestamps and metadata
- Agent State: Current context variables, active tools, and execution phase
- Session Metadata: User identifiers, environment context, policy constraints
- Turn Analytics: Token usage, latency, and cost per interaction
This persistence allows sessions to:
- Survive Restarts: Sessions persist across worker failures and deployments
- Enable Resumption: Long-running tasks can pause and resume without losing context
- Support HITL: Human-in-the-loop workflows maintain full conversation state while waiting
- Facilitate Debugging: Complete execution history available for post-mortem analysis
Within a session, agents can:
- Reference earlier parts of the conversation
- Build incrementally on previous decisions
- Clarify ambiguous requests through back-and-forth dialogue
- Maintain task context across multiple human inputs
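A rough sketch of the session structures such a schema captures (dataclasses stand in for the actual PostgreSQL tables; field names are illustrative):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str                     # "user", "assistant", or "tool"
    content: str
    prompt_tokens: int = 0
    completion_tokens: int = 0
    timestamp: float = field(default_factory=time.time)

@dataclass
class Session:
    session_id: str
    organization_id: str          # tenant scope, enforced by RLS in practice
    turns: list = field(default_factory=list)

    def append(self, turn: Turn) -> None:
        self.turns.append(turn)

    def total_tokens(self) -> int:
        """Turn analytics roll up into per-session usage."""
        return sum(t.prompt_tokens + t.completion_tokens for t in self.turns)
```

Because every turn is persisted as it happens, a workflow that pauses for human input can reload the full `turns` list on resumption.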
Storage System
Architecture Decision: Multi-Cloud Strategy
Supporting multiple cloud providers solves several challenges:
- Deployment Flexibility: Customers can use their existing cloud infrastructure
- Compliance Requirements: Data residency regulations may dictate storage location
- Cost Optimization: Organizations can leverage existing cloud commitments
- Vendor Independence: Avoid lock-in to single cloud provider
- Hybrid Deployments: Self-hosted control planes can use on-premises storage
Supported Backends:
- AWS S3: Industry-standard object storage with broad regional availability
- Google Cloud Storage: Deep integration with GCP services
- Azure Blob Storage: Native support for Azure-based deployments
- S3-Compatible: MinIO, Wasabi, Backblaze B2, or any S3-compatible service
Storing file metadata in PostgreSQL enables:
- Fast searches without scanning object storage
- Tag-based filtering and organization
- Relationship tracking (which executions used which files)
- Versioning and lifecycle management
File Operations:
- Presigned URLs for direct uploads (bypassing Control Plane for large files)
- Batch operations for efficiency
- Streaming downloads for large files
- Tag-based search and filtering
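Because tags live in the PostgreSQL metadata rather than on the objects themselves, tag-based search is an ordinary filter over metadata rows; a sketch (the row shape is illustrative):

```python
def find_artifacts(metadata_rows, required_tags):
    """Return artifact metadata rows carrying all of the required tags.
    No object-storage scan is needed: only the metadata index is consulted."""
    required = set(required_tags)
    return [row for row in metadata_rows
            if required <= set(row.get("tags", []))]
```

In the real system this is a SQL query over the metadata table; the subset test corresponds to an array-containment predicate.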
Technology Stack & Dependencies
The Control Plane’s architecture is built on proven, production-grade technologies chosen for specific technical requirements.

Core Infrastructure
PostgreSQL - Multi-tenant relational database
- Why chosen: Native row-level security (RLS) for tenant isolation, ACID guarantees, rich query capabilities
- Role: Primary data store for agents, teams, executions, policies, and all configuration
- Key features: JSON columns for flexible schemas, full-text search, complex queries with joins
Redis - In-memory cache and messaging
- Why chosen: Microsecond latency for caching, pub/sub for real-time updates, simple data structures
- Role: Authentication token cache, worker heartbeats, WebSocket session state
- Key features: TTL-based expiration, atomic operations, pub/sub messaging
Temporal - Workflow orchestration engine
- Why chosen: Durable execution, built-in retry logic, distributed task queues, workflow versioning
- Role: Orchestrate all agent and team executions, manage long-running workflows, ensure reliable task distribution
- Key features: Deterministic replay, signal handling, query support, horizontal scalability
LiteLLM - LLM gateway
- Why chosen: Unified API across providers, automatic retries, cost tracking, response caching
- Role: Abstract model provider differences, enable multi-model support, manage API keys and quotas
- Key features: OpenAI-compatible interface, fallback chains, streaming support
External Services
Kubiya API - Authentication service
- Why integrated: Centralized user management, SSO integration, organization provisioning
- Role: Validate JWT tokens, provide organization context, manage API keys
- Integration pattern: External validation with Redis caching for performance
Open Policy Agent (OPA) - Policy engine
- Why chosen: Declarative policy language, decoupled from application code, industry standard for cloud-native authorization
- Role: Enforce access control, usage limits, and governance policies
- Integration pattern: Synchronous policy evaluation before and during execution
Context Graph - Organizational knowledge service
- Why integrated: Organizational memory, cross-agent learning, RAG support
- Role: Provide historical context and domain knowledge to enhance agent intelligence
- Integration pattern: On-demand queries with caching
Cloud Object Storage - Artifact and file storage
- Why multi-cloud: Deployment flexibility, compliance requirements, cost optimization
- Role: Store agent artifacts, execution outputs, uploaded files
- Integration pattern: Provider-agnostic abstraction with metadata in PostgreSQL
Architectural Principles
- Separation of Concerns: Each technology addresses specific requirements (data persistence vs. caching vs. orchestration)
- Best-of-Breed: Choose proven technologies rather than monolithic solutions
- Horizontal Scalability: All components can scale independently
- Cloud-Native: Designed for containerized, distributed deployments
- Operational Maturity: Established technologies with strong community support and tooling
Data Model Overview
The Control Plane’s data model supports multi-tenancy, hierarchical resource organization, and complete execution traceability. Multi-Tenancy:
- Row-Level Security (RLS): PostgreSQL policies automatically filter queries by organization
- Isolation Guarantees: No query can access data from another organization
- Shared Infrastructure: All organizations use same database instance with logical separation
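The RLS pattern can be sketched as the Postgres DDL applied to each tenant table, generated here from Python (the column name `organization_id` and the session setting `app.current_org` are illustrative, not the actual schema):

```python
def tenant_isolation_ddl(table: str) -> str:
    """Generate row-level-security DDL for a tenant-scoped table: once the
    policy is in place, every query on the table is transparently filtered
    to the organization set on the current connection."""
    return (
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;\n"
        f"CREATE POLICY {table}_tenant_isolation ON {table}\n"
        f"  USING (organization_id = current_setting('app.current_org')::uuid);"
    )

print(tenant_isolation_ddl("agents"))
```

The application sets `app.current_org` (e.g. via `SET LOCAL`) at the start of each request's transaction, so no application-level `WHERE` clause can be forgotten.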
Execution Traceability:
- Execution Records: Status, timestamps, initiator, target agent/team
- Session History: Full conversation logs with token usage and costs
- Analytics Data: Aggregated metrics for dashboards and reporting
- Relationship Tracking: Links between executions, jobs, and schedules
Schema Migrations:
- Version Control: All schema changes tracked in migration history
- Zero-Downtime Deployments: Migrations designed for rolling updates
- Rollback Support: Backward-compatible changes enable safe rollbacks
Monitoring & Analytics
Comprehensive observability enables operational insight, troubleshooting, and cost optimization. Structured Logging: JSON-formatted logs with:
- Correlation IDs: Track requests across distributed services
- Contextual Fields: Organization ID, execution ID, user ID automatically included
- Log Levels: Debug, info, warning, error for filtering
- Structured Data: Machine-readable format for log aggregation and analysis
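A minimal sketch of such a JSON formatter using the standard `logging` module (the field names emitted here are illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line; contextual fields such as the
    correlation ID arrive on the record via logging's extra= mechanism."""
    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
            "organization_id": getattr(record, "organization_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("control-plane")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Context fields passed via extra= become attributes on the LogRecord.
logger.info("execution started",
            extra={"correlation_id": "req-42", "organization_id": "org-1"})
```

Because every line is machine-readable JSON, log aggregators can index on `correlation_id` to stitch together a request's path across services.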
Execution Metrics:
- Token Usage: Prompt tokens, completion tokens, total tokens per execution
- Cost Tracking: Per-execution costs based on model pricing
- Latency Metrics: Time spent in different workflow phases
- State Transitions: Track how executions move through pending, running, waiting, completed, failed states
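Per-execution cost tracking reduces to multiplying token counts by per-model prices; a sketch (the prices are illustrative placeholders, not real provider pricing):

```python
# Per-million-token prices in USD. Values are illustrative only.
PRICING = {
    "openai/gpt-4o": {"prompt": 2.50, "completion": 10.00},
    "anthropic/claude-sonnet-4": {"prompt": 3.00, "completion": 15.00},
}

def execution_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Compute the dollar cost of one execution from its token counts."""
    p = PRICING[model]
    return (prompt_tokens * p["prompt"]
            + completion_tokens * p["completion"]) / 1_000_000
```

Summing this per turn is what feeds the immediate per-execution cost visibility described above.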
System Health:
- Service Status: Component-level health checks (database, Redis, Temporal connectivity)
- Worker Presence: Real-time view of active workers per queue
- Queue Depth: Pending task counts for capacity planning
- Error Rates: Failed execution percentages and error categorization
- Turn-by-turn analytics captured during execution
- Immediate visibility into token usage and costs
- WebSocket streaming for live dashboard updates
- Aggregated metrics in PostgreSQL for dashboards
- Time-series analysis for trend identification
- Exportable data for external BI tools
- Prometheus: Metrics export for alerting and monitoring
- OpenTelemetry: Distributed tracing for request path analysis
- Custom Webhooks: Real-time event notifications for external systems