Kubiya transforms common operational tasks from time-consuming manual processes into intelligent, automated workflows. Here are the most impactful use cases teams implement, with concrete examples and benefits.
DevOps Automation
Application Deployments
Transform deployment processes from error-prone manual steps to reliable, auditable workflows:
Traditional process: 45-90 minutes of manual deployment steps:
Developer requests deployment in Slack
DevOps engineer validates readiness
Manual kubectl commands or CI/CD trigger
Manual monitoring of health checks
Manual rollback if issues detected
Status updates scattered across tools
With Kubiya, the same deployment becomes a single, auditable workflow. Real example:
```python
from kubiya_workflow_sdk import Workflow, Step

# Deployment workflow with proper validation and rollback
deployment_workflow = Workflow(
    name="microservice-deployment",
    description="Deploy microservice with health checks and rollback",
)

# Pre-deployment validation: require a passing CI run
deployment_workflow.add_step(
    Step("validate-build")
    .tool("github")
    .command("gh api repos/myorg/user-service/actions/runs --jq '.workflow_runs[0].conclusion'")
    .condition("== 'success'")
)

# Deploy to staging first
deployment_workflow.add_step(
    Step("deploy-staging")
    .tool("kubectl")
    .command("kubectl set image deployment/user-service user-service=user-service:v2.3.1 -n staging")
    .depends(["validate-build"])
)

# Health check with retries before promoting
deployment_workflow.add_step(
    Step("health-check")
    .tool("curl")
    .command("curl -f http://user-service.staging.svc.cluster.local/health")
    .retry(3)
    .depends(["deploy-staging"])
)

# Production deployment only after staging is healthy
deployment_workflow.add_step(
    Step("deploy-production")
    .tool("kubectl")
    .command("kubectl set image deployment/user-service user-service=user-service:v2.3.1 -n production")
    .depends(["health-check"])
)
```
Infrastructure Provisioning
Automate infrastructure changes with safety and consistency:
Common scenarios:
"Spin up a development environment for the mobile team"
"Scale production cluster capacity for Black Friday traffic"
"Create disaster recovery environment in us-east-2"
"Decommission old staging resources to reduce costs"
```yaml
# Auto-scaling workflow example
name: traffic-surge-response
trigger:
  type: metric_threshold
  condition: "avg(cpu_utilization) > 75% for 10 minutes"
steps:
  - name: analyze-traffic-patterns
    tool: datadog-analyzer
  - name: provision-additional-nodes
    tool: terraform-executor
    inputs:
      action: apply
      var_file: surge-scaling.tfvars
  - name: update-load-balancer
    tool: aws-alb
    inputs:
      targets: ${provision-additional-nodes.new_instances}
  - name: notify-team
    tool: slack
    message: "Auto-scaled cluster for traffic surge: +${new_node_count} nodes"
```
Release Management
Coordinate complex releases across multiple services and teams:
Multi-service release workflow:
```yaml
name: quarterly-release-q2-2024
description: "Coordinated release across 12 microservices"
phases:
  - name: pre-release-validation
    parallel_checks:
      - database-migration-tests
      - api-compatibility-validation
      - security-vulnerability-scans
      - performance-regression-tests
  - name: staged-rollout
    sequence:
      - services: [auth-service, user-service]  # Foundation services first
        strategy: blue-green
      - services: [payment-service, order-service]  # Business logic
        strategy: canary
        depends_on: [auth-service, user-service]
      - services: [notification-service, analytics-service]  # Supporting services
        strategy: rolling
        depends_on: [payment-service, order-service]
  - name: post-release-verification
    checks:
      - end-to-end-user-journey-tests
      - business-metrics-validation
      - error-rate-monitoring
      - performance-benchmarks
```
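The `depends_on` declarations above fully determine a safe rollout order. As a minimal sketch (not Kubiya's actual scheduler), the sequence can be derived with a standard-library topological sort over the services from the example:

```python
from graphlib import TopologicalSorter

# Dependency map from the staged-rollout phase above:
# each service lists the services that must ship before it.
deps = {
    "auth-service": set(),
    "user-service": set(),
    "payment-service": {"auth-service", "user-service"},
    "order-service": {"auth-service", "user-service"},
    "notification-service": {"payment-service", "order-service"},
    "analytics-service": {"payment-service", "order-service"},
}

# static_order() yields a deployment sequence that respects every edge;
# foundation services always come out before the services that depend on them.
rollout_order = list(TopologicalSorter(deps).static_order())
print(rollout_order)
```

This is also why the foundation services can use blue-green while later groups use canary: by the time a canary starts, everything it depends on is already fully live.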
Site Reliability Engineering (SRE)
Accelerate mean time to resolution (MTTR) with intelligent automation:
Before Kubiya (MTTR: 45-120 minutes):
Manual log collection from multiple systems
Context switching between monitoring tools
Tribal knowledge required for diagnosis
Manual remediation steps
With Kubiya (MTTR: 10-25 minutes):
Automated evidence collection
AI-assisted root cause analysis
Context-aware remediation suggestions
One-click remediation execution
Example incident response workflow:
```text
PagerDuty Alert: "Payment API 5xx errors spiking"
        ↓
Kubiya automatically:
1. Collects logs from payment service pods
2. Queries database connection metrics
3. Checks recent deployments and changes
4. Analyzes error patterns with AI
5. Suggests remediation: "Scale payment-db connection pool"
6. Presents one-click remediation options
```
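At its core, steps 4-5 above reduce to mapping collected evidence onto a suggested fix. A minimal, SDK-free sketch of that decision logic (signal names and thresholds are illustrative assumptions, not Kubiya's actual heuristics):

```python
# Sketch of incident triage: evidence in, remediation suggestion out.
# All signal names and thresholds here are made up for illustration.

def suggest_remediation(signals: dict) -> str:
    """Map collected incident evidence to a suggested remediation."""
    if signals.get("db_pool_waiters", 0) > 50:
        return "Scale payment-db connection pool"
    if signals.get("recent_deploy_minutes_ago", 999) < 30:
        return "Roll back the most recent deployment"
    if signals.get("error_rate_5xx", 0.0) > 0.05:
        return "Restart unhealthy payment service pods"
    return "Escalate to on-call engineer with collected evidence"

evidence = {
    "db_pool_waiters": 120,          # from database connection metrics
    "error_rate_5xx": 0.12,          # from service logs
    "recent_deploy_minutes_ago": 240,  # from deployment history
}
print(suggest_remediation(evidence))  # → Scale payment-db connection pool
```

The value of automating this is consistency: every incident gets the same evidence gathered the same way, rather than depending on who happens to be on call.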
Capacity Planning & Scaling
Proactive resource management based on usage patterns and business events:
```yaml
name: black-friday-capacity-preparation
schedule: "October 1st annually"
steps:
  - name: analyze-historical-traffic
    tool: traffic-analyzer
    inputs:
      timeframe: "last 3 black fridays"
      services: ["web-frontend", "api-gateway", "payment-service"]
  - name: forecast-capacity-needs
    tool: ml-forecaster
    inputs:
      historical_data: ${analyze-historical-traffic.patterns}
      growth_rate: 15%  # Expected YoY growth
  - name: pre-scale-infrastructure
    tool: multi-cloud-scaler
    inputs:
      aws_scaling: ${forecast-capacity-needs.aws_requirements}
      gcp_scaling: ${forecast-capacity-needs.gcp_requirements}
  - name: validate-scaling
    tool: load-tester
    inputs:
      traffic_multiplier: ${forecast-capacity-needs.peak_multiplier}
  - name: create-runbook
    tool: documentation-generator
    inputs:
      scaling_decisions: ${pre-scale-infrastructure.actions}
      validation_results: ${validate-scaling.performance_metrics}
```
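Conceptually, the forecasting step is "historical peak times expected growth". A naive stand-in for the `ml-forecaster` tool above (the request rates are invented for illustration):

```python
# Naive capacity forecast: last observed peak scaled by expected YoY growth.
# The request rates below are illustrative, not real traffic data.
historical_peak_rps = [18_000, 22_500, 26_000]  # last 3 Black Fridays
growth_rate = 0.15  # matches the 15% growth_rate input above

forecast_rps = max(historical_peak_rps) * (1 + growth_rate)
peak_multiplier = forecast_rps / historical_peak_rps[-1]

print(f"Provision for ~{forecast_rps:.0f} req/s (x{peak_multiplier:.2f} vs last year)")
```

A real forecaster would account for seasonality and promotion calendars, but even this simple calculation makes the `peak_multiplier` handed to the load-tester step traceable.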
Chaos Engineering & Testing
Automated resilience testing with safe failure injection:
```yaml
name: monthly-chaos-testing
description: "Automated resilience validation"
experiments:
  - name: database-connection-failure
    target: payment-service
    failure_type: network_partition
    blast_radius: 25%  # Only affect 25% of instances
    duration: 5m
    success_criteria:
      - circuit_breaker_triggered: true
      - fallback_mechanism_activated: true
      - user_impact: "< 0.1%"  # Less than 0.1% user errors
  - name: memory-pressure-test
    target: recommendation-engine
    failure_type: memory_leak_simulation
    ramp_up: gradual  # Increase memory pressure slowly
    abort_conditions:
      - pod_restart_required: true
      - response_latency: "> 2000ms"
```
Automated Runbooks
Convert tribal knowledge into executable, maintainable automation:
Database performance investigation runbook:
```yaml
name: database-performance-investigation
trigger:
  alert: "Database query latency > 500ms for 5 minutes"
investigation_steps:
  - name: collect-slow-queries
    tool: postgres-analyzer
    outputs: [slow_query_log, query_plans]
  - name: check-connection-pool
    tool: pgbouncer-metrics
    outputs: [pool_utilization, wait_times]
  - name: analyze-resource-usage
    tool: system-metrics
    inputs:
      resources: [cpu, memory, disk_io, network]
      timeframe: 1h
  - name: generate-recommendations
    tool: ai-analyzer
    inputs:
      slow_queries: ${collect-slow-queries.slow_query_log}
      resource_metrics: ${analyze-resource-usage.metrics}
      pool_metrics: ${check-connection-pool.utilization}
automated_remediation:
  - condition: "connection_pool_exhausted"
    action: scale_connection_pool
    parameters:
      new_pool_size: ${current_pool_size * 1.5}
  - condition: "missing_index_detected"
    action: create_maintenance_ticket
    parameters:
      priority: high
      description: "Index needed: ${generate-recommendations.suggested_indexes}"
```
Self-Service Workflows
Enable development teams with safe, governed self-service capabilities:
Developer self-service scenarios:
"Create a preview environment for PR #1234"
"Run end-to-end tests against staging"
"Scale down my development environment overnight"
"Get performance metrics for my service over the last week"
```yaml
name: developer-environment-management
permissions:
  - role: developer
    allowed_environments: [dev, staging]
    restricted_operations: [production_access, resource_deletion]
templates:
  - name: create-preview-env
    description: "Spin up isolated environment for feature testing"
    inputs:
      - name: branch_name
        type: string
        required: true
      - name: services_to_deploy
        type: array
        default: ["frontend", "api"]
  - name: run-test-suite
    description: "Execute comprehensive test suite"
    inputs:
      - name: test_type
        type: select
        options: ["unit", "integration", "e2e", "performance"]
      - name: target_environment
        type: select
        options: ["dev", "staging"]
```
Compliance & Security Automation
Automate security checks and compliance reporting:
```yaml
name: security-compliance-scan
schedule: "daily at 2 AM"
scans:
  - name: vulnerability-assessment
    tool: trivy-scanner
    targets: [container-images, kubernetes-manifests]
  - name: configuration-drift-check
    tool: config-validator
    policies: [cis-benchmarks, company-policies]
  - name: access-review
    tool: rbac-analyzer
    checks: [unused-permissions, overprivileged-accounts]
  - name: secret-scanning
    tool: secret-scanner
    repositories: [all-active-repos]
reporting:
  - format: pdf
    recipients: [security-team@company.com, compliance@company.com]
  - format: dashboard
    url: https://security-dashboard.company.com
  - format: jira_tickets
    condition: critical_vulnerabilities_found
    assignee: security-team
```
Cost Optimization
Automated resource management to control cloud spending:
```yaml
name: cost-optimization-automation
triggers:
  - schedule: "weekdays at 7 PM"  # Scale down after hours
  - metric: "monthly_spend > budget_threshold"
actions:
  - name: scale-down-dev-environments
    condition: "time.hour >= 19"  # After 7 PM
    tool: multi-environment-scaler
    inputs:
      environments: [dev, qa, staging]
      scale_factor: 0.3  # Scale to 30% capacity
  - name: identify-unused-resources
    tool: resource-analyzer
    age_threshold: 30d  # Unused for 30+ days
  - name: rightsizing-recommendations
    tool: rightsizing-analyzer
    lookback_period: 2w
    utilization_threshold: 20%  # Under 20% utilization
  - name: reserved-instance-optimizer
    tool: ri-optimizer
    savings_threshold: 15%  # Only suggest if 15%+ savings
notifications:
  - type: slack
    channel: "#platform-cost-alerts"
    message: "Monthly savings achieved: $${cost_savings}"
  - type: executive-report
    recipients: [cto@company.com, cfo@company.com]
    frequency: monthly
```
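The rightsizing check above is essentially "average utilization below the threshold over the lookback window". A small sketch of that filter (the function name, data shape, and sample numbers are assumptions for illustration, not the `rightsizing-analyzer` tool's interface):

```python
# Sketch: flag resources whose average utilization over the lookback
# window falls under the threshold (20% in the workflow above).

def rightsizing_candidates(
    utilization: dict[str, list[float]], threshold: float = 0.20
) -> list[str]:
    """Return names of resources averaging below `threshold` utilization."""
    return [
        name
        for name, samples in utilization.items()
        if samples and sum(samples) / len(samples) < threshold
    ]

# Illustrative two-week utilization samples (fraction of capacity).
two_weeks = {
    "dev-runner-1": [0.05, 0.08, 0.04],
    "api-prod-3": [0.62, 0.71, 0.58],
}
print(rightsizing_candidates(two_weeks))  # → ['dev-runner-1']
```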
ChatOps & Collaboration
Slack/Teams Integration
Bring automation directly into team communication channels:
Slash commands: /kubiya deploy frontend v2.1.0 to staging
Natural language: "Can you check if the payment service is healthy?"
Interactive buttons: click to approve deployments or execute runbooks
Status updates: automated progress updates and completion notifications
Example ChatOps integration:
```yaml
name: slack-ops-integration
channels:
  - name: "#deployments"
    permissions: [deploy, rollback, status-check]
    environments: [staging, production]
  - name: "#incidents"
    permissions: [investigate, remediate, escalate]
    auto_response: true
  - name: "#platform-requests"
    permissions: [create-environment, run-tests, access-logs]
    approval_required: false
interactive_commands:
  - trigger: "/deploy {service} {version} to {environment}"
    workflow: service-deployment
    confirmations:
      - condition: environment == "production"
        message: "Warning: This will deploy to production. Continue?"
  - trigger: "investigate {service} performance"
    workflow: performance-investigation
    auto_execute: true
  - trigger: "scale {service} to {replicas} replicas"
    workflow: service-scaling
    safety_checks: [resource-limits, blast-radius]
```
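Kubiya's actual command parser isn't shown here, but a `{placeholder}` trigger template like those above can be sketched as a template-to-regex compiler that extracts workflow inputs from a chat message:

```python
import re

def compile_trigger(template: str) -> re.Pattern:
    """Turn a trigger template like '/deploy {service} {version} to {environment}'
    into a regex with one named capture group per placeholder.
    A sketch of the idea, not Kubiya's real implementation."""
    escaped = re.escape(template)
    # re.escape turns '{name}' into '\{name\}'; rewrite each as a named group
    # that matches one whitespace-free token.
    pattern = re.sub(r"\\\{(\w+)\\\}", r"(?P<\1>\\S+)", escaped)
    return re.compile("^" + pattern + "$")

match = compile_trigger("/deploy {service} {version} to {environment}").match(
    "/deploy frontend v2.1.0 to staging"
)
print(match.groupdict())  # → {'service': 'frontend', 'version': 'v2.1.0', 'environment': 'staging'}
```

The extracted group dict maps directly onto the inputs the referenced workflow expects, which is what makes the confirmation rule (`environment == "production"`) checkable before anything executes.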
Advanced Use Cases
Multi-Cloud Orchestration
Manage resources across different cloud providers seamlessly:
```yaml
name: disaster-recovery-failover
description: "Automated failover from AWS to GCP"
steps:
  - name: detect-aws-outage
    tool: multi-cloud-health-checker
    condition: "aws_availability < 90%"
  - name: backup-aws-data
    tool: aws-backup-service
    parallel: true  # Run while switching traffic
  - name: scale-up-gcp-infrastructure
    tool: gcp-autoscaler
    inputs:
      target_capacity: ${aws_current_capacity}
      regions: ["us-central1", "us-east1"]
  - name: migrate-database
    tool: database-replicator
    source: aws_rds_primary
    destination: gcp_cloud_sql_replica
  - name: switch-dns
    tool: route53-updater
    inputs:
      records: ["api.company.com", "app.company.com"]
      new_targets: ${scale-up-gcp-infrastructure.endpoints}
  - name: notify-stakeholders
    tool: multi-channel-notifier
    channels: [slack, email, sms, status-page]
    message: "DR failover completed: AWS → GCP"
```
Compliance Automation
Maintain regulatory compliance through automated processes:
```yaml
name: gdpr-data-retention-compliance
schedule: "monthly on 1st day"
steps:
  - name: identify-expired-data
    tool: data-retention-scanner
    policies:
      user_activity_logs: 2_years
      payment_records: 7_years
      analytics_data: 1_year
  - name: data-anonymization
    tool: gdpr-anonymizer
    data_types: [user_profiles, behavioral_data]
    anonymization_method: k_anonymity
  - name: secure-deletion
    tool: secure-delete-service
    confirmation_required: true
    audit_trail: complete
  - name: compliance-reporting
    tool: compliance-reporter
    outputs: [gdpr_compliance_certificate, audit_log]
    recipients: [dpo@company.com, legal@company.com]
```
Measuring Success
Key Metrics Teams Track
Operational Efficiency
70% reduction in deployment time
85% fewer manual interventions
40% faster incident resolution
90% reduction in environment setup time
Quality & Reliability
95% deployment success rate
60% fewer production incidents
99.9% automated workflow reliability
Zero security policy violations
Team Productivity
4 hours/week saved per developer
80% of deployments are self-service
50% reduction in after-hours escalations
3x faster onboarding for new team members
Cost Optimization
25% reduction in cloud spend
90% reduction in idle resources
60% better resource utilization
ROI achieved within 3 months
Getting Started
Ready to implement these use cases? Start with the patterns most relevant to your team's pain points.
Success Pattern: Most teams start with read-only operations (monitoring, status checks) to build confidence, then gradually automate write operations (deployments, scaling) as they see the reliability and audit capabilities.
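As a concrete first step in that pattern, a read-only automation can be as small as a health probe: it inspects state but never changes it. A plain-Python sketch (the endpoint is a placeholder; substitute your own services):

```python
import urllib.error
import urllib.request

def check_health(url: str, timeout: float = 5.0) -> dict:
    """Read-only health probe: reports status, never mutates anything."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"url": url, "healthy": resp.status == 200, "status": resp.status}
    except (urllib.error.URLError, OSError) as exc:
        return {"url": url, "healthy": False, "error": str(exc)}

# Placeholder endpoint; an unreachable host simply reports healthy=False.
print(check_health("http://user-service.staging.svc.cluster.local/health"))
```

Once a check like this runs reliably on a schedule and posts its results to chat, the same plumbing can be trusted with write operations.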