
What is a Dataset?
A dataset is a logical collection of memories that share:
- Common access controls - Who can read, write, and manage the memories
- Unified scope - Private (USER), shared (ORG), or role-based (ROLE)
- Semantic relationships - Connected knowledge graphs within the collection
- Audit trails - Complete history of operations and agent interactions
Dataset Scopes
Every dataset must have a visibility scope that determines who can access its memories; a code sketch of the scope model follows the three scope descriptions below.
USER Scope (Private)
Private datasets accessible only to the user or agent that created them. Use cases:
- Personal notes and reminders
- Agent-specific learning and context
- Private troubleshooting logs
- Individual experimentation
ORG Scope (Organization)
Shared datasets accessible to all members of the organization. This is the default scope for agents. Use cases:
- Team runbooks and procedures
- Shared incident history
- Organizational knowledge base
- Cross-team collaboration
- Agent memory sharing

Benefits:
- Enables knowledge sharing between agents
- Creates collective team intelligence
- Reduces redundant problem-solving
- Builds organizational memory over time
ROLE Scope (Role-Based)
Datasets accessible only to users with specific roles. Use cases:
- Security-sensitive procedures
- Department-specific knowledge
- Compliance documentation
- Role-specific workflows
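
To make the scope model concrete, here is a minimal Python sketch; the class and field names are hypothetical, not the platform’s actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Scope(Enum):
    USER = "USER"  # private to the creating user or agent
    ORG = "ORG"    # shared with the whole organization (default for agents)
    ROLE = "ROLE"  # restricted to users holding specific roles

@dataclass
class Dataset:
    # Illustrative fields only; see the SDK Reference for the real model.
    name: str                  # e.g., "production-runbooks"
    description: str
    scope: Scope = Scope.ORG   # ORG is the default scope for agents
    allowed_roles: list[str] = field(default_factory=list)  # used with ROLE scope

security_docs = Dataset(
    name="security-procedures",
    description="Security-sensitive runbooks and procedures",
    scope=Scope.ROLE,
    allowed_roles=["security-engineer"],
)
```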
Creating Datasets
Via Composer UI

- Navigate to Cognitive Memory → Datasets
- Click Create Dataset
- Configure:
  - Name: Clear, descriptive identifier (e.g., “production-runbooks”)
  - Description: Purpose and contents
  - Visibility Scope: Private, Organization, or Role-based
- Click Create Dataset
Via CLI
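The exact commands are covered in the CLI Reference. For scripted use, here is a hedged Python sketch with a hypothetical SDK client (the import, class, and method names below are assumptions, not the real API):

```python
# Hypothetical SDK sketch - every name here is an assumption for
# illustration; consult the CLI and SDK References for the real interfaces.
from kubiya import CognitiveMemoryClient  # hypothetical import

client = CognitiveMemoryClient(api_key="...")

client.datasets.create(
    name="production-runbooks",
    description="Runbooks for production incident response",
    scope="ORG",
)
```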
Environment-Based Datasets
By default, agents automatically use a dataset named after their execution environment.
How It Works
When an agent executes in an environment (a code sketch follows this list):
- The agent checks for the `KUBIYA_ENVIRONMENT` environment variable
- It uses the environment name/slug as the dataset name (e.g., “production”, “staging”)
- If the dataset doesn’t exist, it is created automatically with ORG scope
- All memory operations use this environment-based dataset
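
A minimal sketch of that resolution logic; the slugification rule, the "default" fallback, and the helper names are assumptions for illustration:

```python
import os
import re

def dataset_exists(name: str) -> bool:
    """Stand-in for a real SDK lookup."""
    return False

def create_dataset(name: str, scope: str) -> None:
    """Stand-in for a real SDK call."""
    print(f"created dataset {name!r} with scope {scope}")

def resolve_dataset_name() -> str:
    """Derive the dataset name from the execution environment."""
    env = os.getenv("KUBIYA_ENVIRONMENT", "default")  # fallback is an assumption
    # Slugify: lowercase, collapse non-alphanumerics to hyphens.
    return re.sub(r"[^a-z0-9]+", "-", env.lower()).strip("-")

name = resolve_dataset_name()           # e.g., "production" or "staging"
if not dataset_exists(name):
    create_dataset(name, scope="ORG")   # auto-created with ORG scope
```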
Example Environment Mapping
| Environment | Default Dataset | Scope | Shared With |
|---|---|---|---|
| production | production | ORG | All agents in production |
| staging | staging | ORG | All agents in staging |
| dev | dev | ORG | All agents in dev |
Benefits of Environment-Based Datasets
Automatic isolation:
- Production memories stay in production
- Staging experiments don’t pollute production knowledge
- Development testing is isolated

Shared learning:
- All agents in the same environment share knowledge
- Agent A stores a finding → Agent B recalls it later
- Collective learning within each environment

Zero configuration:
- No setup required
- Works out of the box
- Automatic dataset creation
Adding Data to Datasets
Upload Files

- Click Upload Files
- Select target dataset
- Drag files or click to browse
- Supports: Text, Markdown, JSON, Code (up to 10MB)
Add Text

- Click Add Text
- Select target dataset
- Add title (optional) and content
- Click Add Note
Ingest Code

- Click Ingest Code
- Select target dataset
- Enter Git repository URL
- Specify branch (default: main)
- Enable authentication if private repo
- Click Ingest from Git

- Click Ingest Code → Local Files tab
- Select target dataset
- Drag code files or click to browse
- Supports: .py, .js, .ts, .go, .java, .rs, .c, .cpp (max 2MB per file)
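
If you script local-file ingestion, the documented constraints above can be checked client-side; a hedged sketch (the final upload call is left as a placeholder):

```python
from pathlib import Path

# Extensions and per-file size cap documented above for local code ingestion.
SUPPORTED = {".py", ".js", ".ts", ".go", ".java", ".rs", ".c", ".cpp"}
MAX_BYTES = 2 * 1024 * 1024  # 2MB per file

def eligible_code_files(root: str):
    """Yield files under `root` that meet the documented ingestion limits."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in SUPPORTED and path.stat().st_size <= MAX_BYTES:
            yield path

for f in eligible_code_files("."):
    print(f"would ingest {f}")  # replace with the real upload call from the SDK
```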
Knowledge Processing Pipeline
When data is added to a dataset, it goes through the cognitive processing pipeline:
- Ingestion - Raw data is stored
- Embedding - Text is converted to semantic vectors via LLM
- Entity Extraction - Key concepts and relationships are identified
- Graph Construction - A knowledge graph is created in Neo4j
- Indexing - Vectors are indexed in pgvector for search
- Cognification - A background job processes data into structured knowledge

Each item reports a processing status, which callers can poll as sketched below:
- Pending: Data uploaded, waiting for processing
- Processing: Cognitive engine analyzing content
- Completed: Ready for semantic search and recall
- Failed: Processing error (check audit logs)
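
Since cognification runs in the background, a caller typically polls until an item reaches Completed or Failed; a minimal sketch, assuming a hypothetical get_status lookup:

```python
import time

TERMINAL = {"Completed", "Failed"}

def get_status(item_id: str) -> str:
    """Stand-in for a real SDK status lookup; returns one of the
    documented states: Pending, Processing, Completed, Failed."""
    return "Completed"

def wait_for_processing(item_id: str, interval: float = 2.0, timeout: float = 300.0) -> str:
    """Poll until the item leaves Pending/Processing or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(item_id)
        if status in TERMINAL:
            return status
        time.sleep(interval)
    raise TimeoutError(f"item {item_id} still processing after {timeout}s")

print(wait_for_processing("item-123"))  # e.g., "Completed"
```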
Dataset Operations
List Datasets
View Dataset Details
Delete Dataset
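The exact commands for these operations are in the CLI Reference; as a rough Python SDK sketch (client, attribute, and method names are hypothetical):

```python
# Hypothetical SDK sketch - names are assumptions, not the real API.
from kubiya import CognitiveMemoryClient  # hypothetical import

client = CognitiveMemoryClient(api_key="...")

# List datasets visible to the caller.
for ds in client.datasets.list():
    print(ds.name, ds.scope)

# View one dataset's details.
details = client.datasets.get("production-runbooks")
print(details.description)

# Delete a dataset permanently.
client.datasets.delete("staging-docs")
```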
Best Practices
Naming Conventions
- Use kebab-case: `production-runbooks`, `security-procedures`
- Include environment or scope: `prod-incidents`, `staging-docs`
- Be descriptive: `sre-kubernetes-runbooks` vs `docs`
Scope Selection
- Default to ORG scope for team collaboration
- Use USER scope only for truly private data
- Reserve ROLE scope for sensitive, compliance-driven content
Dataset Organization
- Group by purpose: runbooks, incidents, procedures, notes
- Separate by environment: production, staging, development
- Divide by team: sre-knowledge, devops-procedures, security-docs
Data Hygiene
- Regular cleanup of outdated memories
- Audit access logs periodically
- Document dataset purpose in description
- Use metadata consistently for filtering
Agent Memory Sharing Example
Here’s how agents share knowledge through organization-scoped datasets; a hedged sketch follows the list below. Key benefits:
- Agent B doesn’t re-solve what Agent A already fixed
- Organizational knowledge compounds over time
- Audit trail shows which agent solved what
- Team learns collectively
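
As a toy sketch of that flow (the store/recall method names are illustrative, and matching is simplified to keep it runnable):

```python
class Memory:
    """Toy in-process stand-in for an ORG-scoped dataset shared by agents."""
    def __init__(self) -> None:
        self._items: list[dict] = []

    def store(self, text: str, agent: str) -> None:
        self._items.append({"text": text, "agent": agent})

    def recall(self, query: str) -> list[dict]:
        # Real recall is semantic; substring matching keeps the sketch runnable.
        return [m for m in self._items if query.lower() in m["text"].lower()]

shared = Memory()  # both agents use the same ORG-scoped dataset, e.g. "production"

# Agent A solves an incident and records the fix.
shared.store("Fixed OOMKilled pods by raising memory limits to 512Mi", agent="agent-a")

# Later, Agent B hits a similar symptom and recalls the finding instead of re-solving.
for hit in shared.recall("oomkilled"):
    print(f"{hit['agent']} already noted: {hit['text']}")
```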
Multi-Tenant Security
Datasets enforce strict multi-tenancy (a simplified access check is sketched after this list):
- Organization isolation - Org A cannot access Org B’s datasets
- User-level filtering - USER-scoped datasets are private
- RBAC enforcement - ROLE-scoped datasets check user roles
- Audit logging - Complete history of all operations
- Data encryption - At rest and in transit
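
The rules above compose into a straightforward read check; a simplified sketch (real enforcement also covers write/manage permissions, auditing, and encryption):

```python
from dataclasses import dataclass, field

@dataclass
class User:
    user_id: str
    org_id: str
    roles: set[str] = field(default_factory=set)

@dataclass
class DatasetMeta:
    org_id: str
    scope: str                              # "USER", "ORG", or "ROLE"
    owner_id: str = ""                      # creator, relevant for USER scope
    allowed_roles: set[str] = field(default_factory=set)

def can_read(user: User, ds: DatasetMeta) -> bool:
    """Simplified read check mirroring the rules above."""
    if user.org_id != ds.org_id:            # organization isolation
        return False
    if ds.scope == "USER":
        return user.user_id == ds.owner_id  # user-level filtering
    if ds.scope == "ROLE":
        return bool(user.roles & ds.allowed_roles)  # RBAC enforcement
    return True                             # ORG scope: any member of the org

alice = User("alice", org_id="org-a", roles={"security-engineer"})
secret = DatasetMeta(org_id="org-a", scope="ROLE", allowed_roles={"security-engineer"})
print(can_read(alice, secret))  # True
```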
Next Steps
- Semantic Search - Learn how to query and recall memories
- Context Graph - Explore the broader Context Graph
- CLI Reference - Complete CLI command reference
- SDK Reference - Python SDK integration guide