Datasets are containers for organizing related memories and knowledge within Cognitive Memory. Each dataset defines a scope of access (private, organizational, or role-based) and provides isolation between different collections of knowledge.
What is a Dataset?
A dataset is a logical collection of memories that share:
- Common access controls - Who can read, write, and manage the memories
- Unified scope - Private (USER), shared (ORG), or role-based (ROLE)
- Semantic relationships - Connected knowledge graphs within the collection
- Audit trails - Complete history of operations and agent interactions
Think of datasets as knowledge repositories that can be scoped to individuals, teams, or entire organizations.
Dataset Scopes
Every dataset must have a visibility scope that determines who can access its memories.
USER Scope (Private)
Private datasets accessible only to the user or agent that created them.
Use cases:
- Personal notes and reminders
- Agent-specific learning and context
- Private troubleshooting logs
- Individual experimentation
ORG Scope (Organization)
Shared datasets accessible to all members of the organization. This is the default scope for agents.
Use cases:
- Team runbooks and procedures
- Shared incident history
- Organizational knowledge base
- Cross-team collaboration
- Agent memory sharing
Why agents use ORG scope by default:
- Enables knowledge sharing between agents
- Creates collective team intelligence
- Reduces redundant problem-solving
- Builds organizational memory over time
ROLE Scope (Role-Based)
Datasets accessible only to users with specific roles.
Use cases:
- Security-sensitive procedures
- Department-specific knowledge
- Compliance documentation
- Role-specific workflows
Creating Datasets
Via Composer UI
- Navigate to Cognitive Memory → Datasets
- Click Create Dataset
- Configure:
  - Name: Clear, descriptive identifier (e.g., “production-runbooks”)
  - Description: Purpose and contents
  - Visibility Scope: Private, Organization, or Role-based
- Click Create Dataset
The dataset is immediately available for storing memories.
Via CLI
# Create organization-scoped dataset
kubiya cognitive dataset create production-runbooks \
  --scope org \
  --description "Production environment runbooks and procedures"

# Create private dataset
kubiya cognitive dataset create my-notes \
  --scope user \
  --description "Personal troubleshooting notes"

# Create role-based dataset
kubiya cognitive dataset create security-docs \
  --scope role \
  --roles security-admin,sre-lead \
  --description "Security procedures and incident response"
Environment-Based Datasets
By default, agents automatically use a dataset named after their execution environment.
How It Works
When an agent executes in an environment:
- Agent checks for the KUBIYA_ENVIRONMENT environment variable
- Uses the environment name/slug as the dataset name (e.g., “production”, “staging”)
- If the dataset doesn’t exist, creates it automatically with ORG scope
- All memory operations use this environment-based dataset
Example Environment Mapping
| Environment | Default Dataset | Scope | Shared With |
|---|---|---|---|
| production | production | ORG | All agents in production |
| staging | staging | ORG | All agents in staging |
| dev | dev | ORG | All agents in dev |
Benefits of Environment-Based Datasets
Automatic isolation:
- Production memories stay in production
- Staging experiments don’t pollute production knowledge
- Development testing is isolated
Shared team context:
- All agents in the same environment share knowledge
- Agent A stores a finding → Agent B recalls it later
- Collective learning within each environment
Zero configuration:
- No setup required
- Works out of the box
- Automatic dataset creation
Example agent workflow:
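A minimal sketch of the equivalent CLI flow, assuming the agent resolves its dataset from KUBIYA_ENVIRONMENT exactly as described above; the auto-creation step is shown with the documented create command purely for illustration:

# The execution environment is exposed to the agent via KUBIYA_ENVIRONMENT
export KUBIYA_ENVIRONMENT=production

# Auto-creation is handled by the agent, but is roughly equivalent to:
kubiya cognitive dataset create production \
  --scope org \
  --description "Environment dataset for production"

# From then on, all memory operations target the environment dataset:
kubiya cognitive memory list --dataset production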
Adding Data to Datasets
Upload Files
Upload documents, markdown files, JSON, or code:
Via UI:
- Click Upload Files
- Select target dataset
- Drag files or click to browse
- Supports: Text, Markdown, JSON, Code (up to 10MB)
Add Text
Store unstructured text, notes, or documentation:
Via UI:
- Click Add Text
- Select target dataset
- Add title (optional) and content
- Click Add Note
Ingest Code
Import code repositories for analysis and knowledge extraction:
Via UI (Git URL):
- Click Ingest Code
- Select target dataset
- Enter Git repository URL
- Specify branch (default: main)
- Enable authentication if the repository is private
- Click Ingest from Git
Via UI (Local Files):
- Click Ingest Code → Local Files tab
- Select target dataset
- Drag code files or click to browse
- Supports: .py, .js, .ts, .go, .java, .rs, .c, .cpp (max 2MB per file)
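This page documents only the UI flow for code ingestion. Purely as an illustration of what a CLI equivalent might look like, the ingest subcommand and flags below are hypothetical, not documented commands:

# HYPOTHETICAL: this subcommand and its flags are illustrative only;
# check kubiya cognitive --help for the actual syntax
kubiya cognitive dataset ingest production-runbooks \
  --git-url https://github.com/example/runbooks.git \
  --branch main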
Knowledge Processing Pipeline
When data is added to a dataset, it goes through the cognitive processing pipeline:
- Ingestion - Raw data is stored
- Embedding - Text is converted to semantic vectors via LLM
- Entity Extraction - Key concepts and relationships are identified
- Graph Construction - A knowledge graph is created in Neo4j
- Indexing - Vectors are indexed in pgvector for search
- Cognification - A background job processes the data into structured knowledge
Processing Status:
- Pending: Data uploaded, waiting for processing
- Processing: Cognitive engine analyzing content
- Completed: Ready for semantic search and recall
- Failed: Processing error (check audit logs)
Monitor processing in the Background Jobs section.
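A CLI path for job monitoring is not shown on this page; if you script it, treat the command below as a hypothetical placeholder for whatever the real interface provides:

# HYPOTHETICAL: placeholder only; the actual job-monitoring command may differ
kubiya cognitive job list --status processing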
Dataset Operations
List Datasets
# List all accessible datasets
kubiya cognitive dataset list
# Filter by scope
kubiya cognitive dataset list --scope org
View Dataset Details
# Get dataset information
kubiya cognitive dataset get production-runbooks
# View memories in dataset
kubiya cognitive memory list --dataset production-runbooks
Delete Dataset
# Delete dataset and all memories
kubiya cognitive dataset delete production-runbooks --confirm
Deleting a dataset is permanent and removes all associated memories, embeddings, and knowledge graph data. This action cannot be undone.
Best Practices
Naming Conventions
- Use kebab-case: production-runbooks, security-procedures
- Include environment or scope: prod-incidents, staging-docs
- Be descriptive: sre-kubernetes-runbooks vs. docs
Scope Selection
- Default to ORG scope for team collaboration
- Use USER scope only for truly private data
- Reserve ROLE scope for sensitive, compliance-driven content
Dataset Organization
- Group by purpose: runbooks, incidents, procedures, notes
- Separate by environment: production, staging, development
- Divide by team: sre-knowledge, devops-procedures, security-docs
Data Hygiene
- Clean up outdated memories regularly (see the review sketch after this list)
- Audit access logs periodically
- Document each dataset’s purpose in its description
- Use metadata consistently for filtering
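A minimal review sketch using only the commands documented on this page; identifying stale content is a manual step here, and any dedicated cleanup command is out of scope:

# Review which datasets you can access and what each one contains
kubiya cognitive dataset list --scope org
kubiya cognitive memory list --dataset production-runbooks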
Agent Memory Sharing Example
Here’s how agents share knowledge through organization-scoped datasets:
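A sketch of the flow in CLI terms; note that the memory store/recall subcommands below are hypothetical names used only to make the sequence concrete, not documented commands:

# Agent A (production environment) stores a finding.
# HYPOTHETICAL: 'store' is illustrative, not a documented subcommand
kubiya cognitive memory store --dataset production \
  "checkout-service OOMKilled pods fixed by raising memory limit to 512Mi"

# Agent B (same org, same environment) recalls it later.
# HYPOTHETICAL: 'recall' is likewise illustrative
kubiya cognitive memory recall --dataset production "checkout-service OOMKilled"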
Key benefits:
- Agent B doesn’t re-solve what Agent A already fixed
- Organizational knowledge compounds over time
- Audit trail shows which agent solved what
- Team learns collectively
Multi-Tenant Security
Datasets enforce strict multi-tenancy:
- Organization isolation - Org A cannot access Org B’s datasets
- User-level filtering - USER-scoped datasets are private
- RBAC enforcement - ROLE-scoped datasets check user roles
- Audit logging - Complete history of all operations
- Data encryption - At rest and in transit
All operations automatically include organization context from the authenticated user or agent.
Next Steps