Data Structure

OstrichDB uses a hierarchical data structure that provides intuitive organization, efficient querying, and natural user isolation. This document explains how data is organized, stored, and accessed.

Hierarchical Organization

Overview

The OstrichDB hierarchy consists of four levels:

Projects (User isolation & top-level organization)
  └── Collections (Logical data groupings, can be encrypted)
      └── Clusters (Record groupings for organization)
          └── Records (Individual data items with types & values)

Level Purposes

Projects

User isolation: Each user has their own project namespace
Top-level organization: Logical separation of different applications or use cases
Access control: Projects define security boundaries
Resource management: Quotas and limits applied at project level

Collections

Data grouping: Related data organized together (e.g., “users”, “products”, “orders”)
Encryption boundary: Collections can be encrypted independently
Schema flexibility: Each collection can have different data organization
Logical separation: Different data types or business entities

Clusters

Record organization: Group related records together (e.g., “active_users”, “archived_orders”)
Query optimization: Searches can target specific clusters
Performance tuning: Data locality for related records
Logical subsets: Further categorization within collections

Records

Data storage: Individual data items with names, types, and values
Type enforcement: Each record has a specific data type
Atomic operations: Records are the smallest unit of data manipulation
Metadata: Each record includes creation/modification timestamps
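
As a rough illustration of the four levels, the hierarchy can be modelled as nested containers. The class and field names below are assumptions chosen for this sketch, not OstrichDB's internal types.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Record:
    name: str
    type: str  # e.g. "STRING", "INTEGER"
    value: Any


@dataclass
class Cluster:
    name: str
    records: dict[str, Record] = field(default_factory=dict)


@dataclass
class Collection:
    name: str
    encrypted: bool = False
    clusters: dict[str, Cluster] = field(default_factory=dict)


@dataclass
class Project:
    name: str
    owner: str
    collections: dict[str, Collection] = field(default_factory=dict)


# Walk down project -> collection -> cluster -> record.
project = Project(name="my-project", owner="user123")
users = project.collections.setdefault("users", Collection("users", encrypted=True))
active = users.clusters.setdefault("active_users", Cluster("active_users"))
active.records["username"] = Record("username", "STRING", "john_doe")

print(project.collections["users"].clusters["active_users"].records["username"].value)
# prints: john_doe
```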

File System Structure

Physical Layout

data/
├── projects/
│   ├── project1/
│   │   ├── metadata.json
│   │   └── collections/
│   │       ├── users/
│   │       │   ├── metadata.json
│   │       │   └── clusters/
│   │       │       ├── active_users/
│   │       │       │   ├── metadata.json
│   │       │       │   └── records/
│   │       │       │       ├── username.json
│   │       │       │       ├── email.json
│   │       │       │       └── age.json
│   │       │       └── archived_users/
│   │       │           ├── metadata.json
│   │       │           └── records/
│   │       └── products/
│   │           ├── metadata.json
│   │           └── clusters/
│   └── project2/
└── system/
    ├── users.json
    └── config.json
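
Given the layout above, the location of a record file follows directly from the four names. The helper below is a minimal sketch; `record_path` and the name check are hypothetical and only mirror the directory tree shown here.

```python
from pathlib import PurePosixPath


def _check_name(name: str) -> str:
    # Hypothetical safeguard: reject names that could escape the data root.
    if not name or "/" in name or "\\" in name or name in (".", ".."):
        raise ValueError(f"invalid name: {name!r}")
    return name


def record_path(root, project, collection, cluster, record):
    """Build the path of a record file under the layout shown above."""
    return PurePosixPath(
        root,
        "projects", _check_name(project),
        "collections", _check_name(collection),
        "clusters", _check_name(cluster),
        "records", f"{_check_name(record)}.json",
    )


print(record_path("data", "project1", "users", "active_users", "username"))
# prints: data/projects/project1/collections/users/clusters/active_users/records/username.json
```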

Metadata Files

Project Metadata

{
  "name": "my-project",
  "owner": "user123",
  "created": "2024-01-15T10:00:00Z",
  "modified": "2024-01-15T10:00:00Z",
  "description": "Project description",
  "settings": {
    "default_encryption": false,
    "backup_enabled": true
  }
}

Collection Metadata

{
  "name": "users",
  "encrypted": true,
  "created": "2024-01-15T10:00:00Z",
  "modified": "2024-01-15T10:00:00Z",
  "record_count": 150,
  "cluster_count": 3,
  "encryption": {
    "algorithm": "AES-256",
    "key_id": "user123_master"
  }
}

Cluster Metadata

{
  "name": "active_users",
  "created": "2024-01-15T10:00:00Z",
  "modified": "2024-01-15T10:00:00Z",
  "record_count": 45,
  "size_bytes": 2048,
  "last_accessed": "2024-01-15T15:30:00Z"
}

Record File Structure

{
  "name": "username",
  "type": "STRING",
  "value": "john_doe",
  "created": "2024-01-15T10:00:00Z",
  "modified": "2024-01-15T10:00:00Z",
  "metadata": {
    "id": "rec_123456",
    "size_bytes": 64
  }
}
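
A record file in this shape can be read with any JSON parser. The sketch below assumes that `name`, `type`, and `value` are the required fields; that requirement and the `parse_record` helper are illustrative and not taken from the OstrichDB source.

```python
import json

# Assumed minimum set of fields for a usable record.
REQUIRED_FIELDS = ("name", "type", "value")


def parse_record(text: str) -> dict:
    """Parse the JSON text of one record file and check its required fields."""
    record = json.loads(text)
    if not isinstance(record, dict):
        raise ValueError("record file must contain a JSON object")
    missing = [key for key in REQUIRED_FIELDS if key not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    return record


sample = """
{
  "name": "username",
  "type": "STRING",
  "value": "john_doe",
  "created": "2024-01-15T10:00:00Z",
  "modified": "2024-01-15T10:00:00Z",
  "metadata": {"id": "rec_123456", "size_bytes": 64}
}
"""
record = parse_record(sample)
print(record["name"], record["type"], record["value"])
# prints: username STRING john_doe
```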

Data Types and Storage

Supported Types

Basic Types

STRING/STR/CHAR: Text data, stored as UTF-8
INTEGER/INT: 64-bit signed integers
FLOAT/FLT: 64-bit floating-point numbers
BOOLEAN/BOOL: True/false values

Temporal Types

DATE: Date values (YYYY-MM-DD format)
TIME: Time values (HH:MM:SS format)
DATETIME: Combined date and time (ISO 8601 format)

Special Types

UUID: Universally unique identifiers
NULL: Null/empty values
CREDENTIAL: Encrypted credential storage

Array Types

Arrays of basic, temporal, and UUID types:

[]STRING, []INTEGER, []FLOAT, []BOOLEAN
[]DATE, []TIME, []DATETIME, []UUID
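
Because several types have short aliases, a client may want to normalize type names before comparing them. The following sketch assumes the long form is canonical and that aliases are also accepted inside array types (for example `[]INT`); neither assumption is confirmed by this document.

```python
# Aliases taken from the type list above; treating the long form as canonical is an assumption.
ALIASES = {"STR": "STRING", "CHAR": "STRING", "INT": "INTEGER", "FLT": "FLOAT", "BOOL": "BOOLEAN"}

SCALAR_TYPES = {
    "STRING", "INTEGER", "FLOAT", "BOOLEAN",
    "DATE", "TIME", "DATETIME",
    "UUID", "NULL", "CREDENTIAL",
}
# Only the element types listed under "Array Types".
ARRAY_ELEMENT_TYPES = SCALAR_TYPES - {"NULL", "CREDENTIAL"}


def normalize_type(name: str) -> str:
    """Return the canonical spelling of a type name, e.g. 'str' -> 'STRING'."""
    raw = name.strip().upper()
    is_array = raw.startswith("[]")
    base = raw[2:] if is_array else raw
    base = ALIASES.get(base, base)
    allowed = ARRAY_ELEMENT_TYPES if is_array else SCALAR_TYPES
    if base not in allowed:
        raise ValueError(f"unsupported type: {name!r}")
    return f"[]{base}" if is_array else base


print(normalize_type("str"))    # prints: STRING
print(normalize_type("[]INT"))  # prints: []INTEGER
```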

Type Storage Examples

// String record
{
  "name": "username",
  "type": "STRING",
  "value": "alice_smith"
}

// Integer record
{
  "name": "age",
  "type": "INTEGER",
  "value": 28
}

// Array record
{
  "name": "tags",
  "type": "[]STRING",
  "value": ["admin", "active", "verified"]
}

// Datetime record
{
  "name": "created_date",
  "type": "DATETIME",
  "value": "2024-01-15T10:30:00Z"
}
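
Type enforcement can be pictured as a check that a record's value matches its declared type. The sketch below is one possible client-side check built from the formats listed above; it is not OstrichDB's own validator, it uses canonical type names only, and it leaves out CREDENTIAL because its format is not described here.

```python
import uuid
from datetime import date, datetime, time


def _parses(parser, text) -> bool:
    # True if `text` is a string that `parser` accepts.
    if not isinstance(text, str):
        return False
    try:
        parser(text)
        return True
    except ValueError:
        return False


CHECKS = {
    "STRING": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "INTEGER": lambda v: isinstance(v, int) and not isinstance(v, bool) and -2**63 <= v < 2**63,
    "FLOAT": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "BOOLEAN": lambda v: isinstance(v, bool),
    "DATE": lambda v: _parses(date.fromisoformat, v),
    "TIME": lambda v: _parses(time.fromisoformat, v),
    # Rewrite a trailing "Z" so older Python versions can parse it.
    "DATETIME": lambda v: _parses(lambda s: datetime.fromisoformat(s.replace("Z", "+00:00")), v),
    "UUID": lambda v: _parses(uuid.UUID, v),
    "NULL": lambda v: v is None,
}


def is_valid(type_name: str, value) -> bool:
    """Check a value against a canonical type name such as 'INTEGER' or '[]STRING'."""
    if type_name.startswith("[]"):
        return isinstance(value, list) and all(is_valid(type_name[2:], v) for v in value)
    check = CHECKS.get(type_name)
    if check is None:
        raise ValueError(f"no check defined for type {type_name!r}")
    return check(value)


print(is_valid("INTEGER", 28))                                # prints: True
print(is_valid("[]STRING", ["admin", "active", "verified"]))  # prints: True
print(is_valid("DATE", "2024-13-01"))                         # prints: False
```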

Encryption and Security

Collection-Level Encryption

Collections can be encrypted using user-specific master keys.

Encrypted Collection Structure

encrypted_collection/
├── metadata.json (unencrypted metadata)
└── clusters/
    └── cluster_name/
        ├── metadata.json (unencrypted)
        └── records/
            ├── record1.enc (encrypted record data)
            └── record2.enc (encrypted record data)

Encryption Process

Key derivation: Master key derived from user credentials
Data encryption: Record data encrypted with AES-256
Metadata preservation: Structure metadata remains unencrypted
Transparent operations: Automatic encrypt/decrypt during operations
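
The key derivation step can be sketched with a standard password-based KDF. This document does not say which KDF OstrichDB uses, so PBKDF2-HMAC-SHA256, the salt handling, and the iteration count below are all assumptions. The AES-256 step itself is not shown because Python's standard library has no AES implementation.

```python
import hashlib
import os


def derive_master_key(password: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Derive a 32-byte key (the size AES-256 needs) from a password.

    PBKDF2 is an assumed choice for this sketch, not OstrichDB's documented KDF.
    """
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations, dklen=32)


# A random salt would be stored next to the user record (hypothetical storage choice).
salt = os.urandom(16)
key = derive_master_key("correct horse battery staple", salt)
print(len(key))  # prints: 32
```

The same password and salt always yield the same key, which is what lets encrypt and decrypt happen transparently once the user has authenticated.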

Access Control

Project ownership: Users can only access their own projects
Collection access: Encrypted collections require proper keys
Record-level security: All operations validate user permissions

Query and Access Patterns

Hierarchical Queries

Queries follow the hierarchical structure:

/projects/{project}/collections/{collection}/clusters/{cluster}/records
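
A path in this form alternates level keywords and names, so it can be split into its components mechanically. The parser below is a hypothetical sketch; in particular, it assumes that a trailing keyword with no name (such as the final `/records` above) means "list everything at this level".

```python
LEVELS = ("projects", "collections", "clusters", "records")


def parse_path(path: str) -> dict:
    """Split a hierarchical path into {level keyword: name or None}."""
    parts = path.strip("/").split("/")
    if len(parts) > 2 * len(LEVELS):
        raise ValueError("path has too many segments")
    result = {}
    for depth, keyword in enumerate(LEVELS):
        keyword_index = 2 * depth
        if keyword_index >= len(parts):
            break
        if parts[keyword_index] != keyword:
            raise ValueError(f"expected {keyword!r}, got {parts[keyword_index]!r}")
        name_index = keyword_index + 1
        # None marks a level that was addressed without a specific name.
        result[keyword] = parts[name_index] if name_index < len(parts) else None
    return result


print(parse_path("/projects/project1/collections/users/clusters/active_users/records"))
# prints: {'projects': 'project1', 'collections': 'users', 'clusters': 'active_users', 'records': None}
```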

Efficient Access Patterns

Direct access: Fast lookup by exact path
Hierarchical browsing: Navigate structure level by level
Filtered queries: Search within specific levels
Bulk operations: Operate on entire clusters or collections

Indexing Strategy

Path-based indexing: Fast hierarchical lookups
Type indexing: Quick filtering by data type
Name indexing: Efficient record name searches
Metadata indexing: Query by creation date, size, etc.
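
To make the indexing idea concrete, here is a small in-memory sketch that keeps one lookup table per indexed attribute. `RecordIndex` and its methods are hypothetical names for illustration; how OstrichDB actually builds and stores its indexes is not described in this document.

```python
from collections import defaultdict


class RecordIndex:
    """Toy index: path -> record, plus reverse lookups by type and by name."""

    def __init__(self):
        self.by_path = {}
        self.by_type = defaultdict(set)
        self.by_name = defaultdict(set)

    def add(self, path: str, record: dict) -> None:
        self.by_path[path] = record
        self.by_type[record["type"]].add(path)
        self.by_name[record["name"]].add(path)

    def find(self, type_name=None, name=None) -> list:
        """Return sorted paths matching every criterion that was given."""
        candidates = set(self.by_path)
        if type_name is not None:
            candidates &= self.by_type.get(type_name, set())
        if name is not None:
            candidates &= self.by_name.get(name, set())
        return sorted(candidates)


index = RecordIndex()
index.add("users/active_users/username", {"name": "username", "type": "STRING", "value": "john_doe"})
index.add("users/active_users/age", {"name": "age", "type": "INTEGER", "value": 28})
index.add("users/archived_users/username", {"name": "username", "type": "STRING", "value": "old_user"})

print(index.find(type_name="INTEGER"))
# prints: ['users/active_users/age']
```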

Performance Considerations

Storage Efficiency

Separate metadata: Metadata separated from data for faster queries
File-per-record: Individual record files for atomic operations
Hierarchical caching: Cache frequently accessed metadata
Lazy loading: Load data only when accessed
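
One common way to get atomic updates from a file-per-record layout is to write a temporary file and rename it over the target. Whether OstrichDB does exactly this is not stated here; the function below is a sketch of the general technique, and `write_record_atomically` is a hypothetical name.

```python
import contextlib
import json
import os
import tempfile


def write_record_atomically(path: str, record: dict) -> None:
    """Replace the record file at `path` so readers never see a partial write."""
    directory = os.path.dirname(path) or "."
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(record, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

A reader that opens the record file at any moment sees either the old contents or the new contents, never a half-written file.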

Query Performance

Path optimization: Direct path resolution without scanning
Metadata queries: Fast listing without loading record data
Filtered scans: Early termination when filters don’t match
Concurrent access: Multiple readers, controlled writers
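
One reading of early termination is that a record is dropped as soon as any filter rejects it, without evaluating the remaining filters. The sketch below shows that behaviour with plain predicates; `matches` and `scan` are hypothetical helpers, and the optional `limit` is an added convenience rather than a documented feature.

```python
def matches(record: dict, filters) -> bool:
    # all() stops at the first predicate that returns False.
    return all(predicate(record) for predicate in filters)


def scan(records, filters, limit=None):
    """Yield matching records lazily, stopping once `limit` results are found."""
    found = 0
    for record in records:
        if matches(record, filters):
            yield record
            found += 1
            if limit is not None and found >= limit:
                return


rows = [
    {"name": "age", "type": "INTEGER", "value": 28},
    {"name": "username", "type": "STRING", "value": "alice_smith"},
    {"name": "email", "type": "STRING", "value": "a@b.c"},
]
is_string = lambda r: r["type"] == "STRING"
is_long = lambda r: len(r["value"]) > 5  # only safe to call after is_string has passed

print([r["name"] for r in scan(rows, [is_string, is_long])])
# prints: ['username']
```

Putting the cheapest or most selective filter first means most records are rejected after a single check.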

Memory Management

Stream processing: Large results streamed rather than loaded entirely
Resource cleanup: Automatic cleanup with defer patterns
Memory pools: Reuse allocated memory for common operations
Garbage avoidance: Manual memory management avoids garbage-collection pauses and overhead

Backup and Recovery

Backup Strategy

Hierarchical backups: Backup at any level of hierarchy
Incremental backups: Only changed data since last backup
Metadata preservation: Backup includes all metadata
Encryption aware: Encrypted data backed up encrypted
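
An incremental backup needs a way to pick out what changed since the previous run. A minimal sketch, assuming the `modified` timestamp in each record file is the change marker (the document does not say how OstrichDB detects changes):

```python
from datetime import datetime


def parse_timestamp(text: str) -> datetime:
    # Rewrite a trailing "Z" so older Python versions can parse the UTC offset.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def changed_since(records, last_backup: str) -> list:
    """Return the records modified strictly after the last backup time."""
    cutoff = parse_timestamp(last_backup)
    return [r for r in records if parse_timestamp(r["modified"]) > cutoff]


records = [
    {"name": "username", "modified": "2024-01-15T10:00:00Z"},
    {"name": "email", "modified": "2024-01-16T08:30:00Z"},
]
print([r["name"] for r in changed_since(records, "2024-01-15T12:00:00Z")])
# prints: ['email']
```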

Recovery Process

Point-in-time recovery: Restore to specific timestamp
Partial recovery: Restore specific projects or collections
Consistency checks: Verify data integrity after recovery
Rollback capability: Revert changes if needed

Best Practices

Data Organization

Logical grouping: Use projects for different applications
Collection design: Group related data in collections
Cluster strategy: Use clusters for natural data subsets
Naming conventions: Use consistent, descriptive names

Performance Optimization

Access patterns: Design hierarchy around query patterns
Batch operations: Group related operations together
Encryption planning: Consider encryption overhead for sensitive data
Monitoring: Track performance metrics and query patterns

Security Guidelines

Encryption boundaries: Use collection-level encryption appropriately
Access control: Implement proper user isolation
Key management: Secure handling of encryption keys
Audit trails: Log all data access and modifications

Next Steps

Learn more about working with the data structure:

API Reference - How to interact with the hierarchy via REST API
Security - Encryption and access control details
Configuration - Database configuration options