ADR-007: Platform Observability Service Separation
Status
✅ Accepted
Date
2024-01-15
Context
Following the successful implementation of centralized Configuration Management (ADR-004), we identified an architectural inconsistency in how observability is handled across the Sindhan AI platform.
The Problem
"like configuration management, does the observability be separate as all components require observability? observability in the core component is specific to agent observability.. what you think?"
Current Inconsistent Architecture
- ✅ Configuration Management: Properly separated as infrastructure service (all components need it)
- ❌ Observability: Listed as one of 8 core agent components, but ALL components need observability
Two Different Observability Needs Identified
1. Platform Observability (Infrastructure Concern):
- Monitor ALL components (Identity, Memory, Context, etc.)
- System health, performance, errors, resource usage
- Used by operations teams, platform engineers, SREs
- Standard metrics: uptime, response times, error rates
2. Agent Observability (Agent-Specific Concern):
- Monitor agent behavior, decisions, and intelligence
- Agent learning progress, decision quality, behavioral patterns
- Used by AI researchers, business analysts, agent developers
- AI-specific metrics: decision accuracy, learning curves, goal achievement
The Architectural Inconsistency
Decision
Separate observability into two distinct concerns:
- Create a Platform Observability Service as an infrastructure component
- Rename the current Observability component to Agent Observability and scope it as an agent-specific component
- Apply the same pattern as Configuration Management for consistency
New Architecture
Rationale
Why This Separation is Correct
- Architectural Consistency: Follows same pattern as Configuration Management
- Separation of Concerns: Infrastructure vs. domain-specific concerns
- Eliminate Duplication: Single platform observability implementation
- Clear Responsibilities: Each system optimizes for its domain
- Operational Excellence: Unified monitoring across all components
Cross-Cutting Concern Principle
Established Principle: "Cross-cutting concerns that ALL components need should be infrastructure services, not component capabilities"
- ✅ Configuration Management: All components need configuration → Infrastructure service
- ✅ Platform Observability: All components need monitoring → Infrastructure service ← NEW
Domain-Specific vs. Infrastructure
Platform Observability Service (Infrastructure):
- Who Uses: Operations teams, SREs, platform engineers
- What Monitors: Component health, performance, system metrics
- How: Standard metrics, logs, traces across ALL components
- Examples: component_requests_per_second, component_error_rate, component_memory_usage
Agent Observability Component (Domain-Specific):
- Who Uses: AI researchers, business analysts, agent developers
- What Monitors: Agent intelligence, behavior, learning, decisions
- How: AI-specific analytics and behavioral tracking
- Examples: agent_decision_accuracy, agent_learning_progress, agent_goal_achievement
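The example metric names suggest that a name prefix alone could tell the two systems apart. The following is a minimal sketch under that assumption; `MetricScope` and `classify_metric` are hypothetical names for illustration, not part of any documented platform API.

```rust
/// Which observability system owns a metric.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum MetricScope {
    /// Infrastructure metrics handled by the Platform Observability Service.
    Platform,
    /// AI-specific metrics handled by the Agent Observability component.
    Agent,
}

/// Classify a metric by its name prefix (assumed convention).
/// Returns `None` for names that follow neither convention.
pub fn classify_metric(name: &str) -> Option<MetricScope> {
    if name.starts_with("component_") {
        Some(MetricScope::Platform)
    } else if name.starts_with("agent_") {
        Some(MetricScope::Agent)
    } else {
        None
    }
}
```

With this convention, `classify_metric("component_error_rate")` would route to the platform service and `classify_metric("agent_decision_accuracy")` to the agent component, while unprefixed names are rejected rather than guessed.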
Consequences
Positive Outcomes
- ✅ Architectural Consistency: Platform Observability follows Configuration Management pattern
- ✅ Clear Separation: Infrastructure concerns separated from domain concerns
- ✅ No Duplication: Single observability implementation shared by all components
- ✅ Operational Excellence: Unified monitoring and alerting across platform
- ✅ Domain Optimization: Agent Observability optimized for AI-specific concerns
- ✅ Better Maintainability: Each system focused on its specific domain
Platform Observability Service Benefits
Standardization:
- Consistent metrics across all components
- Unified logging and tracing
- Standard health check patterns
- Common alerting and dashboard infrastructure
Performance:
- Optimized collection and aggregation
- Efficient storage and querying
- Minimal overhead on components
- Scalable monitoring infrastructure
Operations:
- Single pane of glass for system health
- Unified alerting and incident response
- Consistent troubleshooting procedures
- Centralized monitoring configuration
Agent Observability Component Benefits
AI-Specific Focus:
- Specialized for agent intelligence monitoring
- AI/ML specific metrics and analytics
- Agent learning and adaptation tracking
- Decision quality and effectiveness measurement
Business Value:
- Track agent ROI and effectiveness
- Monitor goal achievement and performance
- Analyze agent behavioral patterns
- Predict agent capabilities and limitations
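To make the decision-quality and learning-tracking items above more concrete, here is a minimal sketch of how an accuracy figure and a crude learning curve might be derived from recorded decisions. The fields of `DecisionRecord` and both function names are assumptions made for illustration; the real component may score decisions quite differently.

```rust
/// One agent decision with its evaluated outcome (hypothetical shape).
#[derive(Debug, Clone)]
pub struct DecisionRecord {
    pub agent_id: String,
    /// Whether later evaluation judged the decision correct.
    pub correct: bool,
}

/// Share of correct decisions, or `None` when there is nothing to score.
pub fn decision_accuracy(records: &[DecisionRecord]) -> Option<f64> {
    if records.is_empty() {
        return None;
    }
    let correct = records.iter().filter(|r| r.correct).count();
    Some(correct as f64 / records.len() as f64)
}

/// Accuracy over consecutive windows of `window` decisions.
/// Rising values across windows indicate learning progress.
pub fn learning_curve(records: &[DecisionRecord], window: usize) -> Vec<f64> {
    if window == 0 {
        // `chunks(0)` would panic, so an empty curve is returned instead.
        return Vec::new();
    }
    records
        .chunks(window)
        .filter_map(|chunk| decision_accuracy(chunk))
        .collect()
}
```

Returning `None` for an empty history avoids reporting a misleading 0% accuracy for an agent that has not yet made any decisions.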
Implementation Details
Platform Observability Service Responsibilities
Core Services:
Metrics Collection:
- Component performance metrics
- System resource utilization
- Request/response patterns
- Error rates and patterns
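As a rough illustration of the metrics collection items above, the sketch below keeps per-component request and error counters in memory and derives an error rate from them. `MetricsRegistry` and its methods are hypothetical; a real service would more likely export counters to a dedicated metrics backend.

```rust
use std::collections::HashMap;

/// Request and error counters for one component.
#[derive(Debug, Default, Clone, Copy)]
pub struct ComponentCounters {
    pub requests: u64,
    pub errors: u64,
}

/// In-memory metrics store keyed by component name.
#[derive(Debug, Default)]
pub struct MetricsRegistry {
    counters: HashMap<String, ComponentCounters>,
}

impl MetricsRegistry {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record one handled request and whether it failed.
    pub fn record_request(&mut self, component: &str, failed: bool) {
        let entry = self.counters.entry(component.to_string()).or_default();
        entry.requests += 1;
        if failed {
            entry.errors += 1;
        }
    }

    /// Fraction of requests that failed, or `None` if the component
    /// has not recorded any requests yet.
    pub fn error_rate(&self, component: &str) -> Option<f64> {
        let c = self.counters.get(component)?;
        if c.requests == 0 {
            return None;
        }
        Some(c.errors as f64 / c.requests as f64)
    }
}
```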
Logging Infrastructure:
- Centralized log aggregation
- Log parsing and indexing
- Log retention and archival
- Log-based alerting
Distributed Tracing:
- Request tracing across components
- Performance bottleneck identification
- Dependency mapping
- Transaction flow analysis
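One way to approach request tracing is a span that records itself when it goes out of scope, so failed requests are traced as reliably as successful ones. This is a minimal sketch under that assumption: `Tracer`, `Span`, `SpanRecord`, and `failing_handler` are hypothetical names, and the parent link is a plain span name rather than a real trace ID.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// A finished span as stored by the tracer.
#[derive(Debug, Clone)]
pub struct SpanRecord {
    pub name: String,
    pub parent: Option<String>,
    pub elapsed: Duration,
}

/// Collects finished spans in memory.
#[derive(Clone, Default)]
pub struct Tracer {
    finished: Arc<Mutex<Vec<SpanRecord>>>,
}

/// An in-flight span. It is recorded when dropped, so early returns
/// through `?` or `Err(..)` still produce a record.
pub struct Span {
    name: String,
    parent: Option<String>,
    started: Instant,
    sink: Arc<Mutex<Vec<SpanRecord>>>,
}

impl Tracer {
    /// Start a span, optionally linked to a parent span by name.
    pub fn start_span(&self, name: &str, parent: Option<&Span>) -> Span {
        Span {
            name: name.to_string(),
            parent: parent.map(|p| p.name.clone()),
            started: Instant::now(),
            sink: Arc::clone(&self.finished),
        }
    }

    /// Snapshot of all spans finished so far, in completion order.
    pub fn finished_spans(&self) -> Vec<SpanRecord> {
        self.finished.lock().unwrap().clone()
    }
}

impl Drop for Span {
    fn drop(&mut self) {
        let record = SpanRecord {
            name: self.name.clone(),
            parent: self.parent.clone(),
            elapsed: self.started.elapsed(),
        };
        self.sink.lock().unwrap().push(record);
    }
}

/// Hypothetical handler that fails; both spans are still recorded.
pub fn failing_handler(tracer: &Tracer) -> Result<(), String> {
    let root = tracer.start_span("component.process_request", None);
    let _child = tracer.start_span("component.load_context", Some(&root));
    Err("context unavailable".to_string())
}
```

Because Rust drops locals in reverse declaration order, the child span is recorded before its parent, which matches the order in which the operations actually complete.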
Health Monitoring:
- Component health checks
- Service availability monitoring
- Dependency health tracking
- Automated failure detection
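Automated failure detection can be sketched as a heartbeat timeout: a component that has not reported within a configured window is treated as failed. The `HealthMonitor` type below is a hypothetical illustration, and timestamps are passed in explicitly (as time since service start) only to keep the example deterministic.

```rust
use std::collections::HashMap;
use std::time::Duration;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum HealthStatus {
    Healthy,
    /// No heartbeat within the allowed window.
    Unresponsive,
}

/// Tracks the last heartbeat per component.
pub struct HealthMonitor {
    timeout: Duration,
    last_seen: HashMap<String, Duration>,
}

impl HealthMonitor {
    pub fn new(timeout: Duration) -> Self {
        Self { timeout, last_seen: HashMap::new() }
    }

    /// Record a successful health check for `component` at time `now`.
    pub fn heartbeat(&mut self, component: &str, now: Duration) {
        self.last_seen.insert(component.to_string(), now);
    }

    /// Status of one component at time `now`; `None` if it never reported.
    pub fn status(&self, component: &str, now: Duration) -> Option<HealthStatus> {
        let last = *self.last_seen.get(component)?;
        if now.saturating_sub(last) > self.timeout {
            Some(HealthStatus::Unresponsive)
        } else {
            Some(HealthStatus::Healthy)
        }
    }

    /// Names of all components considered failed at time `now`, sorted.
    pub fn failed_components(&self, now: Duration) -> Vec<String> {
        let mut failed: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, last)| now.saturating_sub(**last) > self.timeout)
            .map(|(name, _)| name.clone())
            .collect();
        failed.sort();
        failed
    }
}
```

The list returned by `failed_components` is the kind of input an alerting rule could consume, although how alerts are actually wired up is not specified in this ADR.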
Alerting & Dashboards:
- Real-time alerting rules
- Operational dashboards
- Performance visualization
- Incident management integration
Component Integration Pattern
// Every component integrates with the Platform Observability Service
use std::sync::Arc;

impl ComponentName {
    pub async fn new(
        config_service: Arc<ConfigurationService>,
        platform_observability: Arc<PlatformObservabilityService>, // ← NEW
    ) -> Result<Self> {
        // Component initialization (remaining fields elided)...
        // Assumes the component stores both service handles and an instance ID;
        // new_instance_id() is a hypothetical helper.
        let component = Self {
            config_service,
            platform_observability,
            instance_id: new_instance_id(),
        };
        // Register with platform observability
        component
            .platform_observability
            .register_component(ComponentInfo {
                name: "component-name",
                instance_id: component.instance_id.clone(),
                health_endpoint: "/health",
                metrics_endpoint: "/metrics",
                version: "1.0.0",
            })
            .await?;
        Ok(component)
    }

    async fn process_request(&self, request: Request) -> Result<Response> {
        // Platform observability happens automatically via middleware;
        // the explicit span here adds operation-level tracing
        let span = self
            .platform_observability
            .start_span("component.process_request");
        // Component logic...
        let result = self.handle_request(request).await;
        // Finish the span on both the success and the error path
        span.finish();
        result
    }
}
Agent Observability Component Focus
// Agent Observability focuses on AI-specific intelligence
impl AgentObservability {
    pub async fn record_decision(&self, record: DecisionRecord) -> Result<()> {
        // Track decision-making intelligence
        // (the record is borrowed so it can be reused by the calls below)
        self.decision_analytics.record(&record).await?;
        // Update learning progress metrics
        self.learning_tracker.update_progress(&record).await?;
        // Analyze behavioral patterns
        self.behavior_analyzer.analyze_decision(&record).await?;
        Ok(())
    }

    pub async fn track_goal_achievement(&self, goal: Goal, outcome: Outcome) -> Result<()> {
        // Agent-specific goal tracking
        self.goal_tracker.record_outcome(&goal, &outcome).await?;
        // Update agent effectiveness metrics
        self.effectiveness_metrics.update(&goal, &outcome).await?;
        Ok(())
    }
}
Changes Made
Navigation Updates
Updated Component List:
- ❌ Removed: "👁️ Observability" from core components
- ✅ Added: "👁️ Agent Observability" to core components
- ✅ Added: "📊 Platform Observability" to infrastructure services
Documentation Structure
New Documentation Organization:
📐 Base AI Agent Architecture
├── 🔧 Component Deep Dive
│ ├── 🆔 Agent Identity
│ ├── 🧠 Memory Systems
│ ├── 🔗 Context Management
│ ├── 🌍 Environment Awareness
│ ├── 👁️ Agent Observability ← RENAMED
│ ├── 🛠️ Tools (MCP)
│ ├── 🤝 Agent Interface
│ └── 🔒 Security & Privacy
⚙️ Infrastructure Services
├── ⚙️ Configuration Management
└── 📊 Platform Observability ← NEW
Content Updates
Agent Observability Component:
- Focused content on AI-specific monitoring
- Removed infrastructure monitoring concerns
- Added agent intelligence tracking
- Emphasized decision-making and learning analytics
Platform Observability Service:
- New comprehensive documentation
- Infrastructure monitoring patterns
- Component integration guidelines
- Operational procedures and best practices
Follow-up Actions
Completed
- ✅ Created ADR-007 documenting the decision and rationale
- ✅ Updated navigation structure to reflect new organization
- ✅ Renamed Observability to Agent Observability in component list
- ✅ Updated architectural diagrams to show separation
Documentation Updates Required
- 📋 Create Platform Observability Service documentation
- 📋 Update Agent Observability content to focus on AI-specific concerns
- 📋 Update Base AI Agent Architecture overview
- 📋 Update all architectural diagrams to reflect new structure
Implementation Requirements
- 📋 Design Platform Observability Service architecture
- 📋 Create component integration patterns for platform observability
- 📋 Refactor Agent Observability to focus on AI-specific metrics
- 📋 Establish monitoring standards across all components
Lessons Learned
- Consistency Matters: Architectural patterns should be applied consistently across the platform
- User Insights Drive Better Architecture: The question revealed an important inconsistency
- Cross-Cutting Concerns: Clear criteria for what should be infrastructure vs. component-level
- Domain Boundaries: Different types of observability serve different purposes and users
Impact on Overall Architecture
Infrastructure Services
- Configuration Management: All components get configuration
- Platform Observability: All components get monitoring ← NEW
Agent Components
- Still 8 core components: Observability is renamed to Agent Observability and narrowed to AI-specific concerns, not removed
- Cleaner separation between infrastructure and domain concerns
- More focused component responsibilities
Operational Benefits
- Unified Monitoring: Single observability service for all infrastructure concerns
- Specialized Intelligence: Agent Observability focused on AI-specific insights
- Better Scalability: Each system optimized for its specific use case
Related Decisions
- ADR-004: Centralized Configuration Management (established the pattern)
- ADR-005: Component-Specific Configuration Handling (similar separation principle)
- ADR-002: Agent Identity Scope Limitation (proper component scoping)
This decision applies to observability the same separation-of-concerns principle that proved successful with Configuration Management, giving the platform a consistent, more maintainable architecture with unified operational monitoring.