What Observation Means

Observation is not just monitoring. Not "is the system up?" or "is it fast enough?" Observation is the act of looking at what happened and asking: what did we learn? what does this outcome mean? what should we do next?

Most teams already have operational observability. Dashboards glow green. Alerts fire when latency spikes. Uptime percentages decorate quarterly slides. This is monitoring, and it answers a narrow set of operational questions: Is the system up? Is it fast enough? Did latency spike? What was our uptime this quarter?

These are necessary but insufficient. They tell you whether the machine is running. They do not tell you whether the machine is doing the right thing.

Intent observation answers a fundamentally different category of question — product and learning questions that connect execution back to purpose: What did we learn? Did the outcome match the intent? What new opportunities did this execution reveal? What should we do next?

The difference is not semantic. Monitoring watches the system. Observation watches the learning. Monitoring answers "is it running?" Observation answers "is it working?" — where "working" means advancing understanding, reducing uncertainty, and generating new insight.

Here is a concrete example. Imagine you executed a spec to "improve signal capture latency from 5 seconds to 2 seconds." The monitoring dashboard would show a latency graph dropping. Green light, move on. But the observation dashboard reveals something richer:

Observation Example: Latency Improvement Spec

What happened: Actual latency is now 1.8 seconds (exceeded the 2-second goal by 10%).

The deeper insight: New signal count increased 40%. Signals that previously timed out during the 5-second window are now being captured. Cost per signal decreased 20% because infrastructure utilization improved — fewer retries, fewer dropped messages, less wasted compute.

The learning: One team member notices a pattern in the newly captured signals: "most of the previously missed signals come from high-frequency sensors (>100 Hz) that were timing out during peak hours on the global message queue."

The next signal: "High-frequency sensors need dedicated infrastructure. Propose a separate message queue tier for sensors exceeding 100 Hz to prevent peak-hour contention."

The observation reveals not just that the spec worked (latency target met), but why it mattered (new signals unlocked that were previously invisible) and what it enables (a targeted infrastructure optimization opportunity). That insight becomes the next signal in the queue. The loop closes.

This is the essential difference: monitoring is retrospective accounting. Observation is prospective learning. Monitoring says "the thing happened." Observation says "here is what the thing taught us, and here is what we should do next."

The Loop Closure Mechanism

Observation closes the feedback loop. Without it, execution is a dead end — work completes, results accumulate in dashboards nobody reads, and the team moves to the next item on the backlog without absorbing what just happened. With observation, every execution becomes a source of new understanding.

Here is how data flows from execution to insight to new intent:

```mermaid
graph TB
    Execute["Execution
    Contract completes
    Assertions verified"]
    Events["Events Emitted
    OTel traces
    Metrics, logs"]
    Collector["OTel Collector
    Ingest, process
    Filter attributes"]
    Storage["Storage Backends
    Tempo traces
    Mimir metrics
    Loki logs"]
    Dashboard["Grafana Dashboard
    Cycle time, traces
    Trust scores"]
    Human["Human Observes
    Reads dashboard
    Asks 'what did we learn?'"]
    Pattern["Pattern Recognized
    Insight discovered
    'What is next?'"]
    Signal["Signal Created
    Recorded in Git
    Becomes spec candidate"]

    Execute -->|"Events flow"| Events
    Events -->|"Telemetry"| Collector
    Collector -->|"Enriched data"| Storage
    Storage -->|"Query and visualize"| Dashboard
    Dashboard -->|"Observe"| Human
    Human -->|"Interpret"| Pattern
    Pattern -->|"Formalize"| Signal
    Signal -->|"Back to"| Execute

    style Execute fill:#1a1a2e,stroke:#f59e0b,stroke-width:2px,color:#f1f5f9
    style Events fill:#1a1a2e,stroke:#3b82f6,stroke-width:2px,color:#f1f5f9
    style Collector fill:#1a1a2e,stroke:#3b82f6,stroke-width:2px,color:#f1f5f9
    style Storage fill:#1a1a2e,stroke:#10b981,stroke-width:2px,color:#f1f5f9
    style Dashboard fill:#1a1a2e,stroke:#8b5cf6,stroke-width:2px,color:#f1f5f9
    style Human fill:#1a1a2e,stroke:#ec4899,stroke-width:2px,color:#f1f5f9
    style Pattern fill:#1a1a2e,stroke:#f59e0b,stroke-width:2px,color:#f1f5f9
    style Signal fill:#1a1a2e,stroke:#8b5cf6,stroke-width:2px,color:#f1f5f9
```

Source: loop-closure.mermaid

The loop is mechanical but the insight is human. Execution is deterministic — the spec runs or it does not. Events flow automatically through OpenTelemetry instrumentation. Storage scales horizontally across Grafana's backend trifecta: Tempo for distributed traces, Mimir for metrics, Loki for logs. The collector enriches and filters, adding context attributes that make later queries meaningful.
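In practice the collector's enrichment step is configured declaratively, but the idea is simple enough to sketch in plain Python. This is a hypothetical illustration, not the actual collector configuration; the attribute names are assumptions:

```python
# Hypothetical sketch of what an OTel Collector attributes processor does:
# keep only allow-listed attributes and stamp deployment context onto each
# event so later dashboard queries stay meaningful. All names illustrative.
ALLOWED_KEYS = {"spec.id", "signal.source", "trust.level", "duration_ms"}

def enrich(event: dict, deployment: str) -> dict:
    """Filter to allow-listed attributes and add a context attribute."""
    kept = {k: v for k, v in event.items() if k in ALLOWED_KEYS}
    kept["deployment.environment"] = deployment  # added context
    return kept

raw = {"spec.id": "latency-spec", "duration_ms": 1800, "debug.blob": "..."}
print(enrich(raw, "prod"))
```

The filtering matters as much as the enrichment: dropping noisy attributes at the collector keeps the storage backends queryable.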

But observation — the moment a human reads the dashboard and thinks "aha, I see the pattern" — that is where value is created. The system provides the data. The human provides the meaning. And the meaning, formalized as a new signal and recorded in Git, becomes the input for the next cycle.

This is not a metaphor. The loop is literal. Every signal has a source field. When that source is observation, you can trace the lineage: this signal was born because someone looked at a dashboard after an execution and noticed something worth investigating. The provenance chain is unbroken.
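A minimal sketch of what that provenance looks like in data, assuming a record shape like the one described (the field names here are illustrative, not the system's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical signal record: every signal carries a source field, and
# observation-born signals also carry a pointer to the execution observed.
@dataclass(frozen=True)
class Signal:
    description: str
    source: str                             # e.g. "observation", "manual"
    parent_execution: Optional[str] = None  # provenance link

def lineage(signal: Signal) -> str:
    """Trace a signal back to the execution whose observation created it."""
    if signal.source == "observation" and signal.parent_execution:
        return f"born from observing {signal.parent_execution}"
    return f"origin: {signal.source}"

s = Signal("dedicated queue tier for high-frequency sensors",
           source="observation", parent_execution="latency-spec-run-42")
print(lineage(s))
```

Because the record is committed to Git alongside the spec it spawns, the chain from dashboard glance to next execution is auditable after the fact.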

What the Dashboard Reveals

The Observe dashboard is not a status page. It is a thinking tool. It answers five critical questions about system health, agent capability, and learning velocity. Each question reveals a different dimension of how the system is evolving.

"How long does it take?"

Cycle Time Histogram

Cycle time measures the interval from signal.captured to contract.completed. The histogram shows you the distribution: are most specs clustered around 5 seconds? Do some outliers take 60+ seconds? Look for bimodal distributions — they often reveal two distinct classes of work being routed through the same pipeline.

A shift in the distribution is the clearest evidence that observation led to improvement. If the median was 10 seconds last week and 2 seconds this week, something changed. The dashboard does not just show you the number — it shows you the shape of the change.
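The computation behind the histogram is straightforward; a minimal sketch with illustrative timestamps (real data would come from the paired signal.captured and contract.completed events):

```python
from collections import Counter

# Cycle time = contract.completed - signal.captured, bucketed coarsely.
# Timestamps below are illustrative, in seconds.
events = [
    ("sig-1", 0.0, 4.8), ("sig-2", 1.0, 6.2), ("sig-3", 2.0, 7.1),
    ("sig-4", 3.0, 65.0),  # outlier: possibly a second class of work
]

def histogram(events, bucket_s=5):
    """Map each cycle time to its bucket's lower bound and count."""
    buckets = Counter()
    for _, captured, completed in events:
        cycle = completed - captured
        buckets[int(cycle // bucket_s) * bucket_s] += 1
    return dict(sorted(buckets.items()))

print(histogram(events))  # {0: 1, 5: 2, 60: 1} for the data above
```

A result like this, with a cluster at low buckets and a lone far bucket, is exactly the bimodal shape worth investigating.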

"Where do signals come from?"

Source Distribution

Signals originate from five tiers: manual human observation (Tier 1), execution events (Tier 2), automated clustering (Tier 3), feedback loop signals — observations of previous observations (Tier 4), and external integrations (Tier 5).

A healthy system has diversity across tiers. If 95% of signals are Tier 1 (manual), your automation is missing opportunities. If 0% are Tier 1, you have lost human judgment entirely. The ideal distribution shifts over time: early systems are heavily manual, mature systems have strong Tier 2–4 contribution while retaining meaningful Tier 1 input.
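The tier-diversity check described above can be sketched as a simple distribution computation (the tier labels and counts here are illustrative):

```python
from collections import Counter

# Illustrative week of signals tagged by originating tier.
signals = ["tier1"] * 6 + ["tier2"] * 8 + ["tier3"] * 4 + ["tier4"] * 2

def distribution(signals):
    """Fraction of signals per tier, including empty tiers."""
    total = len(signals)
    counts = Counter(signals)
    return {tier: counts.get(tier, 0) / total
            for tier in ("tier1", "tier2", "tier3", "tier4", "tier5")}

dist = distribution(signals)
# Health checks from the text: manual input present but not dominant.
assert 0 < dist["tier1"] < 0.95
print({t: round(p, 2) for t, p in dist.items()})
```

An alert on either boundary (manual share too high or vanishing to zero) turns the health heuristic into something the dashboard can enforce.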

"How much can agents do alone?"

Trust Score Distribution (L0–L4)

Trust scores reveal agent capability across five autonomy levels: L0 (no autonomy, human review required), L1 (simple routing), L2 (guided decisions with templates), L3 (complex execution with interpretation), and L4 (full autonomy including opportunity identification).

Most specs start at L1–L2. As agents prove themselves through successful execution and validated contract assertions, they progress to L3–L4. The distribution tells you where your agents are operating. Movement rightward means increasing trust. Regression means something broke — investigate immediately.
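One simple way to quantify "movement rightward" is to compare the mean trust level week over week; a sketch with illustrative counts per level (L0–L4 encoded as 0–4):

```python
# Detect trust progression or regression from weekly level distributions.
def mean_level(dist):
    """Weighted average autonomy level across all agents/specs."""
    total = sum(dist.values())
    return sum(level * n for level, n in dist.items()) / total

last_week = {1: 4, 2: 10, 3: 5, 4: 1}   # illustrative counts
this_week = {1: 2, 2: 8, 3: 8, 4: 2}

shift = mean_level(this_week) - mean_level(last_week)
print("rightward" if shift > 0 else "regression: investigate immediately")
```

A mean hides shape, so the full distribution still belongs on the dashboard; the scalar shift is just a cheap trigger for the "investigate immediately" rule.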

"What's failing?"

Contract Assertion Pass/Fail Rate

Every spec includes explicit contract assertions: "signal count > 1000", "latency < 2s", "backward compatibility maintained." When a contract fails, the dashboard shows it in red. But failure is not the end of the story — it is the beginning of the next observation.

When a contract fails, observation must ask: Why? Was the spec poorly defined? Was the environment different than expected? Is the assertion too strict, or is the system genuinely degraded? Failed contracts are learning opportunities, not incidents. The failure rate trend matters more than any individual failure.
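Contract assertions like the ones quoted above can be modeled as named predicates evaluated against measured outcomes; a minimal sketch (the outcome values are illustrative):

```python
# Evaluate the example contract assertions from the text against a
# measured execution outcome. Values below are illustrative.
outcome = {"signal_count": 1450, "latency_s": 1.8, "backward_compatible": True}

assertions = [
    ("signal count > 1000", lambda o: o["signal_count"] > 1000),
    ("latency < 2s",        lambda o: o["latency_s"] < 2.0),
    ("backward compatibility maintained",
                            lambda o: o["backward_compatible"]),
]

results = {name: pred(outcome) for name, pred in assertions}
pass_rate = sum(results.values()) / len(results)
print(results, pass_rate)
```

Keeping assertions as named, data-driven predicates is what lets the dashboard render each one individually in red or green rather than collapsing the contract to a single bit.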

"What patterns are emerging?"

Signal Clustering Trends

Patterns emerge over time as signals cluster around themes. One week, all signals cluster around "latency." The next week, a new cluster emerges: "trust score plateau." The week after, a cross-cutting pattern appears: "latency improvements correlate with trust score increases."

A growing cluster count indicates increasing observational sophistication — the system is learning to see finer distinctions. A stable count suggests a plateau — either the system has reached equilibrium or observation has become routine. A declining count may indicate overfitting: the system is collapsing distinct signals into too-broad categories.
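The real clustering is automated (Tier 3), but the idea can be sketched with naive keyword grouping; the themes and signal texts here are illustrative:

```python
from collections import defaultdict

# Naive theme clustering by keyword match; a real system would use
# automated clustering. A signal may land in more than one cluster,
# which is how cross-cutting patterns surface.
THEMES = ("latency", "trust", "cost")

def cluster(signals):
    clusters = defaultdict(list)
    for s in signals:
        for theme in THEMES:
            if theme in s.lower():
                clusters[theme].append(s)
    return dict(clusters)

week = ["Reduce capture latency", "Trust score plateau on schema agent",
        "Latency improvements correlate with trust increases"]
print({theme: len(items) for theme, items in cluster(week).items()})
```

The third signal above lands in both the latency and trust clusters, which is exactly the cross-cutting pattern the text describes.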

From Observation to New Signals

The power of observation is not in seeing — it is in what seeing produces. Every observation that generates a new signal closes the loop. Here are three concrete examples of how execution outcomes flow through observation into new intent.

Example 1: Latency Observation → Infrastructure Signal

Execution: A spec to improve signal capture latency was executed. The contract target was 2 seconds. Actual result: latency dropped from 5 seconds to 1.8 seconds.

Observation: The dashboard revealed more than a latency improvement. New signals were being captured that previously timed out. The distribution shift was dramatic: formerly 30% of signals timed out during processing, now less than 5%. The timeout cluster was almost entirely high-frequency IoT sensors exceeding 100 Hz.

Insight: "Most missed signals come from high-frequency IoT sensors that contend on the global message queue during peak hours. The latency improvement unmasked a capacity problem that was previously invisible because those signals never made it to the dashboard."

Signal Created: "Propose dedicated queue infrastructure for high-frequency sensor streams to eliminate peak-hour contention."

Next Spec: infrastructure-spec.md — Provision a separate Kafka topic for high-frequency signals (>100 Hz), implement sensor-type routing logic at the collector level, validate with load test simulating peak-hour volumes.
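The sensor-type routing logic proposed in that spec could look something like this sketch; the topic names and threshold constant are hypothetical:

```python
# Hypothetical collector-level routing rule from infrastructure-spec.md:
# sensors above 100 Hz go to a dedicated topic, everything else stays
# on the shared queue. Topic names are illustrative.
HIGH_FREQ_THRESHOLD_HZ = 100

def route(sensor_id: str, frequency_hz: float) -> str:
    if frequency_hz > HIGH_FREQ_THRESHOLD_HZ:
        return "signals.high-frequency"  # dedicated tier, no peak contention
    return "signals.default"             # shared global queue

print(route("vibration-7", 250.0))
print(route("thermostat-3", 0.1))
```

The load-test validation step in the spec would then replay peak-hour volumes against both topics and assert the timeout rate stays under the observed 5%.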

Example 2: Agent Execution → Autonomy Evolution Signal

Execution: An agent executing a spec to "refactor signal schema for v2 compatibility" completed successfully. Contract assertion passed: backward compatibility maintained, all existing consumers continued to function without modification.

Observation: The dashboard trust score for this agent increased from L2 (guided decisions) to L3 (complex execution). During the refactoring, the agent made 5 schema design decisions autonomously — field naming conventions, nullable handling, enum expansion strategy, index optimization, and migration path selection. All five decisions were validated correct in post-execution review.

Insight: "This agent is ready for L3 autonomy on schema-related specifications. Trust score is trending upward with zero regressions across the last 8 executions. The decisions it made during this spec demonstrate judgment, not just pattern-matching."

Signal Created: "Consider delegating spec interpretation for schema-domain work directly to this agent without human pre-review."

Next Spec: agent-autonomy-spec.md — Update agent routing rules to assign schema-related specs directly to this agent at L3 trust. Add a post-execution validation gate (rather than pre-execution review) to maintain safety while increasing throughput.
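The updated routing rule is essentially a two-branch policy: trusted domain work skips pre-review but gains a post-execution gate. A hypothetical sketch (field names and the agent identifier are illustrative):

```python
# Hypothetical routing policy from agent-autonomy-spec.md: schema-domain
# specs at L3+ trust go straight to the agent with a post-execution
# validation gate; everything else keeps human pre-review.
def route_spec(domain: str, trust_level: int) -> dict:
    if domain == "schema" and trust_level >= 3:
        return {"assignee": "schema-agent", "pre_review": False,
                "post_execution_gate": True}
    return {"assignee": "human-queue", "pre_review": True,
            "post_execution_gate": False}

print(route_spec("schema", 3))
print(route_spec("schema", 2))
```

Moving the gate from before execution to after trades a small risk window for throughput, which is the explicit design choice the spec names.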

Example 3: Cost Observation → Optimization Signal

Execution: No single spec triggered this observation. Instead, the dashboard revealed a trend: per-execution cost increased 40% over two weeks. The root cause was visible in the metrics breakdown — Claude API token usage grew because more Opus calls were being made relative to Haiku calls.

Observation: The agent responsible for complexity classification was routing too conservatively. Specs that should have been classified as "simple" (appropriate for Haiku-class models) were being classified as "complex" (routed to Opus). The false-positive rate on complexity classification had drifted from 8% to 35%.

Insight: "The complexity classification prompt is too conservative. It lacks concrete examples of what constitutes 'simple' vs. 'complex' in our domain. Without exemplars, the agent defaults to caution — which is safe but expensive."

Signal Created: "Refine agent complexity classification prompt with domain-specific examples. Add signal cluster exemplars to reduce false positives on complexity assessment."

Next Spec: prompt-refinement-spec.md — Iterate the agent system prompt with 10 concrete examples spanning the simple/complex boundary. Measure reduction in Opus over-allocation. Target: restore false-positive rate to below 10% within one week.
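The false-positive rate that spec targets is a plain ratio: of the specs that were truly simple, how many did the classifier route to the expensive model? A sketch with illustrative labels chosen to reproduce the drifted 35% figure from the text:

```python
# False-positive rate of the complexity classifier, where a false
# positive = a truly simple spec classified as complex (and therefore
# routed to the expensive model). (truth, prediction) pairs illustrative.
predictions = ([("simple", "complex")] * 7 + [("simple", "simple")] * 13
               + [("complex", "complex")] * 5)

simple_total = sum(1 for truth, _ in predictions if truth == "simple")
false_pos = sum(1 for truth, pred in predictions
                if truth == "simple" and pred == "complex")
fp_rate = false_pos / simple_total
print(f"{fp_rate:.0%}")  # target from the spec: below 10%
```

Measuring the rate this way requires ground-truth complexity labels, which post-execution review provides: if Haiku would have passed the contract, the spec was simple.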

Each example follows the same pattern: execution produces data, observation extracts meaning, meaning becomes a new signal, and the signal drives the next specification. The loop is not aspirational — it is structural. Every piece of the chain is traceable, every insight is recorded, and every new signal carries the provenance of the observation that created it.