# How Investigations Work

When a production issue is reported, whether through an alert or a direct request, Cleric immediately begins a systematic investigation to identify the root cause and provide actionable solutions.

## How Cleric Investigates

Cleric approaches each investigation systematically but adapts its workflow based on the specific problem and emerging evidence.

### Initial Assessment

When Cleric receives a problem report, it immediately gathers contextual information:

**Problem Analysis**

* Parse metadata: severity, source system, affected components
* Extract labels, tags, and dimensional data
* Identify related infrastructure and services
* Determine investigation scope

**Environmental Assessment**

* Check system status and recent changes
* Identify related issues
* Assess resource utilization and performance baselines
* Review recent deployments and configuration changes

### Analysis and Planning

Cleric develops potential explanations and creates an investigation plan, considering:

* **Infrastructure**: Resource constraints, networking, storage, node failures
* **Application**: Code changes, configuration, dependencies, scaling
* **Environmental**: External service disruptions, traffic patterns
* **Operational**: Maintenance, deployments, team changes

### Data Gathering

Cleric queries your connected integrations to gather relevant data. The investigation adapts based on emerging evidence. If initial findings suggest a database issue, Cleric pivots to database-specific queries and metrics.

### Correlation

Cleric correlates data across systems to identify relationships that might not be obvious from individual data points:

**Timeline Correlation**

* Align events across monitoring systems
* Identify causal relationships between changes and symptoms
* Map event sequences to understand failure progression

**Cross-System Pattern Recognition**

* Connect application errors with infrastructure or configuration changes
* Correlate resource exhaustion with traffic or deployment activity
* Link performance degradation with updates, dependencies, or external services
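The timeline-correlation step above can be pictured as merging timestamped events from several integrations into one ordered sequence. Here is a minimal sketch in Python; the event data and source names are hypothetical and the tuple format is illustrative, not Cleric's internal representation:

```python
from datetime import datetime

# Hypothetical events gathered from different integrations.
deploy_events = [("2024-01-15T15:00", "deploys", "checkout-service v2.4.0 rolled out")]
metric_events = [("2024-01-18T13:45", "metrics", "worker-node-3 disk usage crossed 95%")]
k8s_events = [
    ("2024-01-18T14:00", "kubernetes", "worker-node-3 entered NotReady"),
    ("2024-01-18T14:15", "kubernetes", "pod evictions began on worker-node-3"),
]

def build_timeline(*sources):
    """Merge events from every source and order them by timestamp."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: datetime.fromisoformat(e[0]))

timeline = build_timeline(deploy_events, metric_events, k8s_events)
for ts, system, description in timeline:
    print(f"{ts}  [{system}]  {description}")
```

Once events from all systems sit on a single timeline, cause-before-effect ordering (deploy, then disk pressure, then evictions) becomes visible at a glance.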

### Self-Evaluation

Before presenting conclusions, Cleric evaluates its own findings to improve accuracy:

* Temporal relationships between cause and effect
* Consistency with known system behavior
* Quality and quantity of supporting evidence
* Alignment with historical patterns

This helps Cleric avoid presenting weak correlations as root causes.
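One way to picture this step is as a checklist score over the four criteria above. The weights, threshold, and check names below are purely illustrative, not Cleric's actual scoring:

```python
def evaluate_hypothesis(checks):
    """Accept a hypothesis only when enough criteria support it."""
    passed = sum(1 for ok in checks.values() if ok)
    return passed / len(checks) >= 0.75  # illustrative threshold

# Illustrative checks mirroring the four criteria above.
checks = {
    "cause_precedes_effect": True,        # temporal relationship holds
    "consistent_with_known_behavior": True,
    "sufficient_evidence": True,
    "matches_historical_patterns": False,
}
print(evaluate_hypothesis(checks))  # 3/4 = 0.75 -> True
```

A hypothesis that fails too many checks is treated as a weak correlation rather than a root cause.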

### Results

Investigations conclude with structured results that provide clear, actionable insights. Throughout the process, Cleric maintains detailed logs of all tool executions, reasoning steps, and discoveries to provide full transparency.

**Example Investigation Result:**

{% code overflow="wrap" %}

```
**Root Cause:** Node worker-node-3 disk space exhaustion caused by increased logging from checkout-service deployment

**Timeline:**
• Jan 15, 3:00 PM: checkout-service v2.4.0 deployed with debug logging enabled
• Jan 18, 1:45 PM: Disk usage reached 95% on worker-node-3
• Jan 18, 2:00 PM: Node entered NotReady state
• Jan 18, 2:15 PM: Pod evictions began

**Impact:**
• 8 pods evicted from worker-node-3
• Affected services: checkout-service (2 pods), inventory-api (1 pod)
• No user-facing impact (remaining replicas handled traffic)

**Evidence:**
• Node logs show "disk pressure" warnings, disk at 98% capacity
• Disk usage metrics show steady increase since Jan 15
• checkout-service pods on this node produced 45GB of logs in 3 days (2GB/day → 15GB/day)
```

{% endcode %}
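The figures in the evidence section can be sanity-checked with simple rate arithmetic (the numbers below are taken directly from the example report):

```python
# Figures from the example report above.
elevated_rate_gb_per_day = 15  # logging rate after the v2.4.0 deploy
baseline_rate_gb_per_day = 2   # logging rate before the deploy
days_elapsed = 3

total_logs_gb = elevated_rate_gb_per_day * days_elapsed
print(total_logs_gb)  # 45 GB, matching the evidence
print(elevated_rate_gb_per_day / baseline_rate_gb_per_day)  # 7.5x increase in log volume
```

This kind of internal consistency between timeline, evidence, and impact is what makes a result actionable.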

## Key Capabilities

Through investigations, Cleric develops an understanding of your environment:

* **Pattern Recognition**: Identifies recurring issues and successful resolutions
* **Contextual Analysis**: Learns system topology, naming conventions, and operational patterns
* **Multi-Platform Correlation**: Works across Kubernetes, Datadog, Prometheus, Grafana, and more

## Examples

### Resource Exhaustion

**Problem**: "Kubernetes Node NotReady"

**Sample Investigation Flow**:

1. Check node status, recent pod scheduling, and resource utilization trends
2. Analyze system-level logs and metrics for capacity constraints or hardware issues
3. Review recent deployment or scaling events that might have increased resource demand
4. Examine pod placement and resource quotas to identify constraint sources
5. Correlate findings with historical patterns and recommend specific remediation steps

**Conclusion**: The node ran out of disk space due to log accumulation from a recent application update that increased logging verbosity. The findings include cleanup steps and prevention recommendations.

### Application Performance Degradation

**Problem**: "API Response Time SLO Violation"

**Sample Investigation Flow**:

1. Analyze response time distribution and identify which endpoints are affected
2. Correlate timing with recent deployments, configuration changes, or traffic pattern shifts
3. Examine application logs for error patterns, database query performance, or dependency timeouts
4. Check infrastructure resources and scaling behavior during the performance degradation period
5. Provide specific recommendations based on root cause identification

**Conclusion**: Database query performance regressed after a recent deployment. The findings include SQL optimization recommendations and rollback guidance.

## When Cleric Needs More Information

How Cleric handles uncertainty depends on how the investigation was started:

### User-Initiated Issues (Interactive)

When you start an investigation via `@Cleric` in Slack or the web UI, Cleric is interactive. Rather than running to completion in one pass, Cleric investigates, shares findings, and asks questions before continuing. This happens when:

* **Ambiguous scope**: The issue could apply to multiple services, environments, or clusters, and Cleric needs you to narrow it down
* **Missing access**: Cleric identifies a relevant data source it cannot reach and asks whether you can provide access or supply the data directly
* **Conflicting evidence**: Cleric finds evidence pointing in multiple directions and presents options for which hypothesis to prioritize
* **Environment-specific context**: Cleric needs information it can't find in your tools, such as whether a deployment was intentional or whether a team is running a load test

Reply in the same Slack thread or web chat. Cleric incorporates your answer and continues investigating.

### Alert-Triggered Investigations (Automatic)

When an alert triggers an investigation automatically, Cleric does not ask questions. It completes the investigation using available data and reports any gaps or access limitations in its findings. If critical data was inaccessible, Cleric notes this so you know what couldn't be checked.

## Alert Grouping

When a system alert fires in a Slack channel that already has a recent issue, and the new alert looks like the same problem, Cleric attaches it to that existing issue instead of creating a new one. This keeps related alerts together and avoids parallel investigations for the same issue.

Cleric groups an incoming alert with an existing issue when all of these conditions hold:

* **Same Slack channel**: The new alert and the existing issue are in the same channel.
* **Within an active window**: The existing issue received an alert recently. After enough time passes with no new alerts, Cleric treats new alerts as a separate issue.
* **Similar text**: The new alert's monitor closely matches the existing issue's. Cleric normalizes numeric values (counts, percentages, timestamps) before comparing, so two alerts that differ only in their metric values group together.

When an alert matches an existing issue, Cleric replies in the alert's Slack thread with a link to the issue ("Assigned this alert to issue \[link]") and does not create a new issue. The issue page header shows the count of grouped alerts and lists when each alert fired.

If no match is found (different channel, different content, or the most recent alert in the channel was too long ago), Cleric creates a new issue.
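The text-similarity condition can be illustrated with numeric normalization: strip counts, percentages, and timestamps before comparing, so two alerts that differ only in their metric values look identical. This sketch is a simplification; the regex and the exact-match comparison stand in for whatever similarity check Cleric actually uses:

```python
import re

def normalize(alert_text):
    """Replace numeric values (counts, percentages, timestamps) with a placeholder."""
    return re.sub(r"\d+(?:\.\d+)?%?", "<num>", alert_text)

a = "Disk usage at 95% on worker-node-3"
b = "Disk usage at 98.2% on worker-node-3"
print(normalize(a))                  # "Disk usage at <num> on worker-node-<num>"
print(normalize(a) == normalize(b))  # True: the alerts would group together
```

After normalization, the two disk-usage alerts above collapse to the same string, while an unrelated alert (say, CPU throttling) would not.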

## When Cleric Can't Find a Root Cause

Not every investigation reaches a definitive conclusion. When Cleric can't identify a root cause:

* **Partial results**: Cleric shares what it found, even if incomplete. This might include relevant log entries, metric anomalies, or configuration changes that didn't conclusively explain the issue.
* **Access limitations**: If Cleric couldn't reach a data source due to permissions, connectivity, or missing integrations, it notes this in its response so you know what couldn't be checked.
* **Explicit uncertainty**: Cleric states when evidence is inconclusive rather than guessing. You'll see language like "could not determine" or "insufficient evidence" in the findings.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.cleric.ai/investigation/how-investigations-work.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
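As an example, the request URL can be built with Python's standard library, URL-encoding the question into the `ask` parameter (the question text below is just a sample):

```python
from urllib.parse import urlencode

base = "https://docs.cleric.ai/investigation/how-investigations-work.md"
question = "Which integrations does Cleric query during an investigation?"  # sample question

# Build the GET URL with the question URL-encoded in the `ask` parameter.
url = f"{base}?{urlencode({'ask': question})}"
print(url)
# Issue the request with any HTTP client, e.g. urllib.request.urlopen(url).
```

Any HTTP client works; the only requirement is a plain GET with a properly encoded `ask` parameter.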
