# Common Use Cases

> **New to Cleric?** Start with the [Quick Start Guide](/meet-cleric/quick-start.md) to run your first investigation, then return here for more examples.

This guide collects example prompts that show the different ways to use Cleric. Beyond automated alert triage, you can use Cleric to troubleshoot production issues, run proactive health scans, analyze observability data, and assist with incident retrospectives.

## Quick Tips

* **Be specific**: Help Cleric scope its investigation by including details like time windows (and timezones!), identifiers, and resource names
* **Describe the symptom**: Include what users are seeing and any error messages, and attach images when appropriate
* **Ask for evidence**: Request links to dashboards, log queries, and traces, and ask Cleric to cite its sources
* **Follow up**: Ask for additional hypotheses, further investigation, and remediation proposals

## Example Investigation Flow

Here's what a complete investigation looks like:

{% code overflow="wrap" %}

```
@Cleric Users are reporting that checkout is failing intermittently since 2pm CET with 503s. Check the GCP logs and correlate with any recent deployments.
```

{% endcode %}

**Cleric's response:**

{% code overflow="wrap" %}

```
**Root Cause:** Database connection pool exhaustion in checkout-service

**Timeline:**
• 2:00 PM CET: Traffic increased 3.5x due to marketing campaign launch
• 2:15 PM CET: Connection pool reached 100% utilization (10/10 connections)
• 2:15 PM CET: First 503 errors appeared in logs

**Evidence:**
• GCP Logs show "waiting for available database connection" warnings
• Connection pool metrics show sustained saturation since 2:15 PM (10/10 connections)
• Error rate: 12% of checkout requests (847 failures / 7,058 requests)
```

{% endcode %}

**Follow-up conversation:**

{% code overflow="wrap" %}

```
What's the typical connection pool usage outside of incidents like this?
```

{% endcode %}

{% code overflow="wrap" %}

```
Last 30 days:
• Baseline: 4-6 connections (40-60% of pool)
• Peak hours (9am-5pm CET): 7-8 connections (70-80%)
• Off-hours: 2-3 connections (20-30%)

Today: 10/10 sustained (100% utilization, unusual)
```

{% endcode %}

## Customizing Your Prompts

You can modify any prompt by specifying:

| Dimension         | Examples                                                                |
| ----------------- | ----------------------------------------------------------------------- |
| **Data source**   | "using Prometheus...", "check Datadog for...", "search GCP logs for..." |
| **Scope**         | "for checkout-api", "in the UAT environment", "in the us-west region"   |
| **Time window**   | "in the last 4 hours", "since 2pm PST", "on Jan 15, 2-4pm UTC"          |
| **Output format** | "with evidence links", "as a timeline", "summarize in bullet points"    |

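If you keep prompt templates for your team, the four dimensions in the table can be combined mechanically. The sketch below is only an illustration: `build_prompt` and its parameter names are hypothetical and not part of Cleric, and the sentence layout it produces is one arbitrary choice among many.

```python
def build_prompt(request, data_source=None, scope=None,
                 time_window=None, output_format=None):
    """Compose a prompt from a base request plus optional dimensions.

    Hypothetical helper for illustration only; Cleric accepts free-form text,
    so any phrasing that carries the same details works equally well.
    """
    prompt = request.strip().rstrip(".")
    # Scope and time window read naturally as trailing clauses of the request.
    for clause in (scope, time_window):
        if clause:
            prompt += " " + clause.strip()
    prompt += "."
    # Data source and output format become short follow-up sentences.
    if data_source:
        prompt += f" Use {data_source.strip()}."
    if output_format:
        prompt += f" Respond {output_format.strip()}."
    return prompt


print(build_prompt(
    "Check error rates",
    data_source="Datadog",
    scope="for checkout-api",
    time_window="in the last 4 hours",
    output_format="as a timeline",
))
```

Omitting a dimension simply drops that clause, so the same helper covers both short and fully scoped prompts.
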
## Investigate Issues

Understand what's happening when something goes wrong.

### Incident/bug investigation

{% hint style="info" %}
Prefix any prompt with `!investigate` to trigger a full deep investigation instead of conversational mode. See [Deep Investigations](/investigation/collaborating-with-cleric.md#deep-investigations-with-investigate).
{% endhint %}

{% code overflow="wrap" %}

```
Users are reporting that checkout is failing intermittently since 2pm CET with 503s. Check the GCP logs and correlate with any recent deployments.
```

{% endcode %}

{% code overflow="wrap" %}

```
Look for logs in the last 24h related to app installations in the canary instance and tell me if you see anything indicating a failure. The user reports seeing an error about a missing field.
```

{% endcode %}

### Service health

{% code overflow="wrap" %}

```
Check payments health since 9am PST, include evidence links.
```

{% endcode %}

<details>

<summary>Example response</summary>

{% code overflow="wrap" %}

```
Payments service health check (since 9:00 AM PST):

**Status:** ✅ Healthy

**Metrics:**
• Uptime: 100% (no restarts or crashes)
• Error rate: 0.08% (3 errors / 3,847 requests) - within normal range
• P95 latency: 245ms (baseline: 220-250ms)
• Memory usage: 1.2GB / 2.0GB (60% - normal)

**Recent Activity:**
• No deployments since yesterday 3:15 PM PST
• Traffic volume: 15% above baseline (expected for Monday morning)

**Alerts:** None active

**Evidence:** [Datadog dashboard →] | [Error logs →] | [Metrics →]

Everything looks healthy. The 3 errors were timeout exceptions from the external payment gateway (Stripe) and were automatically retried successfully.
```

{% endcode %}

</details>

{% code overflow="wrap" %}

```
Are there any open heartbeat alerts for the marketing site?
```

{% endcode %}

### Debugging errors

{% code overflow="wrap" %}

```
Debug why Prometheus in the platform cluster has unhealthy ingesters since 10am EST
```

{% endcode %}

<details>

<summary>Example response</summary>

{% code overflow="wrap" %}

```
**Root Cause:** Disk space exhaustion on ingester nodes

**Timeline:**
• 9:45 AM EST: Disk usage reached 95% on ingester-2 and ingester-3
• 10:00 AM EST: Ingesters marked as unhealthy by Prometheus
• 10:15 AM EST: Write requests began failing with "insufficient disk space" errors

**Impact:**
• Write path degraded: 40% of metrics samples dropped
• Read path operational: Query performance unaffected
• Data loss: ~2 hours of metrics for affected series
• Node count: 3 ingesters, 2 unhealthy

**Evidence:**
• Ingester logs show "no space left on device" errors, disk at 98% utilization
• WAL segment size increased 300% in last 7 days due to high cardinality metrics
```

{% endcode %}

</details>

{% code overflow="wrap" %}

```
Debug why the KEDA autoscaler has been throwing errors for the last 4 hours
```

{% endcode %}

### Logs & metrics analysis

{% code overflow="wrap" %}

```
Using GCP logs, get the most common errors across all our customers in the last day.
```

{% endcode %}

{% code overflow="wrap" %}

```
Search Datadog logs for any errors related to user 3bc96ec9-9367-4e40-8673-6a2fc9ceb87b
```

{% endcode %}

{% code overflow="wrap" %}

```
Check the last 7 days of error logs for the acme instance. Group the logs into top 3 error patterns with counts and first/last seen and print it as a markdown table.
```

{% endcode %}

<details>

<summary>Example response</summary>

{% code overflow="wrap" %}

```
Top 3 error patterns (last 7 days):

| Error Pattern | Count | First Seen | Last Seen | Severity |
|--------------|-------|------------|-----------|----------|
| `ConnectionPoolTimeoutException: Could not acquire connection from pool` | 1,247 | Jan 23, 08:15 AM | Jan 29, 11:42 PM | ERROR |
| `ValidationError: Missing required field 'customer_id' in request` | 892 | Jan 24, 02:30 PM | Jan 30, 09:15 AM | ERROR |
| `HTTPException: 503 Service Unavailable from pricing-api` | 456 | Jan 26, 10:00 AM | Jan 29, 06:30 PM | ERROR |

**Analysis:**
• Connection pool errors: steady increase (avg 178/day)
• Validation errors: spiked Jan 27 (402 occurrences), then decreased
• 503 errors from pricing-api: intermittent, clustered around peak hours
```

{% endcode %}

</details>

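To make the request in the prompt above concrete, the sketch below shows what "top error patterns with counts and first/last seen" means when computed by hand. It is a minimal illustration, not how Cleric works internally: the function names, the `(timestamp, pattern)` record shape, and the sample data (including the year) are all assumptions.

```python
from datetime import datetime


def top_error_patterns(records, n=3):
    """Group (timestamp, pattern) pairs and return the n most frequent patterns."""
    stats = {}
    for ts, pattern in records:
        entry = stats.setdefault(
            pattern,
            {"pattern": pattern, "count": 0, "first_seen": ts, "last_seen": ts},
        )
        entry["count"] += 1
        entry["first_seen"] = min(entry["first_seen"], ts)
        entry["last_seen"] = max(entry["last_seen"], ts)
    # Most frequent first; ties keep first-encountered order (sorted is stable).
    return sorted(stats.values(), key=lambda e: e["count"], reverse=True)[:n]


def to_markdown(rows):
    """Render the grouped rows as a markdown table."""
    lines = [
        "| Error Pattern | Count | First Seen | Last Seen |",
        "| --- | --- | --- | --- |",
    ]
    for row in rows:
        first = row["first_seen"].isoformat(sep=" ", timespec="minutes")
        last = row["last_seen"].isoformat(sep=" ", timespec="minutes")
        lines.append(f"| `{row['pattern']}` | {row['count']} | {first} | {last} |")
    return "\n".join(lines)


# Hypothetical sample records; real logs would first need their messages
# normalized into patterns (for example by stripping IDs and numbers).
sample = [
    (datetime(2024, 1, 23, 8, 15), "TimeoutError"),
    (datetime(2024, 1, 24, 14, 30), "ValidationError"),
    (datetime(2024, 1, 25, 9, 0), "TimeoutError"),
    (datetime(2024, 1, 26, 10, 0), "TimeoutError"),
]
print(to_markdown(top_error_patterns(sample)))
```
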
{% code overflow="wrap" %}

```
Show the top 10 services with highest error rate increase vs baseline since midnight
```

{% endcode %}

### CI/CD failures

{% code overflow="wrap" %}

```
Investigate why the last 3 web-app pipeline runs failed at the same post-deploy step waiting for the new version to come up.
```

{% endcode %}

## Proactive Scans

Run fleet-wide scans to catch issues before they become incidents, or to improve observability data hygiene.

### Warning spike scan

{% code overflow="wrap" %}

```
Run a proactive system health check. Find services with a spike in WARNING logs compared to baseline. List the warning patterns and note whether they correlate with any recent deploys.
```

{% endcode %}

<details>

<summary>Example response</summary>

{% code overflow="wrap" %}

```
3 services with elevated WARNING logs:

**1. inventory-api** (+242% vs baseline)
• Warning pattern: "Cache miss rate above threshold: 45%"
• Baseline: 12 warnings/hour → Current: 41 warnings/hour
• Correlation: Deployed v2.8.0 on Jan 28 at 2pm (Redis cache configuration changed)

**2. notification-service** (+180% vs baseline)
• Warning pattern: "Retry queue depth exceeds 1000 messages"
• Baseline: 5 warnings/hour → Current: 14 warnings/hour
• Correlation: No recent deployments (last deploy: Jan 20)

**3. analytics-service** (+125% vs baseline)
• Warning pattern: "Query execution time exceeded 5s threshold"
• Baseline: 8 warnings/hour → Current: 18 warnings/hour
• Correlation: No deployments, data volume increased 3x on Jan 27

**Summary:**
• 1 of 3 spikes correlates with deployment
• 2 spikes have no deployment correlation (retry queue backlog, data volume growth)
• No critical issues detected
```

{% endcode %}

</details>

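The "+N% vs baseline" figures in a response like this are a relative increase, `(current - baseline) / baseline`. The sketch below reproduces that arithmetic for the notification-service numbers from the example (5 to 14 warnings/hour). The function names, the 100% threshold, and the second service are assumptions made for illustration.

```python
def percent_increase(baseline, current):
    """Relative increase of current over baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) * 100 / baseline


def flag_spikes(rates, threshold_pct=100.0):
    """Return (service, increase) pairs at or above the threshold, largest first.

    rates maps a service name to (baseline_per_hour, current_per_hour).
    The default threshold is an arbitrary choice for this sketch.
    """
    spikes = [
        (service, percent_increase(baseline, current))
        for service, (baseline, current) in rates.items()
    ]
    spikes = [item for item in spikes if item[1] >= threshold_pct]
    return sorted(spikes, key=lambda item: item[1], reverse=True)


rates = {
    "notification-service": (5, 14),  # figures from the example response
    "hypothetical-quiet-service": (20, 22),  # made-up service below the threshold
}
print(flag_spikes(rates))
```
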
### Log cleanup

{% code overflow="wrap" %}

```
Identify the top 5 services that produced the most logs over the last 7 days. Recommend changes we can make to reduce log volume.
```

{% endcode %}

## Incident Retrospectives

Get help answering retrospective questions or building post-incident timelines.

### Structured retrospective

{% code overflow="wrap" %}

```
Use the attached conversation log to produce a structured retrospective report. Establish start time and mitigation time, build an evidence-based timeline, and identify contributing factors. Include evidence links.
```

{% endcode %}

### Timeline construction

{% code overflow="wrap" %}

```
Help me build a timeline for the incident yesterday between 2pm and 4pm PST.
```

{% endcode %}

<details>

<summary>Example response</summary>

{% code overflow="wrap" %}

```
**Incident Timeline: Jan 29, 2pm - 4pm PST**

**14:00 - Detection**
• PagerDuty alert: "High API Error Rate - checkout-service" [Alert →]
• User reports begin in #support channel (5 reports of checkout failures)

**14:05 - Initial Response**
• On-call engineer acknowledges alert
• Error rate: 15% (baseline: <0.5%)
• P95 latency: 3.2s (baseline: 450ms)

**14:12 - Investigation Begins**
• Checkout-service logs show database connection timeouts
• Connection pool: 30/30 connections in use (100% utilization)
• No recent deployments to checkout-service

**14:18 - Root Cause Identified**
• Database primary experiencing high CPU (95% utilization)
• Slow query detected: Complex JOIN query from analytics-service
• Query running since 13:58 (2 minutes before detection)

**14:25 - Mitigation Started**
• Analytics team contacted
• Long-running query killed on database primary
• Connection pool begins recovering

**14:32 - Service Recovery**
• Error rate drops to 2%
• Latency returns to near baseline (480ms)
• Connection pool utilization: 45%

**14:45 - Full Resolution**
• Error rate: <0.5% (normal)
• All services healthy
• User reports cease

**15:30 - Post-Incident Actions**
• Query timeout added to analytics-service (30s limit)
• Database query monitoring alerts configured
• Incident review scheduled for Jan 31

**Duration:** 45 minutes (detection to resolution)
**Impact:** ~15% of checkout transactions failed during incident window
```

{% endcode %}

</details>

{% code overflow="wrap" %}

```
What were the contributing factors to the web-app outage last Tuesday?
```

{% endcode %}

## Retrieve Knowledge

Query your documentation, your codebase, or Cleric's knowledge.

### Documentation lookups

{% code overflow="wrap" %}

```
What does our GitOps Runbook say about the debugging flow?
```

{% endcode %}

{% code overflow="wrap" %}

```
Search Confluence for the on-call escalation process.
```

{% endcode %}

### Codebase exploration

{% code overflow="wrap" %}

```
Investigate under what circumstances a `user_created` event would be tagged with a `source` of "system unknown"
```

{% endcode %}

### Issue/tenant lookup

{% code overflow="wrap" %}

```
What tenant does issue_id 4c4d742c-2532-4941-8106-7ba112b7fa92 belong to?
```

{% endcode %}

### Tool capabilities

{% code overflow="wrap" %}

```
Which clusters are visible to you when using the datasource 'logs-nonprod'?
```

{% endcode %}

{% code overflow="wrap" %}

```
What can you do with your AWS credentials?
```

{% endcode %}

### Memory queries

{% code overflow="wrap" %}

```
Search your memories for how to identify a tenant from a user id.
```

{% endcode %}

{% code overflow="wrap" %}

```
What do you know about the payments service?
```

{% endcode %}

<details>

<summary>Example response</summary>

{% code overflow="wrap" %}

```
Here's what I remember about the payments service:

**Service Details:**
• Owner: Payments team (#payments-oncall for escalations)
• Primary dependencies: Stripe API, inventory-api, pricing-api
• Database: Uses payments_db with read replicas in us-east and us-west

**Known Behaviors:**
• Memory usage spikes to ~2.5GB during month-end batch processing (expected behavior)
• Read replica lag >500ms triggers automatic fallback to primary database
• Timeout threshold for Stripe API calls: 10 seconds

**Troubleshooting Guidance:**
• When investigating latency: Check read replicas first, then Stripe API status
• When investigating errors: 90% of payment failures are due to external gateway issues, not our service
• Recent deployments: Always check if inventory-api was also deployed (frequent cause of integration issues)
```

{% endcode %}

</details>

## Remediate & Handoff

Take action on issues: summarize findings, create tickets, or open pull requests.

### Summarization

{% code overflow="wrap" %}

```
Generate a detailed summary of this issue, including a mermaid dataflow diagram of the current implementation.
```

{% endcode %}

### Investigation diagram

{% code overflow="wrap" %}

```
Create a mermaid diagram representing the path of your investigation. Include any dead ends, and make the main "golden" path (from alert to conclusion) prominent.
```

{% endcode %}

### Create tickets

{% code overflow="wrap" %}

```
Create a GitHub issue that summarizes the problem and potential solutions.
```

{% endcode %}

{% code overflow="wrap" %}

```
Create a Jira ticket in project SRE with the root cause and remediation steps.
```

{% endcode %}

### Open pull requests

{% code overflow="wrap" %}

```
Add debug logs to this flow and create a draft PR for me to review.
```

{% endcode %}

## Monitoring Production

Keep an eye on production systems and verify changes are working as expected.

### Verify a fix

{% code overflow="wrap" %}

```
The memory leak fix for auth-service was deployed at 3pm. Compare memory usage before and after the deploy.
```

{% endcode %}

{% code overflow="wrap" %}

```
We rolled back the payments-api to the previous version. Verify that the 500 errors have stopped.
```

{% endcode %}

### Scheduled follow-ups

Ask Cleric to check back after a delay. This is useful when you've pushed a fix and want to verify it later without watching dashboards yourself.

{% code overflow="wrap" %}

```
I just deployed a fix for the connection pool issue. Check the error rate for checkout-service in 30 minutes and let me know if it's back to normal.
```

{% endcode %}

{% code overflow="wrap" %}

```
Check the error logs for payments-api again in an hour and tell me if the timeout errors have stopped since the deploy at 2pm.
```

{% endcode %}


