Feature Request: Cloud Monitoring & Logging Skill #106

Description

@unrealandychan

Summary

Production workloads on Google Cloud require robust observability, yet no existing skill covers Cloud Monitoring (metrics, alerts, SLOs, dashboards) or Cloud Logging (log routing, query syntax, cost optimization). This leaves a critical Day-2 operations gap for agents assisting with GCP deployments.

The Gap

Today, a user might ask an agent:

  • 'How do I set up alerts for my Cloud Run service?'
  • 'Why did my GKE pod crash? Where are the logs?'
  • 'How do I reduce my Cloud Logging bill?'

In each case, the agent must fall back to generic knowledge or unrelated skills. There is no single skill that documents:

  • Creating Alerting Policies via gcloud, Terraform, or the Console
  • Writing PromQL / MQL queries for Cloud Monitoring
  • Exporting logs to BigQuery or Cloud Storage for long-term retention
  • Configuring log-based metrics for custom business KPIs
  • Using Error Reporting to aggregate and track exceptions across services
  • Distributed tracing with Cloud Trace (OpenTelemetry integration)
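
To make the first bullet concrete, here is a minimal sketch of the kind of snippet the skill could carry: a Python helper that builds an alerting policy body for 5xx responses on a Cloud Run service. The field names follow the Cloud Monitoring `AlertPolicy` resource as I understand it. The helper name, the service name `checkout`, and the threshold are hypothetical. Notification channels are omitted on purpose.

```python
import json


def cloud_run_5xx_policy(service_name: str, threshold_per_second: float) -> dict:
    """Build an AlertPolicy-shaped dict for a Cloud Run 5xx-rate alert.

    Hypothetical helper: the name and defaults are illustrative, not part
    of any Google library.
    """
    metric_filter = (
        'resource.type="cloud_run_revision" '
        f'AND resource.labels.service_name="{service_name}" '
        'AND metric.type="run.googleapis.com/request_count" '
        'AND metric.labels.response_code_class="5xx"'
    )
    return {
        "displayName": f"{service_name}: 5xx rate high",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "5xx responses per second",
                "conditionThreshold": {
                    "filter": metric_filter,
                    # Convert the request counter into a per-second rate.
                    "aggregations": [
                        {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_RATE"}
                    ],
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": threshold_per_second,
                    # Fire only if the condition holds for five minutes.
                    "duration": "300s",
                },
            }
        ],
    }


policy = cloud_run_5xx_policy("checkout", 1.0)  # assumed service and threshold
print(json.dumps(policy, indent=2))
```

The printed JSON could be saved to a file and passed to something like `gcloud alpha monitoring policies create --policy-from-file=policy.json`. The command group and flag should be checked against the current gcloud reference before the skill documents them.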

Proposed Skill

A cloud-observability-basics skill that agents load when users mention: monitoring, alerting, logs, metrics, SLO, error reporting, tracing, Cloud Monitoring, Cloud Logging, or observability.

Suggested SKILL.md frontmatter

---
name: cloud-observability-basics
description: >
  Use when the user asks about monitoring, logging, alerting, tracing, or observability
  for Google Cloud services. Covers Cloud Monitoring (metrics, dashboards, alerting policies,
  SLOs), Cloud Logging (log routing, log-based metrics, excluded logs), Cloud Trace
  (distributed tracing, OpenTelemetry), and Error Reporting. WHEN: set up alert, create
  dashboard, view logs, reduce logging cost, trace request, debug crash, observability.
compatibility: Requires the roles/monitoring.metricWriter, roles/logging.logWriter, and roles/cloudtrace.agent IAM roles.
---

Key reference topics

  1. Cloud Monitoring
    • Alerting policies: metric thresholds, uptime checks, log-based alerts
    • Dashboards: JSON model, MQL vs PromQL
    • SLOs: defining SLI, error budget, burn rate alerts
    • Custom metrics: OpenTelemetry instrumentation (OpenCensus only for legacy code)
  2. Cloud Logging
    • Logs Explorer query syntax (Logging query language), regex, JSON subfield extraction
    • Log buckets, log views, and IAM
    • Log sinks: BigQuery, Cloud Storage, Pub/Sub export
    • Exclusion filters to control ingestion cost
    • Log-based metrics (counter & distribution)
  3. Cloud Trace
    • OpenTelemetry auto-instrumentation for Cloud Run, GKE, App Engine
    • Trace span analysis and latency debugging
  4. Error Reporting
    • Automatic exception grouping from Cloud Functions, Cloud Run, GKE
    • Notifications via email / Pub/Sub / Slack
  5. Cost Optimization
    • Logging ingestion pricing tiers
    • When to use _Default vs custom log buckets
    • Scheduled queries vs streaming exports
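
As one illustration of the Cloud Logging topics above, the sketch below assembles filter strings in the Logging query language. One filter finds error logs for a GKE pod, and one could serve as an exclusion filter on a sink. The helper names, namespace, pod name, and the `jsonPayload.component` field are hypothetical. Only simple `field="value"` comparisons joined with `AND` are covered.

```python
from typing import Dict, Optional


def quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_filter(equals: Dict[str, str], min_severity: Optional[str] = None) -> str:
    """Join field="value" comparisons with AND, plus an optional severity floor."""
    clauses = [f"{field}={quote(value)}" for field, value in equals.items()]
    if min_severity:
        clauses.append(f"severity>={min_severity}")
    return " AND ".join(clauses)


# "Why did my GKE pod crash?" Error-level logs for one pod.
# The namespace and pod name are placeholders.
crash_filter = build_filter(
    {
        "resource.type": "k8s_container",
        "resource.labels.namespace_name": "default",
        "resource.labels.pod_name": "checkout-7d9f",
    },
    min_severity="ERROR",
)
print(crash_filter)

# "How do I reduce my bill?" A candidate exclusion filter matching a
# hypothetical JSON subfield written by a noisy health-check component.
exclusion_filter = build_filter(
    {
        "resource.type": "cloud_run_revision",
        "jsonPayload.component": "healthcheck",
    }
)
print(exclusion_filter)
```

The same string can be pasted into Logs Explorer or used as a sink exclusion. Running it as a query first shows what would be dropped before the exclusion takes effect.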

Why Now?

  • All existing compute skills (GKE, Cloud Run, Cloud Functions) stop at 'deploy successfully' with no Day-2 operational guidance.
  • Cloud Billing and cost optimization are recurring user concerns; logging is often the surprise cost driver.
  • Google is strongly pushing OpenTelemetry as the unified observability standard; a skill should bridge GCP-native tools with OTel.

Reference Implementation

Google Cloud official docs:

Happy to contribute a SKILL.md draft if this direction is accepted.
