Merged
20 commits
17bd0b0
Replacement updates to docs under App-Monitoring
paulinewv May 6, 2026
5760674
Update src/docs/app-monitoring/check-application-health-after-outage.md
paulinewv May 11, 2026
e267109
Update src/docs/app-monitoring/guidelines-for-sli-and-monitoring.md
paulinewv May 11, 2026
4b467f1
Update src/docs/app-monitoring/user-defined-monitoring.md
paulinewv May 11, 2026
d4888bc
Update src/docs/app-monitoring/user-defined-monitoring.md
paulinewv May 11, 2026
2a2a120
Automation and Resiliency
paulinewv May 27, 2026
f07e6b0
Build deploy and maintain apps section
paulinewv May 27, 2026
2a24bfa
Database and api management section
paulinewv May 27, 2026
45e5185
OpenShift projects and access section
paulinewv May 27, 2026
b67afcd
Reusable code and services section
paulinewv May 27, 2026
b0a1c1d
ArgoCD Notifications
paulinewv May 27, 2026
1fe2a4f
Images section Index file
paulinewv May 27, 2026
6b3b317
Tech docs writing guide
paulinewv May 27, 2026
b2fd57d
Merge branch 'main' into remove-rocketchat-update
paulinewv May 27, 2026
86bc0e1
Build deploy apps part 2
paulinewv May 27, 2026
948da6b
Merge branch 'remove-rocketchat-update' of https://github.com/bcgov/p…
paulinewv May 27, 2026
8c02dca
Update src/docs/app-monitoring/user-defined-monitoring.md
paulinewv May 27, 2026
160a583
User defined monitoring-v2
paulinewv May 27, 2026
c921c3f
Merge branch 'remove-rocketchat-update' of https://github.com/bcgov/p…
paulinewv May 27, 2026
2194a21
Sysdig dashboard template
paulinewv May 28, 2026
Original file line number Diff line number Diff line change
@@ -40,9 +40,9 @@ Assuming your application is built in a cloud-native, highly resilient manner th

If there is a platform-wide outage, check the following sources for more information:

- Rocket.Chat channels `#devops-alerts`, `#devops-sos` and `#devops-operations`
- MS Teams [Developer Community](https://teams.microsoft.com/l/team/19%3A6bffce0ac7aa47a1ba9f6d9a7e898db9%40thread.tacv2/conversations?groupId=a80418da-c27b-406e-89ab-7695b61924d8&tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc) channels [OpenShift-alerts](https://teams.microsoft.com/l/channel/19%3A2466087e039143fbb5258ec96ad65fab%40thread.tacv2/OpenShift-alerts?groupId=a80418da-c27b-406e-89ab-7695b61924d8&tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc) and [OpenShift-emergencyhelp](https://teams.microsoft.com/l/channel/19%3A13e667be3a6a46f8aa208c3cef190f20%40thread.tacv2/OpenShift-emergencyhelp?groupId=a80418da-c27b-406e-89ab-7695b61924d8&tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc)

- If you can't access Rocket.Chat, check the off-site [status page](https://status.developer.gov.bc.ca)
- If you can't access MS Teams, check the off-site [status page](https://status.developer.gov.bc.ca)

The Platform Services team keeps the community informed about the status of the outage. While the outage is in progress, there isn't much the team can do. Review the status and stay ready for the outage to end.

@@ -58,7 +58,7 @@ Connect to your application to see if it's up and running. If an initial check o

Your pods will likely scale back up on their own. However, it's still a good idea to ensure that this has occurred as expected.

- If it hasn't, this may indicate a problem with your deployment-config or with your health checks.
- If it hasn't, this may indicate a problem with your deployment or with your health checks.

- If you're having problems connecting to your application and your pods look healthy, restart them. If the restart fixes the problem, you may need to build a more robust health check for those pods.
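
A more robust health check might be sketched as separate readiness and liveness probes on the container. The excerpt below is illustrative only: the container name, image placeholder, endpoint paths, port, and timing values are all assumptions rather than values from this documentation, and your application would need to expose matching endpoints.

```yaml
# Hypothetical excerpt from a Deployment pod template.
# Names, paths, ports, and timings are placeholders to adapt.
containers:
  - name: my-app
    image: <your-image>
    # Readiness: the pod only receives traffic while this check passes.
    readinessProbe:
      httpGet:
        path: /healthz/ready
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3
    # Liveness: the container is restarted after repeated failures.
    livenessProbe:
      httpGet:
        path: /healthz/live
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 20
      failureThreshold: 3
```

Keeping the two probes separate means a pod that is temporarily unable to serve traffic is removed from the route without being restarted, while a pod that is truly stuck is restarted automatically.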

@@ -74,7 +74,7 @@ This error generally indicates that the pod is having trouble getting to the ima

This is usually an indication that there is still a problem with the cluster.

* Check [Rocket.Chat](https://chat.developer.gov.bc.ca/home)
* Check [MS Teams](https://teams.microsoft.com/l/team/19%3A6bffce0ac7aa47a1ba9f6d9a7e898db9%40thread.tacv2/conversations?groupId=a80418da-c27b-406e-89ab-7695b61924d8&tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc)
* Check the [BCGov Platform Services status page](https://status.developer.gov.bc.ca/)
* Check if your image still exists and hasn't been corrupted.

Original file line number Diff line number Diff line change
@@ -65,7 +65,7 @@ sum(last_over_time(sysdig_container_cpu_cores_used{kube_cluster_name=~$Cluster,k

Regarding Utilization, our goal is to maximize it as much as possible. However, achieving a consistent level of 80% or higher isn't always feasible. We'll make our best efforts to achieve this target while ensuring that the application meets other SLOs.

As for handling namespace limitations, we can establish an alert system. When utilization reaches 80%, it will send a notification via RocketChat, indicating the need to allocate additional resources since it's nearing capacity. We can then either allocate more resources or implement a horizontal auto-scaling approach based on the specific circumstances.
As for handling namespace limitations, we can establish an alert system. When utilization reaches 80%, it will send a notification indicating the need to allocate additional resources since it's nearing capacity. We can then either allocate more resources or implement a horizontal auto-scaling approach based on the specific circumstances.

```
sum(last_over_time(sysdig_container_cpu_cores_used{kube_cluster_name=~"silver",kube_namespace_name=~"platform-registry-prod"}[10s])) / (sum(last_over_time(kube_pod_sysdig_resource_limits_cpu_cores{kube_cluster_name=~"silver",kube_namespace_name=~"platform-registry-prod"}[10s])) ) > 0.8
152 changes: 0 additions & 152 deletions src/docs/app-monitoring/sysdig-monitor-create-alert-channels.md

This file was deleted.

4 changes: 2 additions & 2 deletions src/docs/app-monitoring/sysdig-monitor-onboarding.md
Original file line number Diff line number Diff line change
@@ -31,7 +31,7 @@ Monitor your Kubernetes hosts with [Sysdig Monitor](https://sysdig.com/products/
4. [Set up advanced metrics in Sysdig Monitor](../app-monitoring/sysdig-monitor-set-up-advanced-functions.md)
5. [Resource monitoring dashboards](../app-monitoring/resource-monitoring-dashboards.md)

If you have any questions about Sysdig or application monitoring, please contact the Platform Services team on the [#devops-sysdig ](https://chat.developer.gov.bc.ca/channel/devops-sysdig) channel in Rocket.Chat.
If you have any questions about Sysdig or application monitoring, please contact the Platform Services team in the [OpenShift-howto-sysdig](https://teams.microsoft.com/l/channel/19%3A93dff023d40c4440b26cf9c0b236a93f%40thread.tacv2/OpenShift-howto-sysdig?groupId=a80418da-c27b-406e-89ab-7695b61924d8&tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc) channel in MS Teams.

---

@@ -43,7 +43,7 @@ If you have any questions about Sysdig or application monitoring, please contact
- [Sysdig API](https://docs.sysdig.com/en/docs/developer-tools/sysdig-rest-api-conventions/)
- [Monitoring with Sysdig](https://digital.gov.bc.ca/technology/cloud/private/products-tools/sysdig/)
- [Sysdig User Profile](https://app.sysdigcloud.com/#/settings/user)
- [devops-sysdig RocketChat channel](https://chat.developer.gov.bc.ca/channel/devops-sysdig)
- [MS Teams OpenShift-howto-sysdig channel](https://teams.microsoft.com/l/channel/19%3A93dff023d40c4440b26cf9c0b236a93f%40thread.tacv2/OpenShift-howto-sysdig?groupId=a80418da-c27b-406e-89ab-7695b61924d8&tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc)
- [Sysdig Monitor](https://docs.sysdig.com/en/sysdig-monitor.html)
- [Sysdig Monitor Dashboards](https://docs.sysdig.com/en/dashboards.html)
- [Sysdig Alerts](https://docs.sysdig.com/en/alerts.html)
Original file line number Diff line number Diff line change
@@ -80,7 +80,7 @@ To enable Promscrape to find your application metrics, do the following:

- [Set up a team in Sysdig Monitor](../app-monitoring/sysdig-monitor-setup-team.md)
- [Create alert channels in Sysdig Monitor](../app-monitoring/sysdig-monitor-create-alert-channels.md)
- [devops-sysdig RocketChat channel](https://chat.developer.gov.bc.ca/channel/devops-sysdig)
- [MS Teams OpenShift-howto-sysdig channel](https://teams.microsoft.com/l/channel/19%3A93dff023d40c4440b26cf9c0b236a93f%40thread.tacv2/OpenShift-howto-sysdig?groupId=a80418da-c27b-406e-89ab-7695b61924d8&tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc)
- [Migrate Using Default Configuration](https://docs.sysdig.com/en/docs/sysdig-monitor/integrations/working-with-integrations/custom-integrations/collect-prometheus-metrics/migrating-from-promscrape-v1-to-v2/#migrate-using-default-configuration)
- [Sysdig Monitor](https://docs.sysdig.com/en/sysdig-monitor.html)
- [Sysdig Monitor Dashboards](https://docs.sysdig.com/en/dashboards.html)
6 changes: 3 additions & 3 deletions src/docs/app-monitoring/sysdig-monitor-setup-team.md
Original file line number Diff line number Diff line change
@@ -176,15 +176,15 @@ To access them:

## Troubleshooting

- Error from `sysdig-team` custom resource: if you don't see `Awaiting next reconciliation` after waiting for 5 minutes, contact the Platform Services team on the [#devops-sysdig Rocket.Chat channel](https://chat.developer.gov.bc.ca/channel/devops-sysdig). Make sure to include the OpenShift cluster and namespace information.
- Error from `sysdig-team` custom resource: if you don't see `Awaiting next reconciliation` after waiting for 5 minutes, contact the Platform Services team in the [MS Teams OpenShift-howto-sysdig channel](https://teams.microsoft.com/l/channel/19%3A93dff023d40c4440b26cf9c0b236a93f%40thread.tacv2/OpenShift-howto-sysdig?groupId=a80418da-c27b-406e-89ab-7695b61924d8&tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc). Make sure to include the OpenShift cluster and namespace information.

- If you don't see the Sysdig team created, double check that:

- `sysdig-team` custom resource is created in `tools` namespace
- There are no duplicated `sysdig-team` custom resources in dev/test/prod namespaces. Please run `oc -n <NAMESPACE> delete sysdig-team <SYSDIG-TEAM-NAME>` to delete the extra custom resource.
- Your Sysdig account profile matches the email address that you have provided in the `sysdig-team` custom resource. If there is a mismatch, reapply the custom resource.

- If you don't see a default dashboard in your Sysdig team, contact the Platform Services team on the [#devops-sysdig Rocket.Chat channel](https://chat.developer.gov.bc.ca/channel/devops-sysdig).
- If you don't see a default dashboard in your Sysdig team, you can access the [dashboard template in GitHub](https://github.com/bcgov/platform-services-sysdig/tree/main/dashboard-template) and apply it.


---
@@ -200,7 +200,7 @@ To access them:
- [Sysdig API](https://docs.sysdig.com/en/docs/developer-tools/sysdig-rest-api-conventions/)
- [Monitoring with Sysdig](https://cloud.gov.bc.ca/private-cloud/our-products-in-the-private-cloud-paas/monitoring-with-sysdig/)
- [Sysdig User Profile](https://app.sysdigcloud.com/#/settings/user)
- [devops-sysdig RocketChat channel](https://chat.developer.gov.bc.ca/channel/devops-sysdig)
- [MS Teams OpenShift-howto-sysdig channel](https://teams.microsoft.com/l/channel/19%3A93dff023d40c4440b26cf9c0b236a93f%40thread.tacv2/OpenShift-howto-sysdig?groupId=a80418da-c27b-406e-89ab-7695b61924d8&tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc)
- [Sysdig Monitor Dashboards](https://docs.sysdig.com/en/dashboards.html)
- [Sysdig Alerts](https://docs.sysdig.com/en/alerts.html)
- [Sysdig Alerts with Kubernetes and PromQL](https://sysdig.com/blog/alerting-kubernetes/)
4 changes: 1 addition & 3 deletions src/docs/app-monitoring/user-defined-monitoring.md
Original file line number Diff line number Diff line change
@@ -210,7 +210,7 @@ Please have this network policy added to ensure that proper metrics are scraped

Each namespace has a default, non-editable `AlertManagerConfig` object called `platform-services-controlled-alert-routing` that sets out some default alerting rules. It ensures that the product's Technical Leads and Product Owner receive the base-level alerts.

You can add an additional `AlertManagerConfig` to add more email contacts, or set up another notification channel like RocketChat.
You can add an additional `AlertManagerConfig` to add more email contacts. Other notification channels, such as MS Teams, may be supported in the future.

```yaml
apiVersion: monitoring.coreos.com/v1beta1
@@ -256,8 +256,6 @@ spec:

If you want to use an email receiver, keep it the same and only update the `name` and `to` fields, as well as the `Product Name Here` in the `html`. The rest ensures that the email is sent correctly and with nice formatting.

You can add multiple email receivers, or other options like PagerDuty or a webhook to RocketChat.

The `route` can be set to group alerts, but include at least `namespace` and `severity` to ensure the email template works correctly. You can also create sub-routes with different receivers or repeat intervals and match on a subset of alerts.

You can read more in the [Alertmanager docs](https://prometheus.io/docs/alerting/latest/configuration/) on how to configure receivers and routes.
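
As a rough sketch of the routing described above, the `route` section under `spec` might look like the following. The receiver names `team-email` and `team-email-critical` and all interval values are hypothetical, and both receivers are assumed to be defined under `receivers` in the same resource. The field names follow the prometheus-operator `v1beta1` schema, so verify them against the CRD installed on your cluster.

```yaml
# Hypothetical route excerpt; receiver names and intervals are placeholders.
spec:
  route:
    receiver: team-email
    # Keep at least namespace and severity so the email template renders correctly.
    groupBy: ["namespace", "severity"]
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    routes:
      # Sub-route: critical alerts go to a different receiver and repeat more often.
      - receiver: team-email-critical
        matchers:
          - name: severity
            value: critical
            matchType: "="
        repeatInterval: 1h
```

Alerts that match no sub-route fall through to the top-level receiver, so the default contacts still get everything that is not explicitly re-routed.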
Original file line number Diff line number Diff line change
@@ -52,7 +52,7 @@ The best way to keep on top of these issues in a proactive manner is to monitor

Many of these notification options provide your team an opportunity to act to prevent an outage before it starts. Finding out, for example, that your storage is almost full before it fills up means that your team can act to deal with the storage issue before it causes an outage or service issue. Other notifications can let you know the moment an outage starts and can provide extremely valuable information about the cause of the outage, so your team can begin troubleshooting right away without having to wait for users to inform you of the problem.

Your team should also ensure you are in the [#devops-alerts channel](https://chat.developer.gov.bc.ca/channel/devops-alerts) in Rocket.Chat where notices of upcoming maintenance are posted. There are not many messages sent in this channel, so we recommend switching your Rocket.Chat notification settings for the channel to **All messages**.
Your team should also ensure you are in the [OpenShift-alerts channel](https://teams.microsoft.com/l/channel/19%3A2466087e039143fbb5258ec96ad65fab%40thread.tacv2/OpenShift-alerts?groupId=a80418da-c27b-406e-89ab-7695b61924d8&tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc) in MS Teams where notices of upcoming maintenance are posted. There are not many messages sent in this channel, so we recommend switching your MS Teams notification settings for the channel to **All messages**.

## A highly available application

2 changes: 1 addition & 1 deletion src/docs/automation-and-resiliency/argo-cd-usage.md
Original file line number Diff line number Diff line change
@@ -388,7 +388,7 @@ Enter the following information:

* Payload URL: `https://gitops-shared.apps.CLUSTERNAME.devops.gov.bc.ca/api/webhook`
* Content type: application/json
* Secret: (This is just to prevent abuse of the API endpoint by outside parties. You can find the secret in the description of the Rocketchat channel "#devops-argocd".)
* Secret: (This is just to prevent abuse of the API endpoint by outside parties. You can find the secret in the description of the MS Teams channel [OpenShift-howto-argocd](https://teams.microsoft.com/l/channel/19%3A5d971c5af20d4c6dbab8f7213671aaaf%40thread.tacv2/OpenShift-howto-argocd?groupId=a80418da-c27b-406e-89ab-7695b61924d8&tenantId=6fdb5200-3d0d-4a8a-b036-d3685e359adc).)
* SSL verification: keep the default "Enable SSL verification"
* Which events would you like to trigger this webhook?: This is up to you to determine the conditions under which the webhook is triggered.
* Active: keep this box checked in order to enable the webhook