Cloud DevOps Team

If you’re looking for Delivery documentation instead of DevOps, visit this page.

Vision

The two primary pillars of the Cloud Team are Availability and Observability, as defined in RFC 498.

This team ensures that Sourcegraph.com has the same reliability and availability as other world-class SaaS offerings. It is also responsible for the observability monitoring and tooling used to verify that we are meeting these goals.

More can be found in our Cloud Vision.

Areas of Ownership

The Cloud DevOps team is responsible for the infrastructure used to host Sourcegraph.com. This includes, but is not limited to, dashboards and observability, uptime and reliability, and managing our cloud provider resources. This team works closely with the other teams in the Cloud org to ensure Sourcegraph.com is available and functional for our users. Notably, this team has the ability to slow or stop rollouts to Sourcegraph.com if needed to improve stability.

This team is responsible for:

  • Continuous deployment of Sourcegraph.com
  • Cloud monitoring infrastructure (Prometheus / Grafana)
  • Managed instances

Contact & Support Guideline

  • The best way to contact the Cloud DevOps team is in the #cloud-devops Slack channel. If it is urgent, please cc @devops-support in your message.
  • On GitHub, label issues with team/devops or mention the @team/cloud-devops team.
  • Please do not mention @cloud-devops-team or directly message individual teammates; this protects their focus. Instead, use the @devops-support handle, and the on-call DevOps teammate will be notified.

Content

How we work

We primarily work within the following repositories:

Issue tracking

The DevOps GitHub Project board is the single source of truth.

Planning, Sync & Retro

We don’t have sprints or cycles. Our work is mostly task-driven, guided by the quarterly goals.

  1. Planning/Sync (weekly)

    • This is a good time to call out issues you would like to discuss with everyone
    • Review our GitHub Project board to prioritize tasks for the week
  2. Retro (bi-weekly)

    • A review of what we did, for learning purposes

Standup (async)

We use Geekbot to keep others informed of what’s going on asynchronously. This is a good time to share your progress and ask for help removing any blockers.

On-call

We maintain an on-call rotation in Opsgenie. Responsibilities of the teammate who is on-call include:

  • Acknowledging incoming alerts
  • Initiating incident procedures
  • Publishing postmortems
  • Adding issues to the DevOps GitHub board with the label devops/support for external support requests, such as Slack messages addressed to @devops-support

Issue handover

If the on-call teammate reaches the end of their shift with unresolved issues, they inform the rest of the team in #cloud-devops-internal.

Goals

Goals

TODO

Goals

  • Assist the Cloud-SaaS team with RFC 525
  • Stabilize our CD environment
  • Create a pre-production environment for Cloud
  • Standardize our monitoring, logging and error reporting systems
  • Migrate our zonal cluster to a regional cluster and document the process
  • Complete assigned security tasks on behalf of security and provide evidence of completion