Production Readiness Scorecard Example

Production readiness overview

Production readiness is about ensuring your software is secure, reliable, and fully observable for operational use. It minimizes downtime, improves user experiences, and reduces the chance of critical failures in a live environment.

Achieving production readiness involves setting cross-functional standards for testing, monitoring, documentation, code reviews, observability, security controls, and deployment workflows, often across different platforms such as AWS, GitHub, and Kubernetes. However, keeping up with the velocity of modern software development is a challenge, and many organizations still rely on spreadsheets, wikis, Git repositories, or project management software to manage this process.

But production readiness isn’t a one-time check; it’s an ongoing process. Because software components constantly evolve, they must be re-verified as production-ready over time. That means having a framework in place to assess ongoing health (functionality, latency, and error rates) and to monitor adherence to standards. This lets teams continuously align with best practices while reducing the burden on developers.

Key Elements to Track for Production Readiness

🥇 Gold Tier Requirements:

  • Code Coverage -> Achieve at least 90% code coverage to ensure comprehensive testing.

  • Critical Issues -> No open critical issues in tracking systems like Jira.

  • Vulnerabilities -> Zero known vulnerabilities identified by tools like SonarQube.

  • Alerting Configuration -> System alerts set up for monitoring key metrics.

  • Escalation Policies -> Defined on-call escalation policies with at least two levels.

  • Secure Design Reviews -> Conduct periodic reviews for secure design validation.

  • Availability and Latency SLOs -> Meet set SLO targets for availability and latency consistently.

  • Traffic KPIs -> Track key traffic metrics to monitor performance in real-time.

  • P99 Latency Tracking -> Maintain logs for P99 latency metrics.

  • Error Rate Tracking -> Ensure error tracking is enabled to capture incidents in real-time.

🥈 Silver Tier Requirements:

  • Health Dashboards -> Have dashboards for real-time system health visibility.

  • Code Coverage -> Maintain code coverage of at least 75%.

  • Advanced Security -> Enable advanced security scanning (e.g., Dependabot, CodeQL).

  • Rollback Documentation -> Document rollback processes for quick recovery.

  • Runbooks -> Provide runbooks for all critical processes.

  • Observability Integration -> Ensure observability tools are integrated as sources for each service.

🥉 Bronze Tier Requirements:

  • Git Ignore -> Confirm a .gitignore file is present in each repository.

  • Git Integration -> Integrate the Git provider as a data source in the catalog.

  • Incident Management Integration -> Integrate with incident management tools to handle issues.

  • Peer Reviews for PRs -> Require peer reviews for all pull requests.

  • Recent Deployments -> Track and log recent deployments.

  • Owner Assignment -> Ensure at least one owner is assigned to each service.

  • On-Call Definition -> Establish an on-call rotation for handling incidents.

  • README File -> Ensure that each repository includes a README for quick reference.

  • SLOs for Availability and Latency -> Define SLOs for key metrics such as availability and latency.

By following these tiers, you can keep your services production-ready without relying on manual processes to keep software aligned with evolving standards. With a framework in place, you’ll not only meet today’s needs but also be prepared to scale and adapt as software evolves.
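As an illustration of how tiered requirements could translate into a single rank, here is a minimal Python sketch. It assumes that tiers are cumulative (a service only reaches silver if every bronze rule also passes) and that a service with no tier achieved is labeled "noRank"; both are assumptions made for this example, and the function name `highest_rank` is hypothetical rather than part of any platform API.

```python
from typing import Dict, List

# Tier order from lowest to highest, matching the tiers listed above.
TIER_ORDER = ["bronze", "silver", "gold"]


def highest_rank(rule_results: Dict[str, List[bool]]) -> str:
    """Return the highest tier whose rules, and all lower tiers' rules, pass.

    rule_results maps a tier id to the pass/fail outcome of each of its rules.
    Assumption: a tier that is missing from the input counts as not achieved.
    """
    achieved = "noRank"  # assumed label for "no tier reached"
    for tier in TIER_ORDER:
        results = rule_results.get(tier)
        if not results or not all(results):
            break  # a failing tier blocks every tier above it
        achieved = tier
    return achieved


if __name__ == "__main__":
    example = {"bronze": [True, True], "silver": [True], "gold": [False, True]}
    print(highest_rank(example))
```

Under the cumulative assumption, a service that satisfies every gold rule but misses a single bronze rule still ends up with no rank, which is why the basic bronze checks are worth automating first.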

Example production readiness scorecard code definition

{
  "id": "service-production-readiness",
  "title": "Production Readiness Checklist",
  "description": "This checklist contains points that must be satisfied during implementation and verified prior to release. Please note that all these items must still be satisfied post release.",
  "isActive": true,
  "blueprintId": "service",
  "ranks": [
    {
      "id": "bronze",
      "rules": [
        {
          "id": "readme-available",
          "title": "Readme is available",
          "description": "The readme property for this service is being populated from the git provider.",
          "conditions": [
            {
              "field": "data.properties.readme",
              "operator": "like",
              "value": "_%"
            }
          ]
        },
        {
          "id": "required-peer-reviews-for-prs",
          "title": "Pull requests require at least 1 peer review",
          "description": "The repo of this service is configured to require at least 1 peer review to approve pull requests. Validated by checking that the property numRequiredApprovals of this service is greater than or equal to 1.",
          "conditions": [
            {
              "field": "data.properties.numRequiredApprovals",
              "operator": "gte",
              "value": 1
            }
          ]
        },
        {
          "id": "recent-deployments",
          "title": "At least one deployment this month?",
          "description": "Validates whether there has been at least one deployment in this service since the start of the month.",
          "conditions": [
            {
              "field": "data.calculationProperties.deploymentCountProd.1m.value",
              "operator": "gte",
              "value": 1
            }
          ]
        },
        {
          "id": "owner-is-assigned",
          "title": "Service Owner is Assigned",
          "description": "The service owner team is specified",
          "conditions": [
            {
              "field": "SELECT (s.relations->'owner'->'value' is not null)::integer AS num_owners FROM entities s WHERE s.blueprintid = 'service' AND s.id = '{{ data.id }}'",
              "expression": true,
              "operator": "gte",
              "value": 1
            }
          ]
        },
        {
          "id": "on-call-defined",
          "title": "At least one person is on call",
          "description": "The property currentOnCalls has at least one person assigned to this service.",
          "conditions": [
            {
              "field": "data.properties.currentOnCalls",
              "operator": "like",
              "value": "[_%]"
            }
          ]
        }
      ]
    },
    {
      "id": "silver",
      "rules": [
        {
          "id": "monitoring-dashboards-defined",
          "title": "Monitoring dashboards are defined",
          "description": "At least one link was added to the monitoring dashboards url property of this service.",
          "conditions": [
            {
              "field": "data.properties.monitoringDashboards",
              "operator": "like",
              "value": "[_%]"
            }
          ]
        },
        {
          "id": "rollback-process-is-documented",
          "title": "The rollback process is documented",
          "description": "At least one link was added to the rollback process url property of this service.",
          "conditions": [
            {
              "field": "data.properties.rollbackProcess",
              "operator": "like",
              "value": "_%"
            }
          ]
        },
        {
          "id": "code-coverage-ge-75",
          "title": "Code coverage >= 75%",
          "description": "The code coverage for this service's repo is greater than or equal to 75%.",
          "conditions": [
            {
              "field": "data.properties.codeCoverage",
              "operator": "gte",
              "value": 75
            }
          ]
        },
        {
          "id": "runbooks-are-available",
          "title": "There are one or more runbooks available",
          "description": "At least one link was added to the runbooks url property of this service.",
          "conditions": [
            {
              "field": "data.properties.runbooks",
              "operator": "like",
              "value": "[_%]"
            }
          ]
        }
      ]
    },
    {
      "id": "gold",
      "rules": [
        {
          "id": "code-coverage-ge-90",
          "title": "Code coverage >= 90%",
          "description": "The code coverage for this service's repo is greater than or equal to 90%.",
          "conditions": [
            {
              "field": "data.properties.codeCoverage",
              "operator": "gte",
              "value": 90
            }
          ]
        },
        {
          "id": "no-critical-jira-issues",
          "title": "No critical Jira issues",
          "description": "All critical Jira issues with priority 'P0' or 'P1' have their status in 'Done', 'Resolved' or 'Closed'.",
          "conditions": [
            {
              "field": "data.properties.openCriticalJiraIssues",
              "operator": "eq",
              "value": 0
            }
          ]
        },
        {
          "id": "open-vulnerabilities",
          "title": "No open vulnerabilities",
          "description": "No known vulnerability assigned to this service has a status other than 'Resolved' or 'Closed'.",
          "conditions": [
            {
              "field": "data.properties.openVulnerabilities",
              "operator": "eq",
              "value": 0
            }
          ]
        },
        {
          "id": "alerting-tool-defined",
          "title": "Alerting tool is configured",
          "description": "At least one link was added to the alertingTools url property of this service. The link should redirect to the alerting tool filtered by the alerts assigned to this service.",
          "conditions": [
            {
              "field": "data.properties.alertingTools",
              "operator": "like",
              "value": "_%"
            }
          ]
        },
        {
          "id": "on-call-escalation-policies",
          "title": "On call escalation policies >= 2",
          "description": "At least 2 levels are defined in the escalation policy, so that if the first on-call does not acknowledge within the defined time frame, there is a backup.",
          "conditions": [
            {
              "field": "SELECT coalesce(sum(jsonb_array_length(coalesce(s.properties->'escalationPolicies'->'url', '[]'))), 0) AS num_urls FROM entities s WHERE s.blueprintid = 'service' AND s.id = '{{ data.id }}'",
              "expression": true,
              "operator": "gte",
              "value": 2
            }
          ]
        },
        {
          "id": "secure-by-design-review",
          "title": "Secure by Design Review",
          "description": "Has a secure by design review been conducted? This validates whether the last design review by property of this service has been assigned to someone.",
          "conditions": [
            {
              "field": "data.properties.lastDesignReviewBy",
              "operator": "like",
              "value": "_%"
            }
          ]
        }
      ]
    }
  ],
  "medianRank": "noRank"
}
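
To make the condition semantics in the definition above more concrete, the following Python sketch evaluates the simple (non-SQL) conditions against an entity. It is an illustration, not the platform's implementation. It assumes that "like" follows SQL LIKE conventions ("_" matches one character, "%" matches any run, so "_%" means "non-empty"), that list-valued properties are compared in their JSON-serialized form (so "[_%]" means "non-empty list"), and that a missing field fails the rule. The helper names `like_to_regex`, `resolve`, and `evaluate` are hypothetical.

```python
import json
import re
from typing import Any, Dict


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL-style LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(".")       # exactly one character
        elif ch == "%":
            parts.append(".*")      # any run of characters, possibly empty
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def resolve(entity: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'data.properties.readme'; None if missing."""
    current: Any = entity
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def evaluate(condition: Dict[str, Any], entity: Dict[str, Any]) -> bool:
    """Evaluate one condition; SQL 'expression' conditions are out of scope."""
    if condition.get("expression"):
        raise NotImplementedError("SQL expression conditions are not handled here")
    actual = resolve(entity, condition["field"])
    if actual is None:
        return False  # assumption: a missing property fails the rule
    operator, expected = condition["operator"], condition["value"]
    if operator == "like":
        # Assumption: non-string values are matched in JSON-serialized form.
        text = actual if isinstance(actual, str) else json.dumps(actual)
        return like_to_regex(expected).fullmatch(text) is not None
    if operator == "gte":
        return actual >= expected
    if operator == "eq":
        return actual == expected
    raise ValueError(f"Unsupported operator: {operator}")
```

A rule in the definition passes when all of its conditions evaluate to true, so a caller could combine results with `all(evaluate(c, entity) for c in rule["conditions"])`.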
