Establish convention for capturing system incidents #730

@grahamalama

We are trying to reduce the frequency of incidents that cause customers to ask, "Why isn't my bug syncing?" To answer the question "Are we reducing the frequency?", we need to capture this data consistently.

In an ADR, we should establish a workflow for tracking incidents, including the start time and resolution time. This way, we can capture both the number of incidents and the average time to resolve incidents.
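
To make the two metrics concrete, here is a rough sketch of what an incident record and the resulting numbers could look like. The `Incident` class, its field names, and `mean_time_to_resolve` are all hypothetical and only meant to illustrate the idea; the actual shape would be decided in the ADR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""

    kind: str
    started_at: datetime
    resolved_at: Optional[datetime] = None  # None while the incident is still open


def mean_time_to_resolve(incidents: list[Incident]) -> Optional[timedelta]:
    """Average duration of resolved incidents, or None if none are resolved yet."""
    durations = [
        i.resolved_at - i.started_at for i in incidents if i.resolved_at is not None
    ]
    if not durations:
        return None
    # sum() needs a timedelta start value, since the default start is the int 0
    return sum(durations, timedelta()) / len(durations)


incidents = [
    Incident("queue_disabled", datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 10)),
    Incident("sync_failures", datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 12)),
    Incident("partial_syncs", datetime(2024, 1, 3, 9)),  # still open
]
incident_count = len(incidents)
average = mean_time_to_resolve(incidents)
```

Open incidents count toward the total but are left out of the average, since they have no resolution time yet. Whether that is the right choice is something the ADR would need to settle.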

A non-exhaustive list of "incidents" might include:

  • webhook queue becomes disabled
  • a workflow or workflows record n partial syncs over some duration of time
  • a workflow or workflows record n total sync failures over some duration of time
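
The "n events over some duration" criteria above could be checked with a sliding window over event timestamps. This is a minimal sketch under assumed inputs: `exceeds_threshold` is a hypothetical helper, and the values of `n` and `window` are placeholders, since the issue does not fix them.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Iterable


def exceeds_threshold(
    event_times: Iterable[datetime], n: int, window: timedelta
) -> bool:
    """Return True if any span of length `window` contains at least n events.

    Assumes n >= 1. `event_times` would be the timestamps of partial syncs
    or total sync failures recorded for a workflow.
    """
    recent: deque[datetime] = deque()
    for ts in sorted(event_times):
        recent.append(ts)
        # Drop events that fall outside the window ending at `ts`
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= n:
            return True
    return False


start = datetime(2024, 1, 1, 9, 0)
failures = [start + timedelta(minutes=m) for m in (0, 10, 20)]
is_incident = exceeds_threshold(failures, n=3, window=timedelta(minutes=30))
```

The first timestamp at which the threshold is crossed would be a natural candidate for the incident's start time, though that is an assumption rather than something stated here.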
