Issue Classification Bot #95086

@JacksonKearl

Background

We have too many issues to classify them all manually, and even if that is not strictly true today, it will be as we continue to grow. We previously had a bot to classify them, but its training fell out of date and it eventually had to be shut down.

Proposal

We implement a new set of actions which automate the processes of:

1. Collecting Data

  • An initial long-running scrape of all existing issue data, followed by as-generated collection of future data
    • Update: Scraping all issue data took much less time than I expected (<1 hr). We could redo the entire scrape monthly, which gives us more flexibility to change the data we use and also means we do not have to keep customer data around.
  • Collect for each labeling event:
    • Label name
    • Name of user/bot adding the label
    • Issue title and body at the time the label was added
    • Issue final title and body
  • Unknown: where is the raw data stored? (~30 MB compressed)
    • See update above. May not need to store.
  • Unknown: GDPR implications of keeping data?
    • See update above. May not need to store.
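
As a rough illustration of the per-event record listed above, here is a minimal Python sketch. The event shapes (`"labeled"`, `"renamed"`, `"body_edited"` with `to`, `label`, and `actor` keys) are a simplified, hypothetical format invented for this example, not the actual GitHub API payloads, and `LabelingEvent` and `extract_labeling_events` are made-up names. It replays an issue's history in order so each labeling event captures the title and body as they were at that moment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelingEvent:
    label: str           # label name
    actor: str           # user or bot that added the label
    title_at_label: str  # issue title when the label was added
    body_at_label: str   # issue body when the label was added
    final_title: str     # issue title at scrape time
    final_body: str      # issue body at scrape time


def extract_labeling_events(original_title, original_body, events):
    """Replay a chronological event list (hypothetical, simplified format)
    and return one LabelingEvent per label addition."""
    title, body = original_title, original_body
    pending = []
    for event in events:
        kind = event["type"]
        if kind == "renamed":
            title = event["to"]
        elif kind == "body_edited":
            body = event["to"]
        elif kind == "labeled":
            # Snapshot the title/body as they are right now.
            pending.append((event["label"], event["actor"], title, body))
    # After the replay, title/body hold the final versions.
    return [LabelingEvent(l, a, t, b, title, body) for l, a, t, b in pending]
```

Because every record is derived from the issue history alone, the whole set can be rebuilt on each scrape, which fits the "may not need to store" update.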

2. Running Training

  • Monthly retraining
  • C#/ML.NET, either hosted via GitHub Action (Docker) or using a more powerful Azure machine
  • Potentially implemented with AutoML, which has been used by the dotnet/corefx repo for similar issue feature-area classification with good accuracy.
  • Unknown: where is the model stored? (the dotnet/corefx model is in the tens of MB, compressed)
  • Unknown: GDPR implications of storing model?
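
One way the collected records might be turned into training rows before each monthly run is sketched below in Python. This is an assumption about preprocessing, not part of the proposal: the idea of excluding labels added by certain actors (for example, the retired bot) and dropping labels with too few examples, the `min_examples` default, and the function name `build_training_rows` are all hypothetical. The actual training would still happen in ML.NET.

```python
from collections import Counter


def build_training_rows(events, excluded_actors=frozenset(), min_examples=2):
    """Turn labeling records into (label, title, body) rows.

    `events` is a list of dicts with "label", "actor", "title", "body" keys
    (a hypothetical shape). Rows from excluded actors are dropped, then any
    label left with fewer than `min_examples` rows is dropped as well.
    """
    kept = [e for e in events if e["actor"] not in excluded_actors]
    counts = Counter(e["label"] for e in kept)
    return [
        (e["label"], e["title"], e["body"])
        for e in kept
        if counts[e["label"]] >= min_examples
    ]
```

Rows in this shape could be written out as a delimited text file for the trainer to load.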

3. Labeling Issues

  • Either live, or in batches (every half hour? hour?)
  • C# running as a GitHub Action (Docker)
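
The batch variant could look roughly like the Python sketch below. It is only an outline under stated assumptions: `predict` is a hypothetical callable standing in for the trained model and returning a `(label, confidence)` pair, the issue dict shape is invented for the example, and the confidence threshold of 0.75 is an arbitrary placeholder rather than a tuned value. Skipping issues that already carry labels is also an assumption about the desired behavior.

```python
def label_batch(issues, predict, threshold=0.75):
    """Return {issue number: predicted label} for one batch.

    Issues that already have any label are skipped, and predictions below
    `threshold` are left for manual triage.
    """
    assignments = {}
    for issue in issues:
        if issue["labels"]:
            continue  # already triaged, leave it alone
        label, confidence = predict(issue["title"], issue["body"])
        if confidence >= threshold:
            assignments[issue["number"]] = label
    return assignments
```

A scheduled job would call this every half hour or hour with the issues opened since the last run, then apply the returned labels through the GitHub API.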
