Add one pager for improved container image lifecycle #10352
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| # Improved Docker Container Image Lifecycle | ||
|
|
||
| As part of #10349 to improve our docker container security and sustainability, we need to improve the container image lifecycle. Currently, our container definitions are stable but rarely updated, with some of the definitions dating back several years. We have no means to ensure that all container images we use contain the latest OS patches and CVE fixes. One of the main points of this proposal is to ensure that the containers are updated regularly, accepting servicing updates from the OS on a regular basis. The major business goals of this work are to make sure that: | ||
|
|
||
| - Our container images are re-built regularly and they contain the latest underlying OS patches and CVE fixes | ||
| - There is a mechanism for updating the docker containers used by product teams so that they are always on the latest version of each container image | ||
|
||
| - There is a process and tools implemented for identifying and removing images that are out of date | ||
|
||
| - There is a process and tools to delete old container images (older than 3-6 months) from MCR | ||
|
||
| - All images used in the building and testing of .NET use Microsoft-approved base images, either Mariner where appropriate, or [Microsoft Artifact Registry-approved images](https://eng.ms/docs/more/containers-secure-supply-chain/approved-images) where Mariner is insufficient | ||
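The age-based cleanup goal above (removing images older than 3-6 months) could be sketched roughly as follows. This is only an illustration, not the proposed tooling: the `stale_images` helper, the in-memory mapping of tags to publish times, and the 180-day default are all assumptions made for the example, and no real MCR API is called.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_images(
    published: Dict[str, datetime],
    now: datetime,
    max_age_days: int = 180,  # assumption: roughly the 6-month upper bound
) -> List[str]:
    """Return the tags whose publish time is older than the cutoff, sorted by name."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(tag for tag, when in published.items() if when < cutoff)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # Hypothetical tags and publish times, for illustration only.
    images = {
        "ubuntu-20.04-old": now - timedelta(days=200),
        "ubuntu-20.04-new": now - timedelta(days=10),
    }
    print(stale_images(images, now))
```

Keeping `now` as a parameter makes the cutoff easy to test and lets the same check run with either end of the 3-6 month range.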
|
|
||
| ## Stakeholders | ||
|
|
||
| - .NET Core Engineering | ||
| - .NET Acquisition and Deployment | ||
| - .NET Product teams | ||
|
|
||
| ## Risks | ||
|
|
||
| - Will the new implementation of any existing functionality cause breaking changes for existing consumers? | ||
|
|
||
| The major risk in this portion of the epic is finding and updating all container usages by product teams, and making sure that moving them to the latest versions of the container images doesn't break their builds/tests because of missing artifacts. Our goal is to use docker tags to label the last known good (LKG) of each container image, and replace usages of specific docker image tags with a `<os>-<version>-<other-identifying-info>-latest` tag. That way, much like with helix images, their builds and tests will be updated automatically when we deploy a new latest version. In the transition to latest images, we may find that older versions of a container image have different versions of artifacts installed than the latest one, which could affect builds and tests. We will need to be prepared to help product teams identify these issues and work through them. | ||
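As a rough illustration of the `<os>-<version>-<other-identifying-info>-latest` scheme described above, the sketch below composes such a tag from its parts. The `latest_tag` helper and the example component values are hypothetical. The validation reflects Docker's general tag rules (letters, digits, underscores, periods and hyphens, not starting with a period or hyphen, at most 128 characters) rather than anything specific to this proposal.

```python
import re

# A tag component: must not start with '.' or '-', then letters, digits, '_', '.', '-'.
_COMPONENT = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]*")
_MAX_TAG_LENGTH = 128  # Docker's documented limit on tag length


def latest_tag(os_name: str, version: str, *other_info: str) -> str:
    """Compose '<os>-<version>-<other-identifying-info>-latest' (hypothetical helper)."""
    parts = [os_name, version, *other_info, "latest"]
    for part in parts:
        if not _COMPONENT.fullmatch(part):
            raise ValueError(f"invalid tag component: {part!r}")
    tag = "-".join(parts)
    if len(tag) > _MAX_TAG_LENGTH:
        raise ValueError(f"tag exceeds {_MAX_TAG_LENGTH} characters: {tag!r}")
    return tag


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(latest_tag("ubuntu", "20.04", "helix", "amd64"))
```

Building the tag from discrete components, rather than by hand, is one way to keep the OS name and version recoverable from the tag later on.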
|
Member
The Matrix of Truth (MoT) code now contains some heuristics to derive the OperatingSystemId (the unified way to connect OS definitions between MoT, OSOB and RIDs) for docker environments. The OperatingSystemId definitions can be found in the helix-machines repo at os-definitions.json. When designing the new tagging scheme, please make sure that it's possible to obtain the OperatingSystemId for our containers algorithmically rather than through the current heuristics, which might not be 100% accurate in some cases. To be clear, I'm not proposing that the OperatingSystemId be used in the new tagging scheme directly; I'm raising this for awareness, as it is the last remaining issue in unifying the concept of operating system identification across our infrastructure.
Member
In what way is the RID used from this data? Because I see Mariner listed there with a RID of cc
Member
I think having the MoT OperatingSystemId as a part of the docker tag wouldn't be a bad idea (Linux-Ubuntu-20.04-AMD64--latest for example).
Member
It doesn't sound like there will be a RID for Mariner. There's no functional need.
Member
Let me share a bit of context here. When we started designing the Matrix of Truth, it very soon became apparent that we needed a unified way of identifying a particular version of an operating system, so that we could align the concept of an OS as used by PMs (to define which OS versions are supported by each version of the product) with our internal infrastructure and tooling, such as the OSOB. Even though RIDs aren't directly an OS description (they are a way of expressing the target platforms where an application runs), it was clear that they are closely related to our OS definitions and that we should think about how to incorporate them into our model. From discussion with @ericstj and @eerhardt, we've learnt that each version of a given OS has exactly one RID that can be thought of as the "primary RID" for that version of the OS. This RID is what is used in the os-definitions.json mentioned above. Having said that, I don't think these values are currently used anywhere besides our MoT PowerBI reports. The point about there being no RID for Mariner is of course valid. @ericstj, would you know if there are plans for introducing RIDs for Mariner and, if not, what RID would correspond to the primary RID of this platform?
Member
Here's an issue for that: dotnet/runtime#65566. RID happens to be useful for you today in its current form. It may change, but that doesn't mean you can't continue to use the algorithm that the host had for identifying a RID. It's really just an algorithm that encodes a number of significant machine characteristics into a string. If the host changes that algorithm, you can just maintain your own copy of one that works for you. Even the host's algorithm requires regular maintenance to ensure it continues to provide unique and meaningful strings per distro.
Member
Thanks for the clarification, Eric!
|
||
|
|
||
| - What are your assumptions? | ||
|
|
||
| Assumptions include: | ||
| - The Matrix of Truth work will enable us to identify all pipelines and branches that are using docker containers and which images they are using | ||
| - We will be able to extend the existing publishing infrastructure to also identify images that are due for removal | ||
| - All of our existing base images can be replaced with MAR-approved images | ||
|
||
| - Most of the official build that is currently built in docker containers can be built on Mariner | ||
|
||
| - MAR-approved images are updated with OS patches and CVE fixes | ||
|
|
||
| - What are your unknowns? | ||
|
||
|
|
||
| Unknowns include: | ||
| - How will we identify the LKG for each docker image? | ||
| - What testing is currently in place for docker images, so that we can have confidence that updating the `latest` image will not break product teams? | ||
| - What is the rollback story for the `latest` tagging scheme? | ||
| - If the MAR-approved images are not updated on a regular basis, how do we apply OS patches and CVE fixes to the base operating systems? | ||
|
||
|
|
||
| - What dependencies will this epic/feature(s) have? | ||
|
|
||
| This feature will depend heavily on MAR-approved images (and whatever updating scheme they have for updating base images), as well as the existing functionality for building and publishing our docker container images. We will want to expand the existing functionality to allow us to 1) identify the last known good of each docker image and 2) tag that LKG with a descriptive `latest` tag. | ||
|
|
||
| ## Serviceability of Feature | ||
|
|
||
| ### Rollout and Deployment | ||
|
|
||
| As part of this work, we will need to implement a rollout story for the new tagging feature. We do not want every published image to immediately be tagged as `latest`. In fact, we may want to implement two different tags: `latest` and `staging`. In this scheme, we would branch the `dotnet-buildtools-prereqs-docker` repo so that we have a production branch. Every image published from main would be tagged `staging`, which could then be used in testing, much like the images in our Helix -Int pool. This tag would be used for identifying issues ahead of time, so that when we roll out, we will be more confident that the images we are tagging as `latest` will be safe for our customers. The rollout would be performed on a weekly basis, much like our helix-service, helix-machines, and arcade-services rollouts. We would roll all of the known good changes in the main branch to production, and publish those images with the `latest` tag. This would allow us to reuse the same logic for both staging and latest. | ||
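The two-tag scheme above might look something like the following sketch, in which the publishing branch decides which floating tag an image receives. The `floating_tag` helper, the `production` branch name, and the `<base>-staging` / `<base>-latest` tag shape are assumptions for illustration. The proposal itself only states that main publishes `staging` and the production branch publishes `latest`.

```python
from typing import Optional

# Assumed mapping from publishing branch to floating tag suffix.
CHANNEL_BY_BRANCH = {
    "main": "staging",       # every publish from main is a staging candidate
    "production": "latest",  # the weekly rollout publishes from production
}


def floating_tag(base: str, branch: str) -> Optional[str]:
    """Return the floating tag for an image published from `branch`.

    `base` is the descriptive part of the tag, e.g. '<os>-<version>-<other-info>'.
    Branches without a channel (feature branches, for instance) get no floating tag.
    """
    channel = CHANNEL_BY_BRANCH.get(branch)
    if channel is None:
        return None
    return f"{base}-{channel}"


if __name__ == "__main__":
    # Hypothetical base name, for illustration only.
    for branch in ("main", "production", "feature/foo"):
        print(branch, "->", floating_tag("ubuntu-20.04-helix-amd64", branch))
```

Because the only difference between the two channels is the branch-to-suffix mapping, the same publishing logic can serve both staging and latest.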
|
||
|
|
||
| We will also need a rollback story so that if an image breaks a product team's build or test, we can untag that image and retag the previous `latest` image. A rollback should be as simple as reverting a previous change and publishing the images at the new commit. While this image may be identical to a previously published image, it will effectively be treated as a new version. | ||
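The rollback story above, where a revert is published as a new version even if its content matches an earlier image, can be modelled with a small sketch. The `Publish` record and the `publish` and `rollback` helpers are hypothetical and purely in-memory. A real rollback would revert the commit and rerun the publishing pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Publish:
    version: int  # monotonically increasing publish number
    content: str  # stand-in for the image definition at that commit


def publish(history: List[Publish], content: str) -> List[Publish]:
    """Return a new history with `content` published as the next version."""
    return history + [Publish(version=len(history) + 1, content=content)]


def rollback(history: List[Publish]) -> List[Publish]:
    """Republish the previous content as a brand-new version.

    The result has the same content as the second-to-last publish but a new
    version number, so `latest` moves forward rather than backward.
    """
    if len(history) < 2:
        raise ValueError("nothing to roll back to")
    return publish(history, history[-2].content)


if __name__ == "__main__":
    h = publish([], "definition-A")
    h = publish(h, "definition-B")  # suppose this one breaks a product build
    h = rollback(h)
    print(h[-1])
```

Treating the rollback as a forward publish keeps the history append-only, so the same tagging logic applies to rollbacks and to ordinary rollouts.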
|
||
|
|
||
| ### FR and Operations Handoff | ||
|
|
||
| We will create documentation for managing the tags so that when a rollback needs to occur, FR will be able to make those changes. Additionally, we will create documentation and processes that can be used by Operations and/or the vendors to handle any manual OS/base image updating or removing of old and out-of-date images from MCR, as necessary. We will also create documentation for responding to customer requests for new docker images, including where to get the base images, and how to install required dependencies (though that is coming in a different one pager). | ||