gvfs-helper: auto-retry after network errors, resource throttling, split GET and POST semantics #208
Conversation
@wilbaker Thanks for the pointers.
See microsoft/git#208 for the Git changes. This should make the gvfs-helper more robust to intermittent failures when a single network call fails.
/azp run Microsoft.git
Azure Pipelines successfully started running 1 pipeline(s).
@jeffhostetler after the 503 update, I was unable to repro any network failures in the C# functional tests. (I got a different flaky failure in the C# code, but that's different ;) ) I'm happy to re-test and approve after you do the cleanup to make this not WIP.
Thanks for the confirmation! Almost finished...
/azp run Microsoft.git
Azure Pipelines successfully started running 1 pipeline(s).
Add robust-retry mechanism to automatically retry a request after network
errors. This includes retry after:
* transient network problems reported by curl
* HTTP 429 throttling (with associated Retry-After)
* HTTP 503 server unavailable (with associated Retry-After)
Add voluntary throttling using Azure X-RateLimit-* hints to avoid being
soft-throttled (tarpitted) or hard-throttled (429) on later requests.
Add global (outside of a single request) azure-throttle data to track the
rate limit hints from the cache-server and main Git server independently.
Add exponential retry backoff. This is used for transient network problems
when we don't have a Retry-After hint.
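For illustration, the retry decision roughly combines an explicit Retry-After hint (429/503) with exponential backoff when no hint is available. A minimal sketch, not the actual gvfs-helper code; MAX_RETRIES, MAX_DELAY_SEC, and the jitter scheme are illustrative assumptions:

```c
#include <stdlib.h>

#define MAX_RETRIES    6	/* illustrative, not the patch's value */
#define MAX_DELAY_SEC  300	/* illustrative cap on a single wait */

/* Should we try this request again? */
int should_retry(long http_status, int curl_transient_error, int attempt)
{
	if (attempt >= MAX_RETRIES)
		return 0;
	return curl_transient_error ||	/* e.g. connection reset, timeout */
	       http_status == 429 ||	/* throttled by the server */
	       http_status == 503;	/* server temporarily unavailable */
}

/* How long to sleep before the next attempt. */
int next_delay_sec(int attempt, long retry_after_sec)
{
	int delay;

	/* Honor an explicit Retry-After hint (429/503) when present. */
	if (retry_after_sec > 0)
		return retry_after_sec < MAX_DELAY_SEC ?
			(int)retry_after_sec : MAX_DELAY_SEC;

	/* Otherwise fall back to exponential backoff: 1, 2, 4, 8, ... */
	delay = 1 << attempt;
	if (delay > MAX_DELAY_SEC)
		delay = MAX_DELAY_SEC;

	/* Add up to ~50% jitter so parallel fetchers don't sync up. */
	return delay + rand() % (delay + 1) / 2;
}
```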
Move the call to index-pack earlier in the response/error-handling sequence
so that if we receive a 200 but the packfile is truncated or corrupted, we
can use the regular retry logic to get it again.
Refactor the way we create tempfiles for packfiles to use
<odb>/pack/tempPacks/ rather than working directly in the <odb>/pack/
directory.
Move the code that creates a new tempfile to the start of each request
attempt (initial and retry), rather than to the overall start of a request.
This gives us a fresh tempfile for each network request attempt, which
simplifies the retry mechanism, isolates us from the file-ownership issues
hidden within the tempfile class, and avoids the need to truncate previous
incomplete results. This was necessary because index-pack was pulled into
the retry loop.
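A minimal POSIX-flavored sketch of the tempfile-per-attempt idea; the real code uses git's tempfile API, and the directory layout is the only part taken from the patch, while the helper name and tmp_pack_XXXXXX pattern are illustrative:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Create a fresh download target for THIS network attempt under
 * <odb>/pack/tempPacks/ so that partial results never land in
 * <odb>/pack/ and never need to be truncated on retry.
 */
int create_attempt_tempfile(const char *odb_path, char *out_path, size_t n)
{
	snprintf(out_path, n, "%s/pack/tempPacks", odb_path);
	mkdir(out_path, 0777);	/* ok if it already exists */

	snprintf(out_path, n, "%s/pack/tempPacks/tmp_pack_XXXXXX", odb_path);
	return mkstemp(out_path);	/* returns an open fd, or -1 */
}
```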
Minor: Add support for logging X-VSS-E2EID to telemetry on network errors.
Minor: rename variable:
params.b_no_cache_server --> params.b_permit_cache_server_if_defined.
This variable is used to indicate whether we should try to use the
cache-server when it is defined. Got rid of double-negative logic.
Minor: rename variable:
params.label --> params.tr2_label
Clarify that this variable is only used with trace2 logging.
Minor: Move the code that automatically maps a cache-server 400 response
to a normal 401 response earlier in the response/error-handling sequence
to simplify later retry logic.
Minor: Decorate trace2 messages with "(cs)" or "(main)" to identify the
server in log messages. Add params->server_type to simplify this.
Signed-off-by: Jeff Hostetler <[email protected]>
Force-pushed 9a844ff to 0519894.
gvfs-helper-client.c
Outdated
So, we are assuming that any queued objects will get a flush request eventually? That sounds reasonable.
Yeah, I've split the queued and immediate usage now. The dry-run/pre-scan loops already handle the queue and drain (for missing blobs usually). The main difference now is that any missing trees (or commits) during those loops will be immediately fetched in isolation, but the queue will remain.
Expose the differences in the semantics of GET and POST for
the "gvfs/objects" API:
HTTP GET: fetches a single loose object over the network.
When a commit object is requested, it just returns
the single object.
HTTP POST: fetches a batch of objects over the network.
When the oid-set contains a commit object, all
referenced trees are also included in the response.
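For illustration, the two request shapes might look roughly like the sketch below. The URL layout and the commitDepth/objectIds field names are my reading of the documented GVFS protocol, not something shown in this patch; the buffer-formatting helpers are hypothetical:

```c
#include <stdio.h>

/* GET <base>/gvfs/objects/<oid>  -- one loose object, no tree expansion. */
void format_get_url(char *buf, size_t n, const char *base, const char *oid_hex)
{
	snprintf(buf, n, "%s/gvfs/objects/%s", base, oid_hex);
}

/* POST <base>/gvfs/objects with a JSON body listing the batch;
 * commits in the set pull in their referenced trees as well. */
void format_post_body(char *buf, size_t n, const char **oids, int count)
{
	size_t len = snprintf(buf, n, "{\"commitDepth\":1,\"objectIds\":[");
	int i;

	for (i = 0; i < count && len < n; i++)
		len += snprintf(buf + len, n - len, "%s\"%s\"",
				i ? "," : "", oids[i]);
	if (len < n)
		snprintf(buf + len, n - len, "]}");
}
```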
gvfs-helper is updated to take "get" and "post" command-line options.
The gvfs-helper "server" mode is updated to take "objects.get" and
"objects.post" verbs.
For convenience, the "get" option and the "objects.get" verb
do allow more than one object to be requested. gvfs-helper will
automatically issue a series of (single-object) HTTP GET requests,
creating a series of loose objects.
The "post" option and the "objects.post" verb will perform bulk
object fetching using the batch-size chunking. Individual HTTP
POST requests containing more than one object will be created
as a packfile. A HTTP POST for a single object will create a
loose object.
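A sketch of the batch-size chunking on the POST path; BLOCK_SIZE and post_batch() are hypothetical stand-ins, not names from the patch:

```c
#include <stdio.h>

#define BLOCK_SIZE 4000	/* hypothetical chunk size */

/* Stand-in for issuing one HTTP POST for a chunk of OIDs. */
static void post_batch(const char **oids, int n)
{
	/* >1 object: the response is stored as a packfile;
	 * exactly 1 object: the response is stored as a loose object. */
	printf("POST %d object(s), starting with %s\n", n, oids[0]);
}

void post_in_batches(const char **oids, int total)
{
	int start;

	for (start = 0; start < total; start += BLOCK_SIZE) {
		int n = total - start;
		if (n > BLOCK_SIZE)
			n = BLOCK_SIZE;
		post_batch(&oids[start], n);
	}
}
```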
This commit also contains some refactoring to eliminate the
assumption that POST is always associated with packfiles.
In gvfs-helper-client.c, gh_client__get_immediate() now uses the
"objects.get" verb and ignores any currently queued objects.
In gvfs-helper-client.c, the OIDSET built by gh_client__queue_oid()
is only processed when gh_client__drain_queue() is called. The queue
is processed using the "objects.post" verb.
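A toy model of the queue/drain split described above; queue_oid(), drain_queue(), and the array-backed queue are hypothetical stand-ins for the gh_client__* functions and the OIDSET, shown only to illustrate the calling pattern:

```c
#include <stdio.h>

#define QUEUE_MAX 16000	/* hypothetical limit */

static const char *oid_queue[QUEUE_MAX];
static int queue_len;

/* Remember the OID; nothing goes over the wire yet. */
void queue_oid(const char *oid_hex)
{
	if (queue_len < QUEUE_MAX)
		oid_queue[queue_len++] = oid_hex;
}

/* Send everything queued so far in one "objects.post"-style batch. */
void drain_queue(void)
{
	if (!queue_len)
		return;
	printf("POST %d queued object(s)\n", queue_len);
	queue_len = 0;
}

/* An immediate single-object fetch would bypass this queue entirely
 * (the "objects.get" path) and leave queued OIDs for a later drain. */
```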
Signed-off-by: Jeff Hostetler <[email protected]>
Force-pushed a5d67f2 to 5c65e9a.
@derrickstolee I think I'm done tinkering with this one. 2.20191023.1 has already been through the functional tests and is good. I just did a final squash on my fixups and am running 2.20191023.2 through its paces.
derrickstolee left a comment:
I'll rebase these commits onto tentative/features/sparse-checkout-2.24.0 for #214 after you merge.
Thanks for all your help!
…out-2.23.0 Upgrade to 2.20191023.7-sc which corresponds to commit microsoft/git@a782a7e and includes the following changes (since 2.20191015.2-sc):
- microsoft/git#208
- microsoft/git#210
Resolves #195. Includes the following updates to `microsoft/git`:
* microsoft/git#208: gvfs-helper: auto-retry after network errors, resource throttling, split GET and POST semantics.
* microsoft/git#215: gvfs-helper: dramatically reduce progress noise.
I'm marking this WIP because I haven't done a cleanup round nor squashed things yet, but I want to give the CI builds a chance to run tonight.
This series attempts to:
[x] auto-retry after network outages
[x] throttle back when requested (or demanded) by the server.
Questions:
[done] What is the right default for the network retry limit?
[done] Should the throttle back have a time limit? (It's one thing to wait 3 or 4 minutes between
packfiles because we hit it too hard, but another if it says we should wait an hour or two.)
[no] Should the network retry look at the amount of data received and try to resume it?
[no] Should the network retry split large packfile requests if we can tell the user's network is flakey?
Basic testing with 5 concurrent fetches shows that it is pretty easy to get throttled and
makes me wonder if we should even bother with multi-threading this. Perhaps we just
limit it to waiting for index-pack in another thread, but only plan to have 1 network thread.
Or maybe that is just when talking to the main server -- we might be able to multithread
when talking to the cache-server.