Conversation

@dnhatn dnhatn (Member) commented Feb 8, 2018

Shard not-available exceptions are currently ignored in replication as a
best effort to avoid failing not-yet-ready shards. However, these
exceptions can also come from fully active shards, in which case we may
have skipped important failures from replicas. Since #28049, only fully
initialized shards receive write requests. This restriction allows us to
handle all exceptions in replication.

There is a side effect to this change. If a replica retries its peer
recovery a second time after being tracked in the replication group, it
can receive replication requests even though it is not yet ready. That
shard may then be failed and allocated to another node even though it has
a good Lucene index on that node.

This PR does not change the way we report replication errors to users;
shard not-available exceptions will still not be reported, as before.

Relates #28049
Relates #28534
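
Below is a minimal sketch of the behavior change, assuming hypothetical names throughout (Elasticsearch's real replication code lives in ReplicationOperation and checks dedicated exception types; nothing here is its actual API):

```java
// Hedged sketch: contrasts the old best-effort handling of replica failures
// with the strict handling this PR introduces. All names are hypothetical.
import java.util.function.Consumer;

public final class ReplicaFailureSketch {

    // Placeholder classification for "shard not available" style failures;
    // the real code checks concrete exception types instead.
    static boolean isShardNotAvailable(Exception e) {
        return e instanceof IllegalStateException;
    }

    // Before this PR: not-available exceptions were ignored as a best effort,
    // which also swallowed genuine failures thrown by fully active replicas.
    static void onReplicaFailureBefore(Exception e, Consumer<Exception> failReplica) {
        if (isShardNotAvailable(e)) {
            return; // ignored, even if the replica was actually fully active
        }
        failReplica.accept(e);
    }

    // After this PR: since only fully initialized shards receive write
    // requests (#28049), every replica failure can safely be escalated.
    static void onReplicaFailureAfter(Exception e, Consumer<Exception> failReplica) {
        failReplica.accept(e); // nothing is swallowed any more
    }

    public static void main(String[] args) {
        Consumer<Exception> failReplica =
                e -> System.out.println("fail replica: " + e.getMessage());
        Exception notAvailable = new IllegalStateException("shard not available");
        onReplicaFailureBefore(notAvailable, failReplica); // prints nothing
        onReplicaFailureAfter(notAvailable, failReplica);  // fails the replica
    }
}
```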

@dnhatn dnhatn added >enhancement, :Distributed Indexing/Recovery, v7.0.0, and v6.3.0 labels Feb 8, 2018
@dnhatn dnhatn requested review from bleskes and ywelsch February 8, 2018 14:16
@ywelsch ywelsch (Contributor) left a comment

LGTM. I think it would be good to only backport to 6.x once this has been running a few days on CI without issues. Also wait on merging this until #28558 is in.

@dnhatn dnhatn (Member, Author) commented Feb 8, 2018

Thanks @ywelsch for your quick review.

@dnhatn dnhatn merged commit dbf9fb3 into elastic:master Feb 8, 2018
@dnhatn dnhatn deleted the replication-exception branch February 8, 2018 23:05
dnhatn added a commit that referenced this pull request Feb 17, 2018
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Apr 27, 2018
Since elastic#28049, only fully initialized shards receive write requests.
This enhancement allows us to handle all exceptions. In elastic#28571, we
started strictly handling shard-not-available exceptions and tried to
preserve the way we report replication errors to users by reporting an
error only if it is not a shard-not-available exception. However, since
then we have unintentionally logged a warning for every exception. This
change restores the previous behavior of logging a warning only if an
exception is not a shard-not-available exception.
dnhatn added a commit that referenced this pull request Apr 27, 2018
Since #28049, only fully initialized shards receive write requests.
This enhancement allows us to handle all exceptions. In #28571, we
started strictly handling shard-not-available exceptions and tried to
preserve the way we report replication errors to users by reporting an
error only if it is not a shard-not-available exception. However, since
then we have unintentionally logged a warning for every exception. This
change restores the previous behavior of logging a warning only if an
exception is not a shard-not-available exception.

Relates #28049
Relates #28571
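
A minimal sketch of the restored logging guard described above; the class and method names are hypothetical stand-ins, not the actual Elasticsearch code:

```java
// Hedged sketch: log a warning only when a replica failure is NOT a
// shard-not-available exception; the failure is handled either way.
import java.util.logging.Logger;

public final class ReplicationLoggingSketch {

    private static final Logger logger = Logger.getLogger("replication");

    // Placeholder classification; the real code checks concrete exception types.
    static boolean isShardNotAvailable(Exception e) {
        return e instanceof IllegalStateException;
    }

    static void onReplicaOperationFailed(String shardId, Exception e) {
        if (isShardNotAvailable(e) == false) {
            // Restored behavior: only unexpected failures are logged as
            // warnings; expected not-available failures stay quiet.
            logger.warning("[" + shardId + "] failed to perform operation: " + e);
        }
        // The exception itself is still handled (shard failed or retried),
        // regardless of whether it was logged.
    }
}
```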
ywelsch added a commit that referenced this pull request Jun 11, 2018
Swallowing these exceptions is dangerous as they can result in replicas going out-of-sync with the primary.

Follow-up to #28571
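
A toy illustration of why swallowing these exceptions is dangerous, using two in-memory lists as stand-ins for shard copies (nothing here is Elasticsearch code):

```java
// Toy example: the primary acks a write while the replica's failure is
// swallowed, so the two copies silently diverge.
import java.util.ArrayList;
import java.util.List;

public final class OutOfSyncSketch {

    public static void main(String[] args) {
        List<String> primary = new ArrayList<>();
        List<String> replica = new ArrayList<>();

        primary.add("doc-1"); // the primary indexes a document
        try {
            // the replica write fails with a "shard not available" style error
            throw new IllegalStateException("shard not available");
        } catch (IllegalStateException e) {
            // Swallowed: the replica is neither failed nor retried, and the
            // write is acked as if it had been fully replicated.
        }

        // Prints: primary=[doc-1] replica=[] -> the replica is out of sync
        System.out.println("primary=" + primary + " replica=" + replica);
    }
}
```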
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Sep 11, 2018
We fail to notify the resync listener if the resync replication hits a
shard unavailable exception. Moreover, we no longer need to swallow
these unavailable exceptions.

Relates elastic#28571
Closes elastic#33613
dnhatn added a commit that referenced this pull request Sep 13, 2018
We fail to notify the resync listener if the resync replication hits a
shard unavailable exception. Moreover, we no longer need to swallow
these unavailable exceptions.

Relates #28571
Closes #33613
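
A minimal sketch of the listener bug being fixed; ResyncListener and the surrounding control flow are hypothetical stand-ins for the resync replication code:

```java
// Hedged sketch: swallowing the shard-unavailable exception also skipped the
// listener callback, so a resync could fail without anyone being notified.
public final class ResyncListenerSketch {

    interface ResyncListener {
        void onSuccess();
        void onFailure(Exception e);
    }

    // Placeholder classification for shard-unavailable failures.
    static boolean isShardNotAvailable(Exception e) {
        return e instanceof IllegalStateException;
    }

    // Buggy shape: the swallowed exception never reaches the listener.
    static void onResyncFailureBefore(Exception e, ResyncListener listener) {
        if (isShardNotAvailable(e)) {
            return; // listener.onFailure is never invoked
        }
        listener.onFailure(e);
    }

    // Fixed shape: every failure is reported; nothing is swallowed.
    static void onResyncFailureAfter(Exception e, ResyncListener listener) {
        listener.onFailure(e);
    }
}
```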
@colings86 colings86 removed the v7.0.0 label Feb 7, 2019