PoA network, all the sealers are waiting for each other after 2 months running, possible deadlock? #18402

#### System information

My current version is:

```
Geth
Version: 1.8.17-stable
Git Commit: 8bbe72075e4e16442c4e28d999edee12e294329e
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.10.1
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go-1.10
```

#### Expected behaviour

The sealers keep signing blocks normally.

#### Actual behaviour

I was running a go-ethereum private network with 6 sealers. 

Each sealer is run by:

```
directory=/home/poa
command=/bin/bash -c 'geth --datadir sealer4/  --syncmode 'full' --port 30393 --rpc --rpcaddr 'localhost' --rpcport 8600 --rpcapi='net,web3,eth' --networkid 30 --gasprice '1' -unlock 'someaddress' --password sealer4/password.txt --mine '

```

The blockchain ran fine for about 1-2 months.

Today I found that all the nodes were having issues. Each node was emitting the message "Signed recently, must wait for others".

I checked the logs and found only this message, once every hour, with no further information. The nodes were not mining:

> Regenerated local transaction journal transactions=0 accounts=0
> Regenerated local transaction journal transactions=0 accounts=0
> Regenerated local transaction journal transactions=0 accounts=0
> Regenerated local transaction journal transactions=0 accounts=0

Since all 6 sealers showed the same issue, I restarted each node, but now they are stuck at:

> INFO [01-07|18:17:30.645] Etherbase automatically configured       address=0x5Bc69DC4dba04b6955aC94BbdF129C3ce2d20D34
> INFO [01-07|18:17:30.645] Commit new mining work                   number=488677 sealhash=a506ec…8cb403 uncles=0 txs=0 gas=0 fees=0 elapsed=133.76µs
> INFO [01-07|18:17:30.645] Signed recently, must wait for others 
> 

The first odd thing is that some nodes are stuck on block 488677 and others on 488676. This behaviour was also reported by @lyhbarry in https://github.com/ethereum/go-ethereum/issues/16406.

Example:
Signer 1

![image](https://user-images.githubusercontent.com/14795944/50795139-09d2de00-12ac-11e9-9923-521af5cfe530.png)

Signer 2

![image](https://user-images.githubusercontent.com/14795944/50795132-03446680-12ac-11e9-985e-e9d534b26e0e.png)

**Note that there are no votes pending.**

So far I have shut down and restarted each node, and I have found that:
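The head mismatch described above could also be checked without screenshots. The following is only a sketch: it assumes every sealer exposes HTTP RPC with the `eth` API, as in the command above. Only port 8600 comes from my setup; any other node names and URLs are placeholders, and `rpc_call`, `head_of`, `group_by_head` and `print_heads` are helper names made up for this example.

```python
import json
import urllib.request
from collections import defaultdict


def rpc_call(url, method, params):
    """Send one JSON-RPC request to a node and return its `result` field."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())["result"]


def head_of(url):
    """Return (block number, block hash) of the node's current head."""
    block = rpc_call(url, "eth_getBlockByNumber", ["latest", False])
    return int(block["number"], 16), block["hash"]


def group_by_head(heads):
    """Group node names by the (number, hash) head they report."""
    groups = defaultdict(list)
    for name, head in heads.items():
        groups[head].append(name)
    return dict(groups)


def print_heads(nodes):
    """Query every node in `nodes` (name -> RPC URL) and print one line per distinct head."""
    heads = {name: head_of(url) for name, url in nodes.items()}
    for (number, block_hash), names in sorted(group_by_head(heads).items()):
        print(number, block_hash, names)
```

Calling `print_heads({"sealer4": "http://localhost:8600"})` with the other sealers added would print one line per distinct head. More than one line means the nodes disagree about the tip of the chain.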

- Each node is peered with the others
- Each node is part of clique.getSigners()
- Each node is waiting for another to sign...

```
INFO [01-07|18:41:56.134] Signed recently, must wait for others 
INFO [01-07|19:41:42.125] Regenerated local transaction journal    transactions=0 accounts=0
INFO [01-07|18:41:56.134] Signed recently, must wait for others 
```

**So synchronization fails, and I also can't start signing again because each node is stuck waiting for the others. Does that mean the network is useless?**

@tudyzhb's comment on that issue says:
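As far as I understand it, the "Signed recently" message comes from the Clique rule that a signer may seal at most one block out of every `floor(N/2) + 1` consecutive blocks. Below is a minimal sketch of that check, written in Python rather than Go. It is my reading of the rule, not a copy of the geth code. The signer names, block numbers and the two-fork layout are made up for illustration, and I have not confirmed that this is what happened on my network.

```python
def sign_limit(n_signers):
    """A signer may seal at most one block out of every `limit` consecutive blocks."""
    return n_signers // 2 + 1


def must_wait(recents, signer, number, n_signers):
    """Return True if `signer` is still blocked from sealing block `number`.

    `recents` maps block number -> signer of that block, similar to the
    `recents` field shown by clique.getSnapshot().
    """
    limit = sign_limit(n_signers)
    for seen, recent in recents.items():
        # Blocked unless sealing `number` would shift the old signature
        # out of the window of the last `limit` blocks.
        if recent == signer and (number < limit or seen > number - limit):
            return True
    return False


# Hypothetical situation: 6 signers, two forks with the same head height (100).
signers = ["A", "B", "C", "D", "E", "F"]
fork_x = {98: "A", 99: "B", 100: "C"}  # recent signers as seen on fork X
fork_y = {98: "D", 99: "E", 100: "F"}  # recent signers as seen on fork Y

# Assumption: A, B, C follow fork X while D, E, F follow fork Y.
following = {"A": fork_x, "B": fork_x, "C": fork_x,
             "D": fork_y, "E": fork_y, "F": fork_y}

stalled = all(must_wait(following[s], s, 101, len(signers)) for s in signers)
print("every signer stalled:", stalled)
```

In this made-up layout each signer has signed recently on the fork it follows, so none of them may seal block 101, even though on each fork three of the six signers would be free to sign if they switched to it.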

> Ref clique-seal of v1.8.11, I think there is no an effective mechanism to retry seal, when an in-turn/out-of-turn seal fail occur. So our dev network useless easily.

After this problem I took a look at the logs. Each signer has these error messages:

`Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"`



```
INFO [01-02|16:42:10.902] Signed recently, must wait for others 
WARN [01-02|16:42:11.960] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"
INFO [01-02|16:42:12.128] Imported new chain segment               blocks=1  txs=0 mgas=0.000 elapsed=540.282µs mgasps=0.000  number=488116 hash=269920…afd3c7 cache=5.99kB
INFO [01-02|16:42:12.129] Commit new mining work                   number=488117 sealhash=f7b00c…787d5c uncles=2 txs=0 gas=0     fees=0          elapsed=307.314µs
INFO [01-02|16:42:20.929] Successfully sealed new block            number=488117 sealhash=f7b00c…787d5c hash=f17438…93ffe3 elapsed=8.800s
INFO [01-02|16:42:20.929] 🔨 mined potential block                  number=488117 hash=f17438…93ffe3
INFO [01-02|16:42:20.930] Commit new mining work                   number=488118 sealhash=b09b33…1526ba uncles=2 txs=0 gas=0     fees=0          elapsed=520.754µs
INFO [01-02|16:42:20.930] Signed recently, must wait for others 
INFO [01-02|16:42:31.679] Imported new chain segment               blocks=1  txs=0 mgas=0.000 elapsed=2.253ms   mgasps=0.000  number=488118 hash=763a32…a579f5 cache=5.99kB
INFO [01-02|16:42:31.680] 🔗 block reached canonical chain          number=488111 hash=3d44dc…df0be5
INFO [01-02|16:42:31.680] Commit new mining work                   number=488119 sealhash=c8a5e7…db78a1 uncles=2 txs=0 gas=0     fees=0          elapsed=214.155µs
INFO [01-02|16:42:31.680] Signed recently, must wait for others 
INFO [01-02|16:42:40.901] Imported new chain segment               blocks=1  txs=0 mgas=0.000 elapsed=808.903µs mgasps=0.000  number=488119 hash=accc3f…44bc4c cache=5.99kB
INFO [01-02|16:42:40.901] Commit new mining work                   number=488120 sealhash=f73978…c03fa7 uncles=2 txs=0 gas=0     fees=0          elapsed=275.72µs
INFO [01-02|16:42:40.901] Signed recently, must wait for others 
WARN [01-02|16:42:41.961] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"
```

I also see some:

`INFO [01-02|16:58:10.902] 😱 block lost                             number=488205 hash=1fb1c5…a41a42`

The hash chain error was only a warning, so the nodes kept mining until the 2nd of January. Then I saw this **on each of the 6 nodes**:

![image](https://user-images.githubusercontent.com/14795944/50795604-8619f100-12ad-11e9-8786-52a2e4290ba4.png)


I have seen that there are a lot of issues about this error. The most similar is the one I linked above, but it is unresolved.

Most of the workarounds in those issues seem to be a restart, but in this case the chain seems to be in an inconsistent state and the nodes are always waiting for the others.

So, 

1. Any ideas? Peers are connected and accounts are unlocked; the network just entered a deadlock after 450k blocks.
2. Are there any logs I can provide? I only see the warnings described above and the "block lost" messages, but nothing from the moment the nodes stopped mining.
3. Is this PR related? https://github.com/ethereum/go-ethereum/pull/18072
4. Maybe it is related to @karalabe's comment on this issue: https://github.com/ethereum/go-ethereum/issues/16406?
5. Would upgrading from 1.8.17 to 1.8.20 solve this?
6. In my opinion it looks like a race condition or something similar, since I have 2 chains, one running for 2 months and the other for three months, and this is the first time this error has happened.

These are other related issues:

https://github.com/ethereum/go-ethereum/issues/16444 (same issue, but I don't have votes pending in my snapshot)

https://github.com/ethereum/go-ethereum/issues/14381#

https://github.com/ethereum/go-ethereum/issues/16825

https://github.com/ethereum/go-ethereum/issues/16406

