PoA network, all the sealers are waiting for each other after 2 months running, possible deadlock? #18402

@marcosmartinez7

System information

My current version is:

Geth
Version: 1.8.17-stable
Git Commit: 8bbe72075e4e16442c4e28d999edee12e294329e
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.10.1
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go-1.10

Expected behaviour

The sealers keep signing blocks normally.

Actual behaviour

I was running a go-ethereum private network with 6 sealers.

Each sealer is run by:

directory=/home/poa
command=/bin/bash -c 'geth --datadir sealer4/  --syncmode 'full' --port 30393 --rpc --rpcaddr 'localhost' --rpcport 8600 --rpcapi='net,web3,eth' --networkid 30 --gasprice '1' -unlock 'someaddress' --password sealer4/password.txt --mine '

The blockchain ran fine for about 1-2 months.

Today I found that all the nodes were having issues. Each node was emitting the message "Signed recently, must wait for others".

I checked the logs and found only this message, once every hour, with no other information. The nodes were not mining:

Regenerated local transaction journal transactions=0 accounts=0
Regenerated local transaction journal transactions=0 accounts=0
Regenerated local transaction journal transactions=0 accounts=0
Regenerated local transaction journal transactions=0 accounts=0

With all 6 sealers showing the same issue, I restarted each node, but now they get stuck at:

INFO [01-07|18:17:30.645] Etherbase automatically configured address=0x5Bc69DC4dba04b6955aC94BbdF129C3ce2d20D34
INFO [01-07|18:17:30.645] Commit new mining work number=488677 sealhash=a506ec…8cb403 uncles=0 txs=0 gas=0 fees=0 elapsed=133.76µs
INFO [01-07|18:17:30.645] Signed recently, must wait for others

The first weird thing is that some nodes are stuck on block 488677 and others on 488676. This behaviour was also reported in issue #16406 by the user @lyhbarry.

Example:

[image: Signer 1]

[image: Signer 2]

Note that there are no votes pending.

So, right now, I have shut down and restarted each node, and I have found that:
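One way to confirm that the sealers really disagree about the chain head is to ask each node for its latest block and group the answers. The sketch below is a minimal, offline illustration: `eth_getBlockByNumber` is a standard JSON-RPC method available through the `eth` API enabled above, but the node names and the hashes in `heads` are hypothetical placeholders, not values taken from this network.

```python
from collections import defaultdict


def head_request(request_id=1):
    """Build the JSON-RPC payload that asks a node for its latest block (no tx bodies)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getBlockByNumber",
        "params": ["latest", False],
    }


def group_by_head(heads):
    """Group node names by the (block number, block hash) they report as head."""
    groups = defaultdict(list)
    for node, block in heads.items():
        key = (int(block["number"], 16), block["hash"])
        groups[key].append(node)
    return dict(groups)


# Hypothetical responses: names and hashes are placeholders for illustration only.
heads = {
    "sealer1": {"number": hex(488677), "hash": "0xaaaa"},
    "sealer2": {"number": hex(488676), "hash": "0xbbbb"},
    "sealer3": {"number": hex(488677), "hash": "0xaaaa"},
}

for (number, block_hash), nodes in sorted(group_by_head(heads).items()):
    print(number, block_hash, nodes)
```

If more than one group comes back, the nodes are on different heads; comparing the hashes at the same height on each node would then show whether this is a real fork or just a node lagging behind.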

  • Each node is paired with the others
  • Each node is part of clique.getSigners()
  • Each node is waiting for another to sign...
INFO [01-07|18:41:56.134] Signed recently, must wait for others 
INFO [01-07|19:41:42.125] Regenerated local transaction journal    transactions=0 accounts=0
INFO [01-07|18:41:56.134] Signed recently, must wait for others 
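For context, the "Signed recently, must wait for others" message comes from the Clique rule that a signer may seal at most one block in any window of floor(N/2)+1 consecutive blocks. The sketch below models that rule in Python and shows one hypothetical way all sealers could end up waiting: two groups of three sealers each following a different fork and not importing each other's blocks. The signer names A..F, the block assignments, and the fork split are assumptions for illustration, not a diagnosis of this network.

```python
def signer_limit(n_signers):
    """Window size: a signer may seal only one block per this many consecutive blocks."""
    return n_signers // 2 + 1


def may_seal(signer, next_number, recents, n_signers):
    """Return False if `signer` sealed a block too recently to seal `next_number`.

    `recents` maps block number -> signer of that block, for the recent window.
    """
    limit = signer_limit(n_signers)
    for seen, recent in recents.items():
        if recent == signer and (next_number < limit or seen > next_number - limit):
            return False
    return True


N = 6  # six sealers, so the window is 4 blocks

# Hypothetical fork X, followed only by A, B, C (head 488676).
fork_x = {488674: "A", 488675: "B", 488676: "C"}
# Hypothetical fork Y, followed only by D, E, F (head 488675).
fork_y = {488673: "D", 488674: "E", 488675: "F"}

for name in "ABC":
    print(name, "may seal 488677 on fork X:", may_seal(name, 488677, fork_x, N))
for name in "DEF":
    print(name, "may seal 488676 on fork Y:", may_seal(name, 488676, fork_y, N))
```

In this scenario every sealer is blocked on the fork it follows, while the sealers that could extend it are on the other fork. On a single shared chain at most three of six sealers can be blocked at once, so a stall like this suggests the nodes do not agree on the head.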

So synchronization failed, but I also can't just start signing again because each node is stuck waiting for the others. Does that mean the network is useless?

The comment from @tudyzhb on that issue mentions that:

Ref clique-seal of v1.8.11, I think there is no an effective mechanism to retry seal, when an in-turn/out-of-turn seal fail occur. So our dev network useless easily.
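As background for the in-turn/out-of-turn terminology in the quoted comment: in Clique, the signers are sorted by address and the signer whose index equals the block number modulo the signer count is "in turn" for that block. In-turn blocks carry difficulty 2 and out-of-turn blocks difficulty 1, which is how competing blocks at the same height are ranked. A minimal sketch of that selection, using hypothetical short addresses:

```python
DIFF_IN_TURN = 2  # difficulty of a block sealed by the in-turn signer
DIFF_NO_TURN = 1  # difficulty of a block sealed by any other authorized signer


def in_turn(signer, number, signers):
    """True if `signer` is the preferred sealer for block `number`."""
    ordered = sorted(signers)  # ascending by address
    return number % len(ordered) == ordered.index(signer)


def block_difficulty(signer, number, signers):
    """Difficulty a block gets depending on who seals it."""
    return DIFF_IN_TURN if in_turn(signer, number, signers) else DIFF_NO_TURN


# Hypothetical addresses, for illustration only.
signers = ["0x01", "0x02", "0x03", "0x04", "0x05", "0x06"]

for s in signers:
    print(s, "in turn for 488677:", in_turn(s, 488677, signers))
```

Out-of-turn signers may still seal after a short random delay, so a missing in-turn signer alone should not stop the chain; the recent-signer window is the other constraint that has to be satisfied.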

After this problem, I took a look at the logs. Each signer has these error messages:

Synchronisation failed, dropping peer peer=7875a002affc775b err="retrieved hash chain is invalid"

INFO [01-02|16:42:10.902] Signed recently, must wait for others 
WARN [01-02|16:42:11.960] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"
INFO [01-02|16:42:12.128] Imported new chain segment               blocks=1  txs=0 mgas=0.000 elapsed=540.282µs mgasps=0.000  number=488116 hash=269920…afd3c7 cache=5.99kB
INFO [01-02|16:42:12.129] Commit new mining work                   number=488117 sealhash=f7b00c…787d5c uncles=2 txs=0 gas=0     fees=0          elapsed=307.314µs
INFO [01-02|16:42:20.929] Successfully sealed new block            number=488117 sealhash=f7b00c…787d5c hash=f17438…93ffe3 elapsed=8.800s
INFO [01-02|16:42:20.929] 🔨 mined potential block                  number=488117 hash=f17438…93ffe3
INFO [01-02|16:42:20.930] Commit new mining work                   number=488118 sealhash=b09b33…1526ba uncles=2 txs=0 gas=0     fees=0          elapsed=520.754µs
INFO [01-02|16:42:20.930] Signed recently, must wait for others 
INFO [01-02|16:42:31.679] Imported new chain segment               blocks=1  txs=0 mgas=0.000 elapsed=2.253ms   mgasps=0.000  number=488118 hash=763a32…a579f5 cache=5.99kB
INFO [01-02|16:42:31.680] 🔗 block reached canonical chain          number=488111 hash=3d44dc…df0be5
INFO [01-02|16:42:31.680] Commit new mining work                   number=488119 sealhash=c8a5e7…db78a1 uncles=2 txs=0 gas=0     fees=0          elapsed=214.155µs
INFO [01-02|16:42:31.680] Signed recently, must wait for others 
INFO [01-02|16:42:40.901] Imported new chain segment               blocks=1  txs=0 mgas=0.000 elapsed=808.903µs mgasps=0.000  number=488119 hash=accc3f…44bc4c cache=5.99kB
INFO [01-02|16:42:40.901] Commit new mining work                   number=488120 sealhash=f73978…c03fa7 uncles=2 txs=0 gas=0     fees=0          elapsed=275.72µs
INFO [01-02|16:42:40.901] Signed recently, must wait for others 
WARN [01-02|16:42:41.961] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"
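To find the point where a sealer stopped producing blocks, it can help to summarize each node's log rather than read it by eye. The following is a rough sketch that parses lines in the format shown above and reports the last sealed block, the number of failed synchronisations, and the number of "Signed recently" waits. The regular expressions are assumptions based only on the log lines in this report and may need adjusting for other geth versions.

```python
import re

LINE_RE = re.compile(r"^(?P<level>[A-Z]+)\s+\[(?P<ts>[^\]]+)\]\s+(?P<rest>.*)$")
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_line(line):
    """Split a log line into level, timestamp, message and key=value fields."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rest = m.group("rest")
    fields = {k: v.strip('"') for k, v in FIELD_RE.findall(rest)}
    first = FIELD_RE.search(rest)
    message = rest[: first.start()].strip() if first else rest.strip()
    return {"level": m.group("level"), "ts": m.group("ts"),
            "message": message, "fields": fields}


def summarize(lines):
    """Report the last sealed block and how often sync failed or sealing had to wait."""
    summary = {"last_sealed": None, "sync_failures": 0, "waits": 0}
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry["message"] == "Successfully sealed new block":
            summary["last_sealed"] = int(entry["fields"]["number"])
        elif entry["message"].startswith("Synchronisation failed"):
            summary["sync_failures"] += 1
        elif entry["message"].startswith("Signed recently"):
            summary["waits"] += 1
    return summary


sample = [
    'WARN [01-02|16:42:11.960] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"',
    "INFO [01-02|16:42:20.929] Successfully sealed new block            number=488117 sealhash=f7b00c…787d5c hash=f17438…93ffe3 elapsed=8.800s",
    "INFO [01-02|16:42:20.930] Signed recently, must wait for others ",
    'WARN [01-02|16:42:41.961] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"',
]

print(summarize(sample))
```

Running this over each sealer's full log and comparing the `last_sealed` values gives a quick picture of which node sealed last and how close that was to the first synchronisation failure.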

I also see some:

INFO [01-02|16:58:10.902] 😱 block lost number=488205 hash=1fb1c5…a41a42

This hash chain error was just a warning, so the nodes kept mining until the 2nd of January. Then I saw this on each of the 6 nodes:

[image]

I saw that there are a lot of issues about this error; the most similar is the one I linked here, but it is unresolved.

Most of the workarounds in those issues seem to be a restart, but in this case the chain seems to be in an inconsistent state and the nodes are always waiting for each other.

So,

  1. Any ideas? Peers are connected and accounts are unlocked; the network just entered a deadlock situation after 450k blocks.
  2. Are there any logs I can provide? I only see the warnings for the error described and the lost block, but nothing from when the nodes stopped mining.
  3. Is this PR related? les: fix fetcher syncing logic #18072
  4. Maybe it is related to the comment of @karalabe on this issue: Geth signing stops after a period of time #16406?
  5. Will upgrading from 1.8.17 to 1.8.20 solve this?
  6. In my opinion, it seems like a race condition or something similar, since I have 2 chains, one running for 2 months and the other for three months, and this is the first time this error has happened.

These are other related issues:

#16444 (Same issue, but I don't have votes pending in my snapshot)

#14381

#16825

#16406
