This repository was archived by the owner on Sep 8, 2025. It is now read-only.

Conversation

@carver (Contributor) commented Nov 20, 2019

Exploring a slight variant on #1874

So far, I haven't found any dealbreakers. This is obviously not finished, but I have to wrap for the night, and wanted to share the concept.

I think there is eventually a path to encapsulating everything into the consensus engine, but I don't think that's strictly necessary for this approach (where we can still do some patching).

cc @cburgdorf for 👀


@cached_property
def _consensus(self) -> ConsensusAPI:
    return self.consensus_class(self.chaindb.db)
@cburgdorf (Contributor) commented Nov 20, 2019

Doesn't this mean each VM would get its own instance of consensus_class? That would be tragic, as the ConsensusAPI can be stateful (Clique keeps an in-memory cache). We currently do not have a test case for this, but it would be good to add one. It should be easy to modify this to keep consensus state in memory and mine blocks that cross VM boundaries.
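For illustration only (a minimal sketch, not py-evm code; all names are hypothetical), this is the difference between the per-VM cached property above and a single consensus instance shared across VM boundaries:

from functools import cached_property

class PerVMConsensus:
    # Configured per VM class, e.g. a Clique consensus engine
    consensus_class = None

    def __init__(self, base_db):
        self._base_db = base_db

    @cached_property
    def _consensus(self):
        # Cached on *this* VM instance only: the VM for the next fork gets a
        # brand-new consensus object, so any in-memory cache is lost.
        return self.consensus_class(self._base_db)

class SharedConsensus:
    def __init__(self, base_db, consensus):
        self._base_db = base_db
        # A single long-lived instance handed in from the chain level would
        # keep its state across VM boundaries, at the cost of statefulness.
        self._consensus = consensus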

@carver (Contributor, Author) replied:

Yes, and I think it is important to figure out how to make it stateless. Everything else that might change in a fork (in the VM) can be built up and torn down on demand, which we lean on heavily to ensure smooth fork transitions. Managing a change in Clique at a block boundary will be a nightmare if it requires some persistent in-memory store.

There must be some way for Clique to load the data from disk. How long does that take in practice?

@cburgdorf (Contributor) commented Nov 20, 2019

Mmmh.. ok, let's see. So, there are multiple places of caching in the current Clique implementation, but let's ignore the ones that are for performance and look at the HeaderCache, which I think is the problematic one.

To recall: in Clique you cannot validate the seal of a header unless you have access to the previous header. Let's assume we validate a series of headers via validate_chain(...). These headers may span multiple VMs and result in a validate_seal(...) call on each of those VMs. By the time we validate the seal of the second header (or third, fourth, nth), the previous header won't be in the database yet (after all, we are in the middle of validating a whole series, so none of it will get to the db until we've validated the entire series).

So in order to be able to look it up on subsequent validate_seal calls, we first put it into our HeaderCache when validate_seal is called.

def validate_seal(self, header: BlockHeaderAPI) -> None:
    """
    Validate the seal of the given ``header`` according to the Clique consensus rules.
    """
    if header.block_number == 0:
        return
    validate_header_integrity(header, self._epoch_length)
    self._header_cache[header.hash] = header

We later flush that as soon as we know that the headers are in the db (we actually keep the last 1000 in memory to account for temporary forks).

def _lookup_header(self, block_hash: Hash32) -> BlockHeader:
    if block_hash in self._header_cache:
        return self._header_cache[block_hash]
    try:
        return self._chain_db.get_block_header_by_hash(block_hash)
    except HeaderNotFound:
        raise ValidationError("Unknown ancestor %s", block_hash)

The other way to fix this that I see is to override validate_chain with a Clique-specific implementation, because that method has access to the entire series, but then the consensus engine would operate on the chain level again.


  chain = MiningChain.configure(
-     vm_configuration=vms,
+     vm_configuration=clique_vms,
@cburgdorf (Contributor) commented Nov 20, 2019

So, I guess we would then also have a NoConsensusApplier and PowConsensusApplier and the canonical way to activate a chain for a specific type of consensus would be:

Chain.configure(vm_configuration=SomeApplier.amend_vm_configuration(vms))

But as I noted in a comment above, this setup seems designed to have each VM maintain its own ConsensusAPI instance, which only works if the ConsensusAPI is stateless.

@carver (Contributor, Author) commented Nov 20, 2019

So, I guess we would then also have a NoConsensusApplier and PowConsensusApplier and the canonical way to activate a chain for a specific type of consensus would be:

We don't seem to need it, because it's so trivial to just set the consensus class for the others, but I'm not really opposed. I think we can have a SimpleConsensusApplier that just changes the consensus_class and is sufficient for both PoW and NoProof.

which only seems to work if the ConsensusAPI were stateless.

Yeah, I'm not ready to give up on stateless VMs yet.
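A minimal sketch of the SimpleConsensusApplier idea mentioned above (hypothetical name and shape, assuming the VM classes expose a configure(...) classmethod for overriding class attributes, as py-evm's Configurable mixin does):

from typing import Iterable, Tuple

class SimpleConsensusApplier:
    def __init__(self, consensus_class: type) -> None:
        self.consensus_class = consensus_class

    def amend_vm_configuration(
            self,
            vm_configuration: Iterable[Tuple[int, type]]) -> Tuple[Tuple[int, type], ...]:
        # Swap in the desired consensus engine on every VM; for PoW and
        # NoProof this is all an applier needs to do, since they carry no
        # cross-VM state.
        return tuple(
            (activation_block, vm_class.configure(consensus_class=self.consensus_class))
            for activation_block, vm_class in vm_configuration
        )

Used in the pattern quoted above, that would look roughly like Chain.configure(vm_configuration=SimpleConsensusApplier(PowConsensus).amend_vm_configuration(vms)), where PowConsensus is a placeholder name.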

        for block_number, vm in chain_class.vm_configuration
    )
    return chain_class.configure(
        vm_configuration=no_pow_vms,
@carver (Contributor, Author) commented Nov 20, 2019

Yeah, this seems like a nice place to use that DefaultConsensusApplier concept.

@carver (Contributor, Author) commented Nov 20, 2019

I split it out into two commits, the first should be an unchanged version of #1874.

So now it should be easier to see which changes I'm proposing.

@carver force-pushed the clique-in-vm branch 5 times, most recently from 7655acd to 2ce1176 on November 20, 2019 23:26
@carver changed the title from "[Concept] Moving consensus engine into VM" to "[WIP] Moving consensus engine into VM" on Nov 20, 2019
@carver (Contributor, Author) commented Nov 20, 2019

Okay, so now the move to the VM is isolated to 2ce1176.

The only piece I got stuck on is that somehow the state root changed. The balances of the relevant accounts are the same, and I also checked that the balance at the 0 address was empty, so it's not clear what happened there.

Two main questions left in my mind:

  • what's the deal with the state root changing -- (Edit: I got it -- it was that the coinbase delta_balance was triggering a touch. I posted the fix)
  • what's the cost of losing the cross-block cache

@carver (Contributor, Author) commented Nov 20, 2019

  • what's the deal with the state root changing

It looks like it's "touching" the account at 0, so it previously didn't exist, and then afterwards it exists as an empty account. I'm not sure why this PR would have that effect.

Edit: Yup, I was just able to confirm that if you run the test with only the first commit, then account 0 does not get touched.
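For reference, a hedged sketch of the kind of fix described in the edit above (illustrative only, not the exact change in this PR, assuming a delta_balance-style state API as mentioned earlier): skip the balance change when the reward is zero, so the coinbase account is never touched.

def apply_block_reward(state, coinbase, reward):
    # With a zero reward (as in Clique, where the coinbase here is 0x0),
    # calling delta_balance would still "touch" the account, creating an
    # empty account at that address and changing the state root.
    if reward > 0:
        state.delta_balance(coinbase, reward)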

@carver changed the title from "[WIP] Moving consensus engine into VM" to "Moving consensus engine into VM" on Nov 21, 2019
@carver requested a review from cburgdorf on November 21, 2019 00:02
@cburgdorf (Contributor) commented Nov 21, 2019

I generally like the direction. However, I'm pretty sure this will crash when we validate a series of headers across VM boundaries.

what's the cost of losing the cross-block cache

You might have missed my in-line comment so let me put it here again.

So, there are multiple places of caching in the current Clique implementation, but let's ignore the ones that are for performance and look at the HeaderCache, which I think is the problematic one.

To recall: in Clique you cannot validate the seal of a header unless you have access to the previous header. Let's assume we validate a series of headers via validate_chain(...). These headers may span multiple VMs and result in a validate_seal(...) call on each of those VMs. By the time we validate the seal of the second header (or third, fourth, nth), the previous header won't be in the database yet (after all, we are in the middle of validating a whole series, so none of it will get to the db until we've validated the entire series).

So in order to be able to look it up on subsequent validate_seal calls, we first put it into our HeaderCache when validate_seal is called.

def validate_seal(self, header: BlockHeaderAPI) -> None:
    """
    Validate the seal of the given ``header`` according to the Clique consensus rules.
    """
    if header.block_number == 0:
        return
    validate_header_integrity(header, self._epoch_length)
    self._header_cache[header.hash] = header

We later flush that as soon as we know that the headers are in the db.

def _lookup_header(self, block_hash: Hash32) -> BlockHeader:
    if block_hash in self._header_cache:
        return self._header_cache[block_hash]
    try:
        return self._chain_db.get_block_header_by_hash(block_hash)
    except HeaderNotFound:
        raise ValidationError("Unknown ancestor %s", block_hash)

The other way to fix this that I see is to override validate_chain with a Clique-specific implementation, because that method has access to the entire series, but then the consensus engine would operate on the chain level again. 🤷‍♂️

@cburgdorf (Contributor) commented:

If we make the ConsensusAPI a chain-level thing again and then replace validate_chain, we would not have to touch validate_seal on the VM at all; it would just end up not being used. The replaced validate_chain method on the chain would talk directly to the consensus_engine to validate the seals and would have access to the entire series of headers.

It does also mean that you could not directly call vm.validate_seal(...) under the rules of CliqueConsensus, because applying the consensus engine simply begins at the chain level rather than the VM level. I'm not sure whether this is any better, but I don't see how else to deal with it. The fundamental problem is simple: in Clique consensus, validation needs to happen sequentially and needs to be able to look up parent headers during validation. How can that assumption hold if we try to validate headers in a vacuum? If we validate a series of headers, I need access to that series, which validate_seal naturally doesn't have, because it only validates a single header and the in-flight series of headers cannot be looked up from the database.
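To make the chain-level alternative concrete, here is a rough sketch (hypothetical code, not py-evm's API; consensus_engine and validate_seal_with_parent are made-up names, and Chain stands for the existing chain base class), assuming the existing validate_chain(root, descendants, ...) shape:

class CliqueChain(Chain):
    def validate_chain(self, root, descendants, seal_check_random_sample_rate=1):
        seen = {root.hash: root}
        for header in descendants:
            # Earlier headers of the in-flight series are available here in
            # memory, even though none of them have been persisted yet.
            parent = seen.get(header.parent_hash)
            self.consensus_engine.validate_seal_with_parent(header, parent)
            seen[header.hash] = header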

@carver (Contributor, Author) commented Nov 25, 2019

In Clique consensus, validation needs to happen sequentially and needs to be able to look up parent headers during validation. How can that assumption hold if we try to validate headers in a vacuum? If we validate a series of headers, I need access to that series, which validate_seal naturally doesn't have, because it only validates a single header and the in-flight series of headers cannot be looked up from the database.

Good point! In fact, it looks like neither the current in-VM solution nor the current in-Chain solution will work with validate_chain. It is called on random pairs of headers deep in the future chain of headers (during skeleton sync), before their parents have been downloaded or validated.

So here's what I'm thinking:

  • We clarify in the validate_chain docs that it is designed to be usable even when parent headers from the chain are not available (which is the assumption callers currently make). Any seal check run inside it must be possible under that constraint (for example, we could do simple Clique checks, like difficulty in {1, 2}).
  • We add a new Chain.validate_extension(root, headers) that is similar to validate_chain but can only be called when the root header is present in the database.
    • validate_extension() is empty for PoW, because all the work can be done in validate_chain
    • almost all of the Clique logic would live in validate_extension(), since it requires the presence of parents
  • We add a new VM.validate_header_extension(child, parents) that would be similar to VM.validate_header(), but the oldest parent must be persisted to the database. This is what Chain.validate_extension() would call. Clique would have access to all the parents in memory, and everything not in memory would be in the DB. (Rough signatures are sketched below.)
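Rough signatures for the proposed split (taken from the bullets above; hypothetical stubs, not a final API):

class Chain:
    def validate_chain(self, root, descendants, seal_check_random_sample_rate=1):
        # May be called on headers whose parents are not available yet, so
        # only checks that work "in a vacuum" belong here (e.g. Clique's
        # difficulty in {1, 2}).
        ...

    def validate_extension(self, root, headers):
        # Only callable when ``root`` is already in the database; the
        # parent-dependent Clique logic would live here.
        ...

class VM:
    def validate_header_extension(self, child, parents):
        # Like validate_header(), but the oldest parent must be persisted;
        # the rest of ``parents`` are available in memory.
        ...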

@cburgdorf (Contributor) commented:

it looks like neither the current in-VM solution nor the current in-Chain solution will work with validate_chain.

Well, yes, that's why on the Trinity side we differentiate between VALIDATE_ON_ARRIVAL and VALIDATE_AFTER_STITCHING.

We add a new Chain.validate_extension(root, headers) that is similar to validate_chain but can only be called when the root header is present in the database.

Your last commit defines Chain.validate_extension(...) as:

def validate_extension(
            self,
            new_header: BlockHeaderAPI,
            check_seal: bool = True) -> None:

So I guess there's a conflict between what you explained and the code in that commit. I'm going to assume validate_extension would have access to the batch of parent headers needed to fulfill the validation.

validate_chain but can only be called when the root header is present in the database.

I'm also not entirely sure why it needs to be in the database at this point. Isn't the only important thing that the VM has access to it? Shouldn't in-memory access be enough?

And am I right to assume that on the Trinity side, instead of messing with full_rate / sample_rate for validate_chain, we would just call it as we do today, but it would translate into a noop for Clique? And Chain.validate_extension() would continue to be called after stitching.

I'm a bit skeptical that we really need all the validate_extension* APIs. I also haven't seen other clients make this distinction.

Here's another proposal:

We could change VM.validate_header from what it is today 👇

    def validate_header(cls,
                        header: BlockHeaderAPI,
                        parent_header: BlockHeaderAPI,
                        check_seal: bool = True
                        ) -> None:

To this 👇 :

    def validate_header(cls,
                        header: BlockHeaderAPI,
                        parent_headers: Iterable[BlockHeaderAPI],
                        check_seal: bool = True
                        ) -> None:

And then in Chain.validate_chain(...) we call vm_class.validate_header(...) with all parent headers instead of just the immediate one. This would allow us to satisfy Clique at the VM level without introducing these new APIs.
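A sketch of how Chain.validate_chain(...) might call that widened API (written against the proposal above, not current code; get_vm_class_for_block_number mirrors the existing chain API):

def validate_chain(self, root, descendants, seal_check_random_sample_rate=1):
    all_headers = (root,) + tuple(descendants)
    for index, header in enumerate(all_headers[1:], start=1):
        vm_class = self.get_vm_class_for_block_number(header.block_number)
        # Hand the VM every earlier header in the series, not just the
        # immediate parent, so Clique can look up what it needs in memory.
        vm_class.validate_header(
            header,
            parent_headers=all_headers[:index],
            check_seal=True,
        )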

Ok, I have to admit I have a hard time thinking it all through without actually playing with the code, and it took me a ridiculously long time to even write this response. Tomorrow I'm going to play with the code to see where I end up 🙃

@cburgdorf (Contributor) commented:

Ok, I think I'm starting to understand what you meant by:

but the oldest parent must be persisted to the database

But I want to make sure to also understand the consequences of that. Here's what I did now:

  1. I took your PR without the last commit and drafted something that:
  • Adds a test that calls validate_chain with headers spanning across multiple forks
  • Removes the HeaderCache entirely
  • Changes the validate_seal API to optionally accept cached parent headers

138e253

I'm not proposing this as the API, but it felt like the most ad hoc thing I could do to let me play with the general idea of getting rid of the HeaderCache and hence going fully stateless.

  2. Next, I updated Trinity to use that: ethereum/trinity@b22a3a9

  3. I tried running trinity --goerli --sync-mode full to see if it works in practice

It doesn't, and I guess that is where your comment about "the oldest parent must be persisted to the database" comes into play.

 WARNING  2019-11-27 11:36:22,843                    VM  Failed to validate header proof of work on header: {'parent_hash': b'\xf6\xdaq\x8d%?}&\xbc\xb2\x83\xe3\xca\x1f\xb6\xf1\xf0\xe7\x0c\x0b\xaa\xfa\xa21\xd0]\x03\t\x02\xac\xac\xca', 'uncles_hash': b'\x1d\xccM\xe8\xde\xc7]z\xab\x85\xb5g\xb6\xcc\xd4\x1a\xd3\x12E\x1b\x94\x8at\x13\xf0\xa1B\xfd@\xd4\x93G', 'coinbase': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'state_root': b']l\xde\xd5\x85\xe7<N2,0\xc2\xf7\x82\xa361o\x17\xdd\x85\xa4\x86;\x9d\x83\x8d-K\x8b0\x08', 'transaction_root': b'V\xe8\x1f\x17\x1b\xccU\xa6\xff\x83E\xe6\x92\xc0\xf8n[H\xe0\x1b\x99l\xad\xc0\x01b/\xb5\xe3c\xb4!', 'receipt_root': b'V\xe8\x1f\x17\x1b\xccU\xa6\xff\x83E\xe6\x92\xc0\xf8n[H\xe0\x1b\x99l\xad\xc0\x01b/\xb5\xe3c\xb4!', 'bloom': 0, 'difficulty': 2, 'block_number': 194, 'gas_limit': 8675502, 'gas_used': 0, 'timestamp': 1548950348, 'extra_data': b'Parity Tech Authority\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x900Bkq9\x04C\x0fg\x15\xf7\xd5\xaal\x8dop\x9e\x9a\x9e.\xfd\xe35\xa6\xaf\xdd\xa9\xd2\x8d7c\x1d\x04\x1c\x9fw##\x99\x03\tN\xac\xf8\xa1Z\xa3X\xa82\xcd\xf3}j7\x88 "\xf9\xces\x80\x01', 'mix_hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'nonce': b'\x00\x00\x00\x00\x00\x00\x00\x00'}
 WARNING  2019-11-27 11:36:22,843  ETHHeaderChainSyncer  Received invalid header. Starting over. <BlockHeader #194 78a33337> is not a valid child of <BlockHeader #193 f6da718d>: ('Unknown ancestor %s', '0xb8525143bcacafe99b5da3e8406bc98605a66e763a0339b2e9352e02628090f8')
    INFO  2019-11-27 11:36:22,860  RegularChainBodySyncer  Imported block 1 (0 txs) in 0.04 seconds, with 9m3w5d lag
    INFO  2019-11-27 11:36:22,871  RegularChainBodySyncer  Imported block 2 (0 txs) in 0.01 seconds, with 9m3w5d lag
    INFO  2019-11-27 11:36:22,886  RegularChainBodySyncer  Imported block 3 (0 txs) in 0.01 seconds, with 9m3w5d lag
    INFO  2019-11-27 11:36:22,904  RegularChainBodySyncer  Imported block 4 (0 txs) in 0.02 seconds, with 9m3w5d lag
    INFO  2019-11-27 11:36:22,925  RegularChainBodySyncer  Imported block 5 (0 txs) in 0.02 seconds, with 9m3w5d lag
    INFO  2019-11-27 11:36:25,914  RegularChainBodySyncer  Imported block 31 (0 txs) in 0.10 
...
(goes on until 192)

If I interpret that correctly, it means that the second batch of headers arrives and we call validate_chain at a time when the first batch hasn't made it into the database yet.

This wasn't a problem in the previous, non-stateless model, because the HeaderCache made sure to keep the headers it had seen around until they finally hit the database.

So the thing I'm wondering about is how this would affect the syncing code, especially since waiting for the blocks to be imported is the slow part of the sync. Or would we persist the headers immediately after they were validated?

@carver (Contributor, Author) commented Nov 27, 2019

So I guess there's a conflict between what you explained and the code in that commit. I'm going to assume validate_extension would have access to the batch of parent headers needed to fulfill the validation.

So what I decided while I was fooling around with the code is that most of the time the validation can happen one header at a time, so the multi-parent thing is unnecessary. We can just validate it at import_block time, one header at a time.

... But we might want to formalize the header-only persist use-case (which we are hacking around for two different scenarios already: Beam Sync and Light Chains). That often persists multiple headers at once. So we would probably want to validate an extension of multiple headers at once.

Even then, we wouldn't need to supply multiple parents at the Chain API level (because we are still presuming that the parents of the headers to import are present), but we would need multiple parents at the VM level. Because, presumably, we would validate the whole series of headers before persisting any of them, in order to keep the benefit of batching the writes to the database. So we would have something like:

class Chain: 
    def import_headers(self, headers):
        # this only imports headers, trying to access block bodies will still crash after this completes

        for index, header in enumerate(headers):
            vm = self.get_vm(header)

            # pass in any parents that are not already in the database
            parents = headers[:index]
            vm.validate_extension(parents, header)
            
        self.chaindb.persist_header_chain(headers) 

@carver (Contributor, Author) commented Nov 27, 2019

  • Adds a test that calls validate_chain with headers spanning across multiple forks

I think we'll need a test that validate_chain doesn't crash when validating headers 10-15 while only headers 0-5 are present in the database.
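Something like this hypothetical pytest sketch (the chain and headers fixtures are placeholders, not existing fixtures):

def test_validate_chain_with_detached_segment(chain, headers):
    # Headers 0-5 are already persisted...
    chain.chaindb.persist_header_chain(headers[:6])
    # ...but a later segment (10-15) is validated before 6-9 ever arrive,
    # which is exactly what happens during skeleton sync.
    chain.validate_chain(headers[10], tuple(headers[11:16]))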

That's why I think it's preferable to add a new validate method that can do stricter validations, which only work when all of a header's parents are available, because there are certain checks (like the Clique checks) that we simply can't reliably run in the above scenario.

@carver (Contributor, Author) commented Jan 15, 2020

Handled in #1899

@carver closed this on Jan 15, 2020
@carver deleted the clique-in-vm branch on January 15, 2020 23:33