This repository was archived by the owner on Aug 2, 2021. It is now read-only.

Conversation

@nonsense
Contributor

This PR is:

  1. Fixing message handlers to run asynchronously, so that we don't block the TCP socket for a given peer.
  2. Fixing the signature of peer.Drop, which should not take an error argument, as it doesn't do anything with it.
  3. Removing dead code: SwarmChunkServer.
  4. Removing dead code: partially removing skipCheck (removing it completely will be a longer effort).
  5. Removing the RETRIEVE_REQUEST stream, which is not used by any production code.
  6. Removing tests that somehow expect OfferedHashes messages after a RetrieveRequest; this is definitely not how the master code has been working for months.

case *ChunkDeliveryMsgSyncing:
	msg = (*ChunkDeliveryMsg)(r)
	mode = chunk.ModePutSync
case *ChunkDeliveryMsg:
Contributor Author

One of the tests actually sends a plain ChunkDeliveryMsg, not one of the two specific types, and therefore fails.

This is not code that is hit in production, but I decided to add this here temporarily, rather than dig into the test.

It is not clear why this test was not failing with the previous code.
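For context, the dispatch under discussion reduces to a type switch along these lines. The type and mode names mirror the Swarm code, but the snippet is an illustrative sketch, not the real implementation; in particular, mapping the retrieval message to ModePutRequest is an assumption.

```go
package main

import "fmt"

// PutMode mirrors the chunk.ModePut* constants.
type PutMode int

const (
	ModePutRequest PutMode = iota
	ModePutSync
)

// Pared-down message types; the real ones live in the stream package.
type ChunkDeliveryMsg struct{ Addr, SData []byte }
type ChunkDeliveryMsgRetrieval ChunkDeliveryMsg
type ChunkDeliveryMsgSyncing ChunkDeliveryMsg

// putMode picks the localstore put mode from the concrete message type. The
// plain *ChunkDeliveryMsg case is the temporary catch-all described above,
// added only because one test sends the untyped message.
func putMode(r interface{}) (PutMode, bool) {
	switch r.(type) {
	case *ChunkDeliveryMsgRetrieval:
		return ModePutRequest, true
	case *ChunkDeliveryMsgSyncing:
		return ModePutSync, true
	case *ChunkDeliveryMsg:
		return ModePutSync, true // temporary: only hit by tests
	}
	return 0, false
}

func main() {
	mode, ok := putMode(&ChunkDeliveryMsgSyncing{})
	fmt.Println(mode == ModePutSync, ok) // true true
}
```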

)
streamer.delivery.RequestFromPeers(ctx, req)

stream := NewStream(swarmChunkServerStreamName, "", true)
Contributor Author

SwarmChunkServer doesn't exist anymore, and RETRIEVE_REQUEST doesn't exist anymore.

Also, this is a huge test which only checks that when we issue a request to RequestFromPeers, we get specific messages back: a SubscribeMsg for the RETRIEVE_REQUEST stream and a RetrieveRequestMsg. This is rather useless for two reasons:

  1. We are locking in behaviour that doesn't provide much value.
  2. We are testing very basic functionality, which is hit 1000 times in a single integration test.

There is no need to lock in this behaviour in such a complicated way. Not to mention that the RETRIEVE_REQUEST stream has served no purpose in any production code for a long time.

Code: 6,
Msg: &ChunkDeliveryMsg{
	Addr:  hash,
	SData: hash,
Contributor Author

How that worked before is beyond me... Don't we have any validation for ChunkDeliveryMsg?

Member

further up in localstore I guess there is, just not on the msg level

Contributor Author

True, there is validation in localstore; I just found it confusing here. Now that you mention it, it makes sense.

Member

@janos janos left a comment

There is a lot of work in this PR, thanks Anton.

return pstreams
}

func (api *API) GetPeerClientSubscriptions() map[string][]string {
Member

Maybe write a test for this function as well?

Contributor Author

This function is trivial, just like GetPeerServerSubscriptions. I don't think we need tests for it; the chances that someone will refactor it and break it are so slim that I'd rather we not add 50 lines of tests like for the other one.

Also, this is not an API call that is part of critical functionality; it is here for debugging purposes.

If we add a unit test for this, we will probably have to remove it once we review the Stream implementation, something that has been under discussion over the last weeks.

Contributor Author

Actually it is not 50 lines but much more; we also have TestGetServerSubscriptionsRPC.

Contributor Author

TestGetServerSubscriptionsRPC depends on:

  • simulations package
  • adapters
  • netStore
  • delivery
  • spec for SubscribeMsg
  • json snapshots
  • exact naming of the function, not checked by the compiler but dynamically linked

I'd rather we don't write such low-value tests for such trivial functionalities.

Member

Well, the tools that can be used to test the code should not be relevant to the decision of whether the code should be tested. If this code exists and somebody refactors the stream package, I suppose it would be nice to have a test to verify that there is no regression. I would not block this PR over not testing this code, if it is not an issue for others.

Contributor Author

@janos I strongly disagree with this.

If you are using 10 libraries to test a for loop that adds values to a slice, you are increasing the complexity of the codebase for no good reason.

Now that these tests are part of the codebase, every time we change any of those 10 libraries we also have to modify the tests. This is a development cost we should not ignore. The benefit of this test is that we can be sure we have not broken the for loop that adds values to a slice; rather low value, in my opinion.
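For a sense of scale, the function under discussion boils down to something like the sketch below. The Stream type and the map shapes are approximations, not the real stream package declarations.

```go
package main

import (
	"fmt"
	"sort"
)

// Stream is a pared-down stand-in for the stream package's Stream type.
type Stream struct {
	Name string
	Key  string
}

func (s Stream) String() string { return s.Name + "|" + s.Key }

// clientSubscriptions collects, per peer ID, the names of the streams the
// peer has client subscriptions for - essentially a loop appending values
// to a slice.
func clientSubscriptions(peers map[string][]Stream) map[string][]string {
	out := make(map[string][]string, len(peers))
	for id, streams := range peers {
		for _, s := range streams {
			out[id] = append(out[id], s.String())
		}
		sort.Strings(out[id]) // deterministic output for callers
	}
	return out
}

func main() {
	subs := clientSubscriptions(map[string][]Stream{
		"peer-1": {{Name: "SYNC", Key: "1"}},
	})
	fmt.Println(subs["peer-1"]) // [SYNC|1]
}
```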

Member

Thanks for the explanations. I will not insist on this test.

Contributor Author

@janos giving a bit more thought to this particular piece of code: the sole purpose of this API call is to be a helper method for debugging the already broken subscriptions implementation.

I'd rather we focus on solving the bugs in the subscriptions implementation (without commenting on what this involves: a rewrite, a bug fix or whatever) than write tests for the helper methods we add to the codebase to help us track the bigger issues in Swarm.

This feels like adding a test for a metrics counter or timer to me.

So if we insist on having a test for this API (which, judging by your comment above, you don't), I'd rather just remove it altogether and find another way to debug the subscriptions functionality.

Member

@janos janos left a comment

Thanks for addressing my comments, Anton.

@nonsense
Contributor Author

@janos thanks for the thorough review and for putting up with my rushed PR!

Member

@zelig zelig left a comment

are you sure light node functionality does not regress here?

changes here are maybe too big, too hasty?

// TODO: may need to implement protocol drop only? don't want to kick off the peer
// if they are useful for other protocols
-func (p *Peer) Drop(err error) {
+func (p *Peer) Drop() {
Member

Can we please preserve the error and log it, or even put it in p2p.DiscSubprotocolError? This is losing too much contextual info.

Contributor

i agree this could be converted to:

func (p *Peer) Drop(err error) {
	if err == nil {
		err = p2p.DiscSubprotocolError
	}
	p.Disconnect(err)
}

Contributor Author

We could log the error, but I'd rather not propagate it down to p2p and replace DiscSubprotocolError just now, as that changes behaviour, and we already have too many changes.

The current change preserves the behavior we already have.

Contributor Author

@justelad I am not changing this behavior in this PR either; let's keep it as small as possible.

Member

@zelig zelig left a comment

Is removing the retrieve request stream justified just because skipCheck is true in current production code? There was a reason why skipCheck was a (tested) option for the retrieve request stream: we wanted to check whether, in some scenarios, chunk delivery for retrieve requests should go via the offered/wanted hashes roundtrip. While this may introduce latency, without it multiple successful requests will lead to multiple deliveries.

At this point I am leaning towards handling the latter issue with request cancellation (#409), which also has the chance to cancel requests on all forwarding nodes.

In light of these changes, the question arises: why are retrieve requests even part of the stream package and stream protocol? It would make sense to remove request handling from stream entirely, put it under network, and make it part of the bzz protocol directly.

@nonsense
Contributor Author

nonsense commented Apr 12, 2019

@zelig good question. As a general rule of thumb, I think code that is not used by users, but is only hit in unit tests, has no place in the codebase.

Currently we have:

  • RETRIEVE_REQUEST stream (retrieve requests handled through an offered/wanted hashes flow?),
  • SwarmChunkServer
  • skipCheck
  • priority queue
  • TakeoverProof

None of this is necessary for core Swarm functionality; it increases the carrying cost of development and makes the code very difficult to reason about. This is the primary reason we have had bugs in the simplest use-cases of Swarm, namely uploading a file and trying to fetch it from a different node in the network.

To summarize, these are the bugs we've found during this effort that were 'hidden' by this unused functionality:

  • blocking TCP sockets, hidden behind peer.Drop() and behind priority-queue sends.
  • an order of magnitude more ChunkDelivery messages, because the offered-hashes/wanted-hashes protocol does not keep track of them.
  • (not a known bug, but a very convoluted implementation of the default RetrieveRequest flow.)
  • the impossibility of adding instrumentation and seeing Swarm behavior in a real network, due to priority queues and convoluted use of contexts.

Because of all this I think it is justified to remove this dead code.


Regarding the light client: it is possible that this breaks a very small part of it, but fixing that should be rather simple.

@nonsense
Contributor Author

In the simplified version of the code, RetrieveRequests have nothing to do with Streams. The only thing they share is the Fetchers, so that we have only one request per chunk, no matter whether that request was made through a RetrieveRequest or through syncing.

You have a point that this doesn't even belong in this package, but I have a hard time getting even simpler changes in, so I'd rather we change this gradually.
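The "one request per chunk" sharing can be illustrated with a once-per-address table. This is a sketch of the idea only, not the actual Fetcher implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchers ensures a single in-flight request per chunk address, regardless
// of whether the chunk was asked for via a RetrieveRequest or via syncing.
type fetchers struct {
	mu sync.Mutex
	m  map[string]*sync.Once // keyed by chunk address (a hex string here)
}

func newFetchers() *fetchers {
	return &fetchers{m: make(map[string]*sync.Once)}
}

// fetch runs request at most once per address; concurrent callers asking for
// the same address share that single request.
func (f *fetchers) fetch(addr string, request func()) {
	f.mu.Lock()
	once, ok := f.m[addr]
	if !ok {
		once = new(sync.Once)
		f.m[addr] = once
	}
	f.mu.Unlock()
	once.Do(request)
}

func main() {
	f := newFetchers()
	calls := 0
	f.fetch("chunk-addr", func() { calls++ }) // triggers the request
	f.fetch("chunk-addr", func() { calls++ }) // deduplicated
	fmt.Println(calls) // 1
}
```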

@janos
Member

janos commented Apr 12, 2019

Moving retrieve requests from the stream package is a very good idea. It would just need a different protocol as it is now part of the stream protocol specification.

@nonsense
Contributor Author

Yes, @janos, that's correct. Let's postpone this until we have all the known bug fixes integrated in the codebase, together with a solution for the subscription bug. Refactoring RetrieveRequests out of the Stream protocol is rather low priority right now, in my opinion.

@janos
Member

janos commented Apr 12, 2019

I agree, @nonsense.

Contributor

@acud acud left a comment

@nonsense great work. I left a few minor comments.


// This test checks the default behavior of the server, that is
// when it is serving Retrieve requests.
func TestLigthnodeRetrieveRequestWithRetrieve(t *testing.T) {
Contributor

I think that registryOptions is supposed to indicate that this is light-node behavior, but I see what you mean; it is not so clear what is being tested here.

-return p.handleOfferedHashesMsg(ctx, msg)
+go func() {
+	err := p.handleOfferedHashesMsg(ctx, msg)
+	if err != nil {
Contributor

I am good with leaving these functions as they are right now, for clarity. If we keep the error (and actually use it) in peer.Drop, we could drop the log, because the error will be automatically logged on the peer drop.

@nonsense
Contributor Author

There are many discussions about peer.Drop() in this PR. I want to highlight that we are failing to get Swarm working even in the most basic use-case, where the whole network consists of trusted nodes that adhere to the protocols.

Adding peer.Drop() at various locations and reconnecting to the same peer right after is already broken and wrong. If the peer is misbehaving, why are we re-connecting to it right after we drop it?

Additionally, peer.Drop() is a very insidious way to cover for deadlocks and logical problems in the protocols: if a peer is misbehaving (read: deadlocked) and it times out, we just disconnect, reconnect and move on, essentially solving the deadlock at the expense of a timeout.

We should first make sure that Swarm works well in a controlled environment, and then add tests that cover misbehaving peers. We are currently seeing what happens when you try to do both of these things simultaneously.

@acud
Contributor

acud commented Apr 12, 2019

OK, I see where you're going with this, and I tend to agree. I never understood why we need to drop a peer (the p2p package negotiation should already cover protocol version discrepancies) without actually maintaining a peer blacklist (whether a decaying one to mitigate DoS vectors, or a permanent one for litigation or other cryptoeconomic violations).

@nonsense
Contributor Author

@justelad yes, dropping peers without maintaining a blacklist doesn't make sense to me either. Maybe a node is misbehaving because it has a temporary network problem; that could be a reason to ignore it for a while but try it again later.

Bottom line: this functionality needs more work (backoff, blacklists, etc.), and in its current form it is more harmful than useful.

@zelig
Member

zelig commented Apr 12, 2019

@nonsense i must say i agree with everything you responded. Great job...
I will keep insisting on certain bits that fall under 'feature regression', but it will be much nicer to add them back and check their impact, or prove them pointless :)

@nonsense
Contributor Author

The PR has two approvals, so I am merging for now. This is not going to master in the foreseeable future, so we have plenty of time to decide whether some of the dead code is actually not dead but useful (I doubt it). One thing to review is the light client, which AFAIK has 0 integration tests... it is probably worth running the binary with the --lightnode flag just to see whether it works.

However, I think it is more important to integrate the rest of the bug fixes, so that we have a more stable full Swarm node, and only then worry about the rest.

@nonsense nonsense merged commit 6ad721e into ethersphere:swarm-rather-stable Apr 12, 2019
@nonsense
Contributor Author

nonsense commented Apr 12, 2019

@zelig about the pointlessness of the features: the latencies we have on the simplify-fetchers branch are much lower than those on master, so this in itself is proof that we are going in the right direction.

I very much welcome any improvements on those latencies, whether through a benchmark test, an experiment between two deployments, or some other way, but definitely not ones based purely on intuition, unless the argument is easy enough to win over everyone on the team.

nonsense added a commit that referenced this pull request May 10, 2019
cmd/swarm/swarm-smoke: improve smoke tests (#1337)

swarm/network: remove dead code (#1339)

swarm/network: remove FetchStore and SyncChunkStore in favor of NetStore (#1342)