This repository was archived by the owner on Aug 2, 2021. It is now read-only.

Conversation

@nonsense
Contributor

This PR is:

  1. Fixing message handlers to run asynchronously, so that we don't block the TCP socket for a given peer.
  2. Fixing the signature of peer.Drop, which should not take an error argument, as it doesn't do anything with it.
  3. Removing dead code: SwarmChunkServer.
  4. Removing dead code: partially removing skipCheck (removing it completely will be a longer effort).
  5. Removing the RETRIEVE_REQUEST stream, which is not used by any production code.
  6. Removing tests that somehow expect OfferedHashes messages after a RetrieveRequest; this is definitely not how the master code has been working for months.

case *ChunkDeliveryMsgSyncing:
	msg = (*ChunkDeliveryMsg)(r)
	mode = chunk.ModePutSync
case *ChunkDeliveryMsg:
Contributor Author

One of the tests actually sends a plain ChunkDeliveryMsg, not one of the two specific types, and therefore fails.

This is not code that is hit in production, but I decided to add this here temporarily, rather than dig into the test.

It is not clear why this test was not failing with the previous code.
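For context, the dispatch under discussion reduces to a type switch along these lines. The type and mode names mirror the Swarm code, but the snippet is an illustrative sketch, not the real implementation; in particular, mapping the retrieval message to ModePutRequest is an assumption.

```go
package main

import "fmt"

// PutMode mirrors the chunk.ModePut* constants.
type PutMode int

const (
	ModePutRequest PutMode = iota
	ModePutSync
)

// Pared-down message types; the real ones live in the stream package.
type ChunkDeliveryMsg struct{ Addr, SData []byte }
type ChunkDeliveryMsgRetrieval ChunkDeliveryMsg
type ChunkDeliveryMsgSyncing ChunkDeliveryMsg

// putMode picks the localstore put mode from the concrete message type. The
// plain *ChunkDeliveryMsg case is the temporary catch-all described above,
// added only because one test sends the untyped message.
func putMode(r interface{}) (PutMode, bool) {
	switch r.(type) {
	case *ChunkDeliveryMsgRetrieval:
		return ModePutRequest, true
	case *ChunkDeliveryMsgSyncing:
		return ModePutSync, true
	case *ChunkDeliveryMsg:
		return ModePutSync, true // temporary: only hit by tests
	}
	return 0, false
}

func main() {
	mode, ok := putMode(&ChunkDeliveryMsgSyncing{})
	fmt.Println(mode == ModePutSync, ok) // true true
}
```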

)
streamer.delivery.RequestFromPeers(ctx, req)

stream := NewStream(swarmChunkServerStreamName, "", true)
Contributor Author

SwarmChunkServer doesn't exist anymore, and RETRIEVE_REQUEST doesn't exist anymore.

Also, this is a huge test which only checks that when we issue a request to RequestFromPeers, we get specific messages back: a SubscribeMsg for the RETRIEVE_REQUEST stream and a RetrieveRequestMsg. This is rather useless for two reasons:

  1. We are locking in behaviour that doesn't provide much value.
  2. We are testing very basic functionality, which is hit 1000 times in a single integration test.

There is no need to lock in this behaviour in such a complicated way. Not to mention that the RETRIEVE_REQUEST stream has served no purpose in any production code for a long time.

Code: 6,
Msg: &ChunkDeliveryMsg{
	Addr:  hash,
	SData: hash,
Contributor Author

How that worked before is beyond me... Don't we have any validation for ChunkDeliveryMsg?

Member

further up in localstore I guess there is, just not on the msg level

Contributor Author

True, there is validation in localstore; I just found it confusing here. Now that you mention it, it makes sense.

Member

@janos janos left a comment

There is a lot of work in this PR, thanks Anton.

return pstreams
}

func (api *API) GetPeerClientSubscriptions() map[string][]string {
Member

Maybe write a test for this function as well?

Contributor Author

This function is trivial, just like GetPeerServerSubscriptions. I don't think we need tests for it; the chances that someone will refactor it and break it are so slim that I'd rather we not add 50 lines of tests like for the other one.

Also, this is not an API call that is part of critical functionality; it is here for debugging purposes.

If we add a unit test for this, we will probably have to remove it once we review the Stream implementation, something that has been under discussion over the last weeks.

Contributor Author

Actually it is not 50 lines but much more; we also have TestGetServerSubscriptionsRPC.

Contributor Author

TestGetServerSubscriptionsRPC depends on:

  • simulations package
  • adapters
  • netStore
  • delivery
  • spec for SubscribeMsg
  • json snapshots
  • exact naming of the function, not checked by the compiler but dynamically linked

I'd rather we don't write such low-value tests for such trivial functionalities.

Member

Well, the tools that can be used to test the code should not be relevant to the decision of whether the code should be tested. If this code exists and somebody refactors the stream package, I suppose it would be nice to have a test to verify that there is no regression. I would not block this PR over not testing this code, if it is not an issue for others.

Contributor Author

@janos I strongly disagree with this.

If you are using 10 libraries to test a for loop that adds values to a slice, you are increasing the complexity of the codebase for no good reason.

Now that these tests are part of the codebase, every time we change any of those 10 libraries we also have to modify the tests. This is a development cost we should not ignore. The benefit of this test is that we can be sure we have not broken the for loop that adds values to a slice; rather low value, in my opinion.
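For a sense of scale, the function under discussion boils down to something like the sketch below. The Stream type and the map shapes are approximations, not the real stream package declarations.

```go
package main

import (
	"fmt"
	"sort"
)

// Stream is a pared-down stand-in for the stream package's Stream type.
type Stream struct {
	Name string
	Key  string
}

func (s Stream) String() string { return s.Name + "|" + s.Key }

// clientSubscriptions collects, per peer ID, the names of the streams the
// peer has client subscriptions for - essentially a loop appending values
// to a slice.
func clientSubscriptions(peers map[string][]Stream) map[string][]string {
	out := make(map[string][]string, len(peers))
	for id, streams := range peers {
		for _, s := range streams {
			out[id] = append(out[id], s.String())
		}
		sort.Strings(out[id]) // deterministic output for callers
	}
	return out
}

func main() {
	subs := clientSubscriptions(map[string][]Stream{
		"peer-1": {{Name: "SYNC", Key: "1"}},
	})
	fmt.Println(subs["peer-1"]) // [SYNC|1]
}
```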

Member

Thanks for the explanations. I will not insist on this test.

Contributor Author

@janos giving a bit more thought to this particular piece of code: the sole purpose of this API call is to be a helper method for debugging the already broken subscriptions implementation.

I'd rather we focus on solving the bugs in the subscriptions implementation (without commenting on what this involves: a rewrite, a bug fix or whatever) than write tests for the helper methods we add to the codebase to help us track the bigger issues in Swarm.

This feels like adding a test for a metrics counter or timer to me.

So if we insist on having a test for this API (which, judging by your comment above, you don't), I'd rather just remove it altogether and find another way to debug the subscriptions functionality.

Member

@janos janos left a comment

Thanks for addressing my comments, Anton.

@nonsense
Contributor Author

@janos thanks for the thorough review and for putting up with my rushed PR!

Member

@zelig zelig left a comment

are you sure light node functionality does not regress here?

changes here are maybe too big, too hasty?

// TODO: may need to implement protocol drop only? don't want to kick off the peer
// if they are useful for other protocols
-func (p *Peer) Drop(err error) {
+func (p *Peer) Drop() {
Member

Can we please preserve the error and log it, or even put it in p2p.DiscSubprotocolError? This is losing too much contextual info.

Contributor

i agree this could be converted to:

func (p *Peer) Drop(err error) {
	if err == nil {
		err = p2p.DiscSubprotocolError
	}
	p.Disconnect(err)
}

Contributor Author

We could log the error, but I'd rather not propagate it down to p2p and replace DiscSubprotocolError just now, as that changes behaviour, and we already have too many changes.

The current change preserves the behavior we already have.

Contributor Author

@justelad I am not changing this behavior in this PR either; let's keep it as small as possible.

Member

@zelig zelig left a comment

Is removing the retrieve request stream justified just because skipCheck is true in current production code? There was a reason why skipCheck was a (tested) option for the retrieve request stream: we wanted to check whether, in some scenarios, chunk delivery for retrieve requests should go via the offered/wanted hashes roundtrip. While this may introduce latency, without it multiple successful requests will lead to multiple deliveries.

At this point I am leaning towards handling the latter issue with request cancellation (#409), which also has the chance to cancel requests on all forwarding nodes.

In light of these changes, the question arises: why are retrieve requests even part of the stream package and stream protocol? It would make sense to remove request handling from stream entirely, put it under network, and make it part of the bzz protocol directly.

@nonsense
Contributor Author

nonsense commented Apr 12, 2019

@zelig good question. As a general rule of thumb, I think code that is not used by users, but is only hit in unit tests, has no place in the codebase.

Currently we have:

  • RETRIEVE_REQUEST stream (retrieve requests handled through an offered/wanted hashes flow?),
  • SwarmChunkServer
  • skipCheck
  • priority queue
  • TakeoverProof

None of this is necessary for core Swarm functionality; it increases the carrying cost of development and makes the code very difficult to reason about. This is the primary reason we have had bugs in the simplest use-cases of Swarm, namely uploading a file and trying to fetch it from a different node in the network.

To summarize, these are the bugs we've found during this effort that were 'hidden' by this unused functionality:

  • blocking TCP sockets, hidden behind peer.Drop() and behind priority-queue sends.
  • an order of magnitude more ChunkDelivery messages, because the offered-hashes/wanted-hashes protocol does not keep track of them.
  • (not a known bug, but a very convoluted implementation of the default RetrieveRequest flow.)
  • the impossibility of adding instrumentation and seeing Swarm behavior in a real network, due to priority queues and convoluted use of contexts.

Because of all this I think it is justified to remove this dead code.


Regarding the light client: it is possible that this breaks a very small part of it, but fixing that should be rather simple.

@nonsense
Contributor Author

In the simplified version of the code, RetrieveRequests have nothing to do with Streams. The only thing they share is the Fetchers, so that we have only one request per chunk, no matter whether that request was made through a RetrieveRequest or through syncing.

You have a point that this doesn't even belong in this package, but I have a hard time getting even simpler changes in, so I'd rather we change this gradually.
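The "one request per chunk" sharing can be illustrated with a once-per-address table. This is a sketch of the idea only, not the actual Fetcher implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchers ensures a single in-flight request per chunk address, regardless
// of whether the chunk was asked for via a RetrieveRequest or via syncing.
type fetchers struct {
	mu sync.Mutex
	m  map[string]*sync.Once // keyed by chunk address (a hex string here)
}

func newFetchers() *fetchers {
	return &fetchers{m: make(map[string]*sync.Once)}
}

// fetch runs request at most once per address; concurrent callers asking for
// the same address share that single request.
func (f *fetchers) fetch(addr string, request func()) {
	f.mu.Lock()
	once, ok := f.m[addr]
	if !ok {
		once = new(sync.Once)
		f.m[addr] = once
	}
	f.mu.Unlock()
	once.Do(request)
}

func main() {
	f := newFetchers()
	calls := 0
	f.fetch("chunk-addr", func() { calls++ }) // triggers the request
	f.fetch("chunk-addr", func() { calls++ }) // deduplicated
	fmt.Println(calls) // 1
}
```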

@janos
Member

janos commented Apr 12, 2019

Moving retrieve requests from the stream package is a very good idea. It would just need a different protocol as it is now part of the stream protocol specification.

@nonsense
Contributor Author

Yes, @janos, that's correct. Let's postpone this until we have all the known bug fixes integrated in the codebase, together with a solution for the subscription bug. Refactoring RetrieveRequests out of the Stream protocol is rather low priority right now, in my opinion.

@janos
Member

janos commented Apr 12, 2019

I agree, @nonsense.

Contributor

@acud acud left a comment

@nonsense great work. I left a few minor comments.


// This test checks the default behavior of the server, that is
// when it is serving Retrieve requests.
func TestLigthnodeRetrieveRequestWithRetrieve(t *testing.T) {
Contributor

I think that registryOptions is supposed to indicate that this is light-node behavior, but I see what you mean; it is not so clear what is being tested here.

-return p.handleOfferedHashesMsg(ctx, msg)
+go func() {
+	err := p.handleOfferedHashesMsg(ctx, msg)
+	if err != nil {
Contributor

I am good with leaving these functions as they are right now, for clarity. If we keep the error (and actually use it) in peer.Drop, we could drop the log, because the error will be automatically logged on the peer drop.

@nonsense
Contributor Author

There are many discussions about peer.Drop() in this PR. I want to highlight that we are failing to get Swarm working even in the most basic use-case, where the whole network consists of trusted nodes that adhere to the protocols.

Adding peer.Drop() at various locations and reconnecting to the same peer right after is already broken and wrong. If the peer is misbehaving, why are we re-connecting to it right after we drop it?

Additionally, peer.Drop() is a very insidious way to cover for deadlocks and logical problems in the protocols: if a peer is misbehaving (read: deadlocked) and it times out, we just disconnect, reconnect and move on, essentially solving the deadlock at the expense of a timeout.

We should first make sure that Swarm works well in a controlled environment, and then add tests that cover misbehaving peers. We are currently seeing what happens when you try to do both of these things simultaneously.

@acud
Contributor

acud commented Apr 12, 2019

OK, I see where you're going with this, and I tend to agree. I never understood why we need to drop a peer (the p2p package negotiation should already cover protocol version discrepancies) without actually maintaining a peer blacklist (whether a decaying one to mitigate DoS vectors, or a permanent one for litigation or other cryptoeconomic violations).

@nonsense
Contributor Author

@justelad yes, dropping peers without maintaining a blacklist doesn't make sense to me either. Maybe a node is misbehaving because it has a temporary network problem; that could be a reason to ignore it for a while but try it again later.

Bottom line: this functionality needs more work (backoff, blacklists, etc.), and in its current form it is more harmful than useful.

@zelig
Member

zelig commented Apr 12, 2019

@nonsense i must say i agree with everything you responded. Great job...
I will keep insisting on certain bits that fall under 'feature regression', but it will be much nicer to add them back and check their impact, or prove them pointless :)

@nonsense
Contributor Author

The PR has two approvals, so I am merging for now. This is not going to master in the foreseeable future, so we have plenty of time to decide whether some of the dead code is actually not dead but useful (I doubt it). One thing to review is the light client, which AFAIK has 0 integration tests... it is probably worth running the binary with the --lightnode flag just to see whether it works.

However, I think it is more important to integrate the rest of the bug fixes, so that we have a more stable full Swarm node, and only then worry about the rest.

@nonsense nonsense merged commit 6ad721e into ethersphere:swarm-rather-stable Apr 12, 2019
@nonsense
Contributor Author

nonsense commented Apr 12, 2019

@zelig about the pointlessness of the features: the latencies we have on the simplify-fetchers branch are much lower than those on master, so this in itself is proof that we are going in the right direction.

I very much welcome any improvements on those latencies, whether through a benchmark test, an experiment between two deployments, or some other way, but definitely not ones based purely on intuition, unless the argument is easy enough to win over everyone on the team.

nonsense added a commit that referenced this pull request May 10, 2019
cmd/swarm/swarm-smoke: improve smoke tests (#1337)

swarm/network: remove dead code (#1339)

swarm/network: remove FetchStore and SyncChunkStore in favor of NetStore (#1342)