Skip to content

Conversation

@bholley
Copy link
Contributor

@bholley bholley commented May 2, 2017

This is what Gecko does [1]. If I understand correctly, an Acquire load on
an address just needs to synchronize with other Release operations on that
specific address, whereas an Acquire fence needs to synchronize globally
with all other Release operations in the program.

[1] http://searchfox.org/mozilla-central/rev/ae8c2e2354db652950fe0ec16983360c21857f2a/xpcom/base/nsISupportsImpl.h#337

This is what Gecko does [1]. If I understand correctly, an Acquire load on
an address just needs to synchronize with other Release operations on that
specific address, whereas an Acquire fence needs to synchronize globally
with all other Release operations in the program.

[1] http://searchfox.org/mozilla-central/rev/ae8c2e2354db652950fe0ec16983360c21857f2a/xpcom/base/nsISupportsImpl.h#337
@rust-highfive
Copy link
Contributor

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@bholley
Copy link
Contributor Author

bholley commented May 2, 2017

It's entirely possible that I have this wrong somehow, but if so I'd love to understand why.

r? @aturon

@rust-highfive rust-highfive assigned aturon and unassigned alexcrichton May 2, 2017
@bholley
Copy link
Contributor Author

bholley commented May 2, 2017

Also added a long comment explaining why we need the Acquire/Release handshake.

CC @eddyb @golddranks @Manishearth @emilio @SimonSapin

@alexcrichton
Copy link
Member

Note that there's another fence for the weak count which may matter here as well? Could you benchmark to see what the difference is?

@bholley
Copy link
Contributor Author

bholley commented May 2, 2017

Note that there's another fence for the weak count which may matter here as well?

Yeah, that should probably be a load as well I think.

Could you benchmark to see what the difference is?

I think a good benchmark will be hard to construct here, because it depends entirely on the cache line breakdown and what the other CPUs happen to be doing at the same time. This patch just gives the CPU more degrees of freedom in contended situations. If there's no contention, I'd expect there to be no difference.

@aturon
Copy link
Contributor

aturon commented May 2, 2017

@bholley I'm curious, though, if this change is motivated by some direct performance problems or benchmarking you've done? I ask just because this kind of change will entail some pretty intensive reviewing, and the code right now at least has solid lineage back through clang, IIRC.

@bholley
Copy link
Contributor Author

bholley commented May 2, 2017

Nope - I was just forking Arc for other reasons and came across this.

If the answer is that we think the risk/reward here isn't worth it for now, that's fine. We should at least check in the comment though.

@golddranks
Copy link
Contributor

I don't think you meant to tag me? (Which means someone may be missing?)

@bholley
Copy link
Contributor Author

bholley commented May 3, 2017

Yes, I meant @gankro, sorry.

//
// Note that Rust is not C++, and so it's valid to ask whether we could
// do without the Acquire/Release handshake here. The contents of an Arc
// are immutable, and thus we could hypothetically assume that there are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contents aren't immutable, due to the potential for interior mutability (through UnsafeCell or the atomics you mention below). When thinking about this fence in the past, I'd always considered it as ensuring that any outstanding interior non-Relaxed writes are visible prior to the destructor running.

I do agree that allocators in general are virtually guaranteed to have enough fencing to at least avoid relaxed writes landing after the memory is transferred to another thread. But I worry about people relying on ordering with respect to the destructor code running.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When thinking about this fence in the past, I'd always considered it as ensuring that
any outstanding interior non-Relaxed writes are visible prior to the destructor running.

Oh I see. I'd been thinking of such writes as being either atomic (in which case they're governed by their own memory ordering) or protected by something like a Mutex, which have their own synchronization. But I realize now that the destructor doesn't actually acquire the mutex before dropping the contents, and so we can't rely on its internal synchronization to flush the writes.

@bholley
Copy link
Contributor Author

bholley commented May 3, 2017

Let me know what you want to do about the fence. If it's not worth the risk of fixing (at least right now), I can clean up the comment to take the above discussion into account and land it separately.

In general though, I do think it's worth fixing these kinds of potential performance issues when we find them. An overly conservative fence in Arc is exactly the sort of thing that can cause bad performance under high system load, but would never be discovered by profiling because (a) it's hard to reproduce those conditions reliably and (b) there are very few people, even in the Rust community, who are comfortable reasoning about memory orderings.

@nagisa
Copy link
Member

nagisa commented May 3, 2017

cc @Amanieu

@aturon
Copy link
Contributor

aturon commented May 3, 2017

@bholley

I agree that it's worth exploring these kinds of fixes, but they are relatively high risk and cost, so I'd like to have some evidence that they make a difference (beyond purely hypothetical performance reasoning).

That's a standard we've been trying to apply across std, since we receive a lot of PRs with "implementation speedups" that don't come with any benchmarks, and make the code more complex.

In short, my feeling is that, absent concrete evidence of a performance improvement here, I'd prefer not to pursue it at this time.

(Thanks, though, for the PR!)

@Amanieu
Copy link
Member

Amanieu commented May 3, 2017

I think the comments about the allocator are misleading. The purpose of the fence is to ensure that any operations performed by other threads on the shared data happen before the Drop impl of the shared data.

I don't think there is much performance difference between an acquire fence and an acquire load:

  • ARM:
    • Acquire fence: dmb ishld (ARMv8) dmb ish (ARMv7)
    • Acquire load: ldaex (ARMv8) ldr; dmb ish (ARMv7)
  • AArch64:
    • Acquire fence: dmb ishld
    • Acquire load: ldar
  • x86/x86-64:
    • Acquire fence: (none)
    • Acquire load: mov

So basically, with a fence you avoid an extra load. Also, as far as I know, it is perfectly valid to have a fence synchronize with a atomic operation.

@aidanhs aidanhs added the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label May 3, 2017
@aturon
Copy link
Contributor

aturon commented May 3, 2017

I've approved the pure-docs PR, and am going to close this one. Thanks again @bholley!

@aturon aturon closed this May 3, 2017
frewsxcv added a commit to frewsxcv/rust that referenced this pull request May 3, 2017
Document the reasoning for the Acquire/Release handshake when dropping Arcs.

Split out from rust-lang#41714. r? @aturon
@bholley
Copy link
Contributor Author

bholley commented May 5, 2017

@Amanieu interesting! I'd always thought that acquire/release fences imposed a greater restriction on the compiler and CPU than acquire/release loads/stores, since the latter are only enforced when the two threads are operating on the same memory address. So, for example, a CPU would need to synchronously flush the cache lines if core X released mutex A and then core Y acquired mutex A, but would not need to do so if core Y acquired mutex B instead.

Do I have that wrong, or does the C++11 memory model allow for a wider range of optimization than current compilers and CPUs actually use?

@Gankra
Copy link
Contributor

Gankra commented May 5, 2017

The C11 model is engineered to support things like the DEC Alpha, so yeah modern hardware doesn't take full advantage of it. Making sloppy code work and fast is kinda the job of hardware devs.

@nagisa
Copy link
Member

nagisa commented May 5, 2017

Making sloppy code work and fast is kinda the job of hardware devs.

Cough… compiler devs come before that.

@mzabaluev
Copy link
Contributor

mzabaluev commented Jun 29, 2019

Comment withdrawn, I misunderstood the spec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants