Skip to content

Conversation

@frengor
Copy link
Contributor

@frengor frengor commented Jan 16, 2022

This is a PR for adding Iterator::collect_into and Iterator::collect_with as proposed by @cormacrelf in #48597 (see #48597 (comment)).

This adds the following methods to the Iterator trait:

fn collect_into<E: Extend<Self::Item>>(self, collection: &mut E) -> &mut E;

fn collect_with<E: Extend<Self::Item>>(self, mut collection: E) -> E;

///
/// let doubled = a.iter()
/// .map(|&x| x * 2)
/// .collect_with(Vec::new());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this sets a bad example since it uses Extend and we have separate code paths for FromIterator which are optimized for the fact that we're starting with an empty vec. It's kind of like doing

let mut vec = Vec::new()
vec.extend(iter);
vec

Which is a silly thing to do when you can just use iter.collect()

For that reason I think the name also is suboptimal since it associates itself with collect() and not extend().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I put Vec::with_capacity(3) instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size_hint already takes care of that in many cases, so it would only help in cases where the size hint of the iterator isn't accurate (e.g. when it contains a filter). Or maybe when collecting into a collection with a custom allocator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so if I understood correctly, I should use a filter (or some other adapter) in the examples that makes size_hint not provide any useful info for iter.collect() right?
I could also mention that usually iter.collect() is preferred when size_hint is accurate

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor

@cormacrelf cormacrelf Jan 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lack of optimisation for the empty case is a bit of a worry in terms of whether collect_with is a good API at all. Wouldn't all of the good usages be better covered by the mutable collect_into? I had envisioned we could use .collect_with(Vec::new()) to replace the turbofish, but it seems that is almost never a good idea, or at least only applicable in certain circumstances. Now we would be creating a more ergonomic API whose use we would have to immediately discourage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collect_with can be very useful in order to avoid reallocation(s) when size_hint returns the default value.

I had envisioned we could use .collect_with(Vec::new()) to replace the turbofish, but it seems that is almost never a good idea, or at least only applicable in certain circumstances.

I agree that it is almost never a good idea. I also like more the turbofish syntax where applicable.

My idea for collect_with was a method to be used when size_hint is not helpful and the collection (in which the items are collected) does not already exists. This allows shorter code:

let len1 = vec![1, 2, 3]
    .into_iter()
    .filter(|&x| x < 3)
    .collect_with(Vec::with_capacity(2))
    .len();

// Instead of
let mut vec = Vec::with_capacity(2);
let len2 = vec![1, 2, 3]
    .into_iter()
    .filter(|&x| x < 3)
    .collect_into(&mut vec)
    .len();

I think we could specify it in the doc. However, someone could still misuse the method. Calling Extend::extend_reserve should, at least, fix the speed gap between collect() and misused collect_with. (not already implemented)

Eventually, I realised collect_with can be implemented with a call to collect_into, should it be a better implementation?

@the8472
Copy link
Member

the8472 commented Jan 16, 2022

r? rust-lang/libs

@djc
Copy link
Contributor

djc commented Jan 17, 2022

Also since the original suggestion in #48597 we pulled FromIterator into the prelude for the 2021 edition, meaning you can now spell .collect<Vec<_>>() as Vec::from_iter(). Therefore I think these methods no longer make as much sense as they did 3 years ago.

@frengor
Copy link
Contributor Author

frengor commented Jan 17, 2022

collect_into and collect_with can be very useful in order to avoid reallocation(s) when size_hint returns the default value, like with filter.

Also, since FromIterator can be called using iterator.collect(), I think there might be space for a method (like collect_into) which allows Extend to be called from the iterator.
collect_with can be used to avoid a variable definition, making code shorter (see #92982 (comment)):

let len1 = vec![1, 2, 3]
    .into_iter()
    .filter(|&x| x < 3)
    .collect_with(Vec::with_capacity(2))
    .len();

// Instead of
let mut vec = Vec::with_capacity(2);
let len2 = vec![1, 2, 3]
    .into_iter()
    .filter(|&x| x < 3)
    .collect_into(&mut vec)
    .len();

@cormacrelf
Copy link
Contributor

cormacrelf commented Jan 17, 2022

You can now spell .collect<Vec<_>>() as Vec::from_iter(). Therefore I think these methods no longer make as much sense as they did 3 years ago.

"Now"? "No longer"? Both of those APIs have always been there, starting in Rust 1.0.0. FromIterator::from_iter is awkward because iterator call chains tend to be long, that's why collect has always existed. I'm not sure what else you could be referring to from the last 3 years.

Sorry, just read the full comment again, the problem with Vec::from_iter() is that must be wrapped around a long iterator chain or used via a temporary variable. Some people might find this an improvement over the turbofish, some not. Nevertheless, turbofish collect is only one of the three things being addressed by these two methods.

There are at least three things going on here:

  1. Collecting into &mut existing collections in order to reuse storage. This was the original purpose of the original PR that went too far down the road implementing things on &mut T. I think this need is addressed perfectly by collect_into. We've always had Extend::extend and it is not convenient when you have an iterator chain to append to an existing collection; you must store the iterator in a variable. collect_into is unquestionably an improvement in my opinion. It allows you to reuse allocations with exactly the ease of creating a new one. We should have added it years ago.

  2. Mitigating turbofish difficulty. Collect has always been difficult for beginners to use and experts to type because of the turbofish. I doubt we can improve on this TBH, see my comment on the code review after it was noted that collect_with(Vec::new()) is not (and without another trait bound, cannot be) optimised for empty collections, it will do incremental powers-of-two resizes (O(log n)) when one exact size allocation would suffice (O(1)). Maybe turbofish is as good as this ever gets.

  3. Pre-allocating storage with an appropriate size for unknown size iterators (like filtered ones). I don't think collect_with is doing a very good job here, mostly because of the fact that people can and will also use it with empty (specifically zero capacity) ones and get worse performance than collect::<_>(). Maybe we just need another iterator adapter that allows you to supply more information through the various size traits to the FromIterator implementation. Like iter.size_hint_min(100) or something. Then you'd need to reassure folks with documentation that this would be roughly as good as let mut v = Vec::with_capacity(100); iter.collect_into(&mut v);. Alternatively you could just let people use that construct because it seems fine for the niche use case and collect_into is awesome and less likely to be used with empty collections.

I think number 1 is done, number 2 is very difficult, and number 3 we can think of a better way than collect_with. The two methods are severable so I would be happier to merge collect_into alone, and look at 2 and 3 but primarily 3/pre-allocating filtered collect operations separately.

@the8472
Copy link
Member

the8472 commented Jan 17, 2022

@frengor

let len1 = vec![1, 2, 3]
    .into_iter()
    .filter(|&x| x < 3)
    .collect_with(Vec::with_capacity(2))
    .len();

I think that's a very poor motivating example because it could be replaced with a count() without intermediate vec.
And you likely have to introduce additional local vars anyway because otherwise you drop the owned vec on the floor as soon as you pass it to a by-ref method.

@cormacrelf

it was noted that collect_with(Vec::new()) is not (and without another trait bound, cannot be) optimised for empty collections, it will do incremental powers-of-two resizes (O(log n)) when one exact size allocation would suffice (O(1)).

It's not quite that bad. Vec's Extend implementation does make use of the size hint. The issue is that we're losing some other optimizations such as allocation recycling.

Mitigating turbofish difficulty. Collect has always been difficult for beginners to use and experts to type because of the turbofish.

I don't think difficult is the right word here, I already have muscle memory to type ::<> to create paired brackets then left-arrow and fill in whatever type hint is needed. Editors could be smarter about this and offer better autocompletion. The comparison operators are not an obstacle because the ::< exists to disambiguate that very issue. And there also is the alternative of doing

let v: Vec<u8> = iter.collect();

And sometimes you need the more specific type hint anyway because it can't infer the exact type that goes into the vec.

Maybe we just need another iterator adapter that allows you to supply more information through the various size traits to the FromIterator implementation. Like iter.size_hint_min(100) or something.

Providing the wrong size hint is considered a bug, which can cause implementations to misbehave (just not in unsound ways). So we'd have to verify correctness during iteration. And if we're dealing with a situation where the iterator length isn't known then these checks probably can't be optimized out either because the optimizer can't reason about the count.

@cormacrelf
Copy link
Contributor

What is the mechanism by which you can recycle an allocation in Vec::from_iter? I presume you mean from an IntoIter. Moreover is that ability something you can bless collect_with with?

@the8472
Copy link
Member

the8472 commented Jan 17, 2022

What is the mechanism by which you can recycle an allocation in Vec::from_iter? I presume you mean from an IntoIter.

Specialization, RFC 1210. In this case it's applied to impl FromIterator for Vec. And yes, for an IntoIter source.

Moreover is that ability something you can bless collect_with with?

Well, since it uses Extend we'd have to do a runtime check to see if the length is zero and then go through the FromIterator path instead, but then we have both the Extend and FromIterator code paths, which costs compile time. Directly using collect or from_iter when applicable is preferable.

So it's a tradeoff between ergonomics, compiletime and runtime performance.

@cormacrelf
Copy link
Contributor

cormacrelf commented Jan 17, 2022

Oh right, only when the iterator actually is IntoIter. Not after you do arbitrary transformations on it.

Couldn't you just have a custom implementation of Iterator::collect_with for IntoIter? It doesn't even need specialisation. It's just a default trait method with an override. There is a generic on there but at least IntoIter can be the only thing with any kind of extra checks, some of which can be done at compile time.

@the8472
Copy link
Member

the8472 commented Jan 17, 2022

Oh right, only when the iterator actually is IntoIter. Not after you do arbitrary transformations on it.

Well, not arbitrary transformations, but ...

#[test]
fn test_from_iter_specialization_with_iterator_adapters() {
fn assert_in_place_trait<T: InPlaceIterable>(_: &T) {}
let src: Vec<usize> = vec![0usize; 256];
let srcptr = src.as_ptr();
let iter = src
.into_iter()
.enumerate()
.map(|i| i.0 + i.1)
.zip(std::iter::repeat(1usize))
.map(|(a, b)| a + b)
.map_while(Option::Some)
.skip(1)
.map(|e| if e != usize::MAX { Ok(std::num::NonZeroUsize::new(e)) } else { Err(()) });
assert_in_place_trait(&iter);
let sink = iter.collect::<Result<Vec<_>, _>>().unwrap();
let sinkptr = sink.as_ptr();
assert_eq!(srcptr, sinkptr as *const usize);
}

@frengor
Copy link
Contributor Author

frengor commented Jan 17, 2022

@the8472

I think that's a very poor motivating example because it could be replaced with a count() without intermediate vec.

The point I wanted to show was that a variable definition can be avoided using collect_with, I didn't looked too much at the code, my fault.
A more appropriate example can be:

let filtered_vec = (0..50)
    .filter(|&x| x < 40)
    .collect_with(Vec::with_capacity(40));

// Instead of
let mut vec = Vec::with_capacity(40);
let filtered_vec2 = (0..50)
    .filter(|&x| x < 40)
    .collect_into(&mut vec);

@the8472
Copy link
Member

the8472 commented Jan 17, 2022

That's just 1 line more, and you wouldn't need the 2nd let since vec remains valid. And you already avoided the type annotation.

@frengor
Copy link
Contributor Author

frengor commented Jan 17, 2022

That's just 1 line more, and you wouldn't need the 2nd let since vec remains valid. And you already avoided the type annotation.

This is true. I still think it is slightly more readable with collect_with, especially with very long chains. However, I don't know if one line really matters when you already have so many lines.

@cormacrelf
Copy link
Contributor

So how do we feel about splitting this and merging collect_into first? I don't see any objections to it, the only questions here pertain to collect_with. Having the mutable version would be a big win on its own, we could be proud of that, and it might attract a bit more attention to the other concerns.

@scottmcm
Copy link
Member

I'm definitely skeptical of having both, TBH. If you think that one of them is more important, I definitely think that starting with that one is better.

And I agree that .collect_into(&mut is the more important one.

@frengor
Copy link
Contributor Author

frengor commented Jan 18, 2022

I think I'm going to do it. Is it better to do a commit on this PR or a new PR?

@cormacrelf
Copy link
Contributor

cormacrelf commented Jan 18, 2022

Here is OK but would need to edit the PR's title and introductory comment because that gets copied around by the bots to various merge commits and summaries. Also for Rust it's generally best to force push a clean commit history, in this case one commit with just the one method. GitHub keeps old refs but I'm not sure it stores them forever, so maybe a new PR is best for preservation. It's up to you basically.

@scottmcm
Copy link
Member

@frengor Up to you which you think is more helpful for the reviewer. If the conversation here is still valuable, might as well change this. (I don't see any comments from josh, so you could probably even just force-push it to a single new commit.) If the conversation here is obviated by the new approach, then might as well close this one and make a new PR to clean things up.

@frengor
Copy link
Contributor Author

frengor commented Jan 19, 2022

All the concerns are about collect_with, so I think it's better to do a new PR to keep things clean. I'll obviously reference this PR.

@cormacrelf
Copy link
Contributor

Thanks @frengor, great work, keep it up!

Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request Mar 9, 2022
Add Iterator::collect_into

This PR adds `Iterator::collect_into` as proposed by `@cormacrelf` in rust-lang#48597 (see rust-lang#48597 (comment)).
Followup of rust-lang#92982.

This adds the following method to the Iterator trait:

```rust
fn collect_into<E: Extend<Self::Item>>(self, collection: &mut E) -> &mut E
```
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Mar 9, 2022
Add Iterator::collect_into

This PR adds `Iterator::collect_into` as proposed by ``@cormacrelf`` in rust-lang#48597 (see rust-lang#48597 (comment)).
Followup of rust-lang#92982.

This adds the following method to the Iterator trait:

```rust
fn collect_into<E: Extend<Self::Item>>(self, collection: &mut E) -> &mut E
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants