Skip to content

Conversation

notriddle
Copy link
Contributor

@notriddle notriddle commented Nov 13, 2024

Potentially #131156 — need to try reproducing the problem with windows

Preview and profiler results

Here's some quick profiling in Firefox done on the rust compiler docs:

Here's the results for the node.js profiler:

Here's a copy that you can use to try it out. Compare it with the nightly. Try typing typecheckercontext one character at a time, slowly.

The fuzzy match algo is based on Fast String Correction with Levenshtein-Automata and the corresponding implementation code in moman and Lucene; the bit-packing representation comes from Lucene, but the actual matcher is more based on fsc.py. As suggested in the paper, a trie is used to represent the FSA dictionary.

The same trie is used for prefix matching. Substring matching is done with a side table of three-character1 windows that point into the trie.

User-visible changes

I don't expect anybody to notice anything, but it does cause two changes:

  • Substring matches, in the middle of a name, only apply if there's three or more characters in the search query.
  • Levenshtein distance limit now maxes out at two. In the old version, the limit was w/3, so you could get looser matches for queries with 9 or more characters1 in them.
  • It uses more RAM.
  • It's faster (assuming you don't swap thrash).

Footnotes

  1. technically utf-16 code units 2

Preview and profiler results
----------------------------

Here's some quick profiling in Firefox done on the rust compiler docs:

- Before: https://share.firefox.dev/3UPm3M8
- After: https://share.firefox.dev/40LXvYb

Here's the results for the node.js profiler:

- https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html

Here's a copy that you can use to try it out. Compare it with [the nightly].
Try typing `typecheckercontext` one character at a time, slowly.

- https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html

[the nightly]: https://doc.rust-lang.org/nightly/nightly-rustc/

The fuzzy match algo is based on [Fast String Correction with
Levenshtein-Automata] and the corresponding implementation code in [moman]
and [Lucene]; the bit-packing representation comes from Lucene, but the
actual matcher is more based on `fsc.py`. As suggested in the paper, a
trie is used to represent the FSA dictionary.

The same trie is used for prefix matching. Substring matching is done with a
side table of three-character[^1] windows that point into the trie.

[Fast String Correction with Levenshtein-Automata]: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf
[Lucene]: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java
[moman]: https://gitlab.com/notriddle/moman-rustdoc

User-visible changes
--------------------

I don't expect anybody to notice anything, but it does cause two changes:

- Substring matches, in the middle of a name, only apply if there's three
  or more characters in the search query.
- Levenshtein distance limit now maxes out at two. In the old version,
  the limit was w/3, so you could get looser matches for queries with
  9 or more characters[^1] in them.

[^1]: technically utf-16 code units
@rustbot
Copy link
Collaborator

rustbot commented Nov 13, 2024

r? @fmease

rustbot has assigned @fmease.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Nov 13, 2024
@notriddle
Copy link
Contributor Author

r? @GuillaumeGomez

@rustbot rustbot assigned GuillaumeGomez and unassigned fmease Nov 13, 2024
@notriddle notriddle added A-rustdoc-search Area: Rustdoc's search feature T-rustdoc-frontend Relevant to the rustdoc-frontend team, which will review and decide on the web UI/UX output. labels Nov 13, 2024
} else {
const sb = name.charCodeAt(substart);
let child;
if (this.children[sb] !== undefined) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to check this.children.length < sb?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a sparse array.

Copy link
Member

@GuillaumeGomez GuillaumeGomez Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this explanation and link on the field definition please. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, it's added.

@GuillaumeGomez
Copy link
Member

Apart from small nits, looks good to me. Performance improvement is really impressive!

@rustbot
Copy link
Collaborator

rustbot commented Nov 13, 2024

Some changes occurred in HTML/CSS/JS.

cc @GuillaumeGomez, @jsha

@GuillaumeGomez
Copy link
Member

Thanks!

@bors r+

@bors
Copy link
Collaborator

bors commented Nov 14, 2024

📌 Commit e534f47 has been approved by GuillaumeGomez

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 14, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 14, 2024
…llaumeGomez

Rollup of 5 pull requests

Successful merges:

 - rust-lang#132172 (borrowck diagnostics: suggest borrowing function inputs in generic positions)
 - rust-lang#132649 (add ./x clippy ci)
 - rust-lang#133005 (rustdoc: use a trie for name-based search)
 - rust-lang#133034 (update download-rustc comments and default)
 - rust-lang#133036 (add myself into `users_on_vacation` on triagebot)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit fc7ca70 into rust-lang:master Nov 14, 2024
6 checks passed
@rustbot rustbot added this to the 1.84.0 milestone Nov 14, 2024
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Nov 14, 2024
Rollup merge of rust-lang#133005 - notriddle:notriddle/trie-search, r=GuillaumeGomez

rustdoc: use a trie for name-based search

Potentially rust-lang#131156 — need to try reproducing the problem with `windows`

Preview and profiler results
----------------------------

Here's some quick profiling in Firefox done on the rust compiler docs:

- Before: https://share.firefox.dev/3UPm3M8
- After: https://share.firefox.dev/40LXvYb

Here's the results for the node.js profiler:

- https://notriddle.com/rustdoc-html-demo-15/trie-perf/index.html

Here's a copy that you can use to try it out. Compare it with [the nightly]. Try typing `typecheckercontext` one character at a time, slowly.

- https://notriddle.com/rustdoc-html-demo-15/compiler-doc-trie/index.html

[the nightly]: https://doc.rust-lang.org/nightly/nightly-rustc/

The fuzzy match algo is based on [Fast String Correction with Levenshtein-Automata] and the corresponding implementation code in [moman] and [Lucene]; the bit-packing representation comes from Lucene, but the actual matcher is more based on `fsc.py`. As suggested in the paper, a trie is used to represent the FSA dictionary.

The same trie is used for prefix matching. Substring matching is done with a side table of three-character[^1] windows that point into the trie.

[Fast String Correction with Levenshtein-Automata]: https://github.com/tpn/pdfs/blob/master/Fast%20String%20Correction%20with%20Levenshtein-Automata%20(2002)%20(10.1.1.16.652).pdf
[Lucene]: https://fossies.org/linux/lucene/lucene/core/src/java/org/apache/lucene/util/automaton/Lev1TParametricDescription.java
[moman]: https://gitlab.com/notriddle/moman-rustdoc

User-visible changes
--------------------

I don't expect anybody to notice anything, but it does cause two changes:

- Substring matches, in the middle of a name, only apply if there's three or more characters in the search query.
- Levenshtein distance limit now maxes out at two. In the old version, the limit was w/3, so you could get looser matches for queries with 9 or more characters[^1] in them.
- It uses more RAM.
- It's faster (assuming you don't swap thrash).

[^1]: technically utf-16 code units
@notriddle notriddle deleted the notriddle/trie-search branch November 14, 2024 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-rustdoc-search Area: Rustdoc's search feature S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. T-rustdoc-frontend Relevant to the rustdoc-frontend team, which will review and decide on the web UI/UX output.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants