Use markdown + html parsers to detect codeblocks #107

jez · 2016-05-30T22:44:34Z

Relevant Issues

Summary of Changes

This commit builds upon the markdown-to-ast integration we added in #91,
using markdown-to-ast in lib/get-html-headers.js.

We first filter our lines to only look for HTML sections (using the
markdown parser), then in these HTML-only sections, we search for headers,
as long as they're not inside pre tags.

This is accomplished by keeping a stack of tags we've descended through. We
can peek at the top of the stack to see if we're directly under a
pre tag, meaning we should ignore everything until we see the closing
pre. Otherwise, we can collect text from inside heading tags as normal.

Thus, the core changes in this commit are:

grabbing is a stack now, instead of a string
markdown-to-ast lets us filter for only HTML blocks
htmlparser2 lets us check if we're inside a heading or pre tag

Test Plan

This pull request adds two passing tests:

one that makes sure we eliminate bad headings nested in pre tags, and
one that would fail a non-robust, regex-based search for code blocks,
excluding headings that should have been included

Basically, the first checks for false positives, and the second checks
for false negatives.

@yomed

Both of these tests should be fixed before we can consider thlorenz#37, thlorenz#38, and thlorenz#88 as resolved. Credit goes to @yomed for `test/fixtures/readme-with-code.md`.

jez · 2016-05-30T22:45:22Z

/cc @Anthropic

Fixes thlorenz#37, thlorenz#38, thlorenz#84, thlorenz#88, thlorenz#99. This commit builds upon the markdown-to-ast integration we added in thlorenz#91, using markdown-to-ast in `lib/get-html-headers.js`. We first filter our lines to only look for HTML sections (using the markdown parser), then in these HTML-only sections, we search for headers, as long as they're not inside pre tags. This is accomplished by keeping a stack of tags we've descended through. We can peek at the top of the stack to see if we're directly under a `pre` tag, meaning we should ignore everything until we see the closing `pre`. Otherwise, we can collect text from inside heading tags as normal. Thus, the core changes in this commit are: - `grabbing` is a stack now, instead of a string - `markdown-to-ast` lets us filter for only HTML blocks - `htmlparser2` lets us check if we're inside a heading *or* pre tag

thlorenz · 2016-06-01T16:30:09Z

Awesome work - fixes so many issues :) !
LGTM just hoping that the markdown to ast part doesn't affect performance especially for large Readmes too much. Have you done a quick time ... test of this branch against master?

However correctness is preferred either way (just interested how big the hit is), so please merge to master and I'll publish a new version.

Thanks.

jez · 2016-06-01T16:55:23Z

LGTM just hoping that the markdown to ast part doesn't affect performance especially for large Readmes too much. Have you done a quick time ... test of this branch against master?

Good call. Not sure what the best way to test for regressions is, but it seems to still do fine on the test suite and on one ~600 line README from one of my other projects:

# origin/master
npm test &> /dev/null  1.55s user 0.20s system 104% cpu 1.661 total
doctoc.js test/fixtures/cmugpi.github.io-README.md  0.30s user 0.04s system 96% cpu 0.352 total

# jez/fiz/headers-in-codeblocks
npm test &> /dev/null  1.61s user 0.20s system 104% cpu 1.726 total
doctoc.js test/fixtures/cmugpi.github.io-README.md  0.42s user 0.05s system 102% cpu 0.453 total

	npm test	cmugpi
origin/master	1.661s	0.352s
jez/fix/headers-in-codeblocks	1.726	0.453s

The test suite includes a lot of small tests, as well as a couple a couple longer fixture files. The extra README I found is ~2.5x longer than any of the fixtures. We're not going up by more than 0.1s in either case. If we really cared about performance, we could probably markdown parse the file once (we're doing it twice right now).

thlorenz · 2016-06-01T17:32:59Z

Thanks published as minor upgrade (since some behavior may have changed).

Add two failing header-in-codeblock tests

86f781a

Both of these tests should be fixed before we can consider thlorenz#37, thlorenz#38, and thlorenz#88 as resolved. Credit goes to @yomed for `test/fixtures/readme-with-code.md`.

jez assigned thlorenz May 30, 2016

jez force-pushed the fix/headers-in-codeblocks branch from 8321b9c to 8612165 Compare May 30, 2016 23:04

jez force-pushed the fix/headers-in-codeblocks branch from 8612165 to 5ae9a34 Compare May 30, 2016 23:17

thlorenz assigned jez Jun 1, 2016

jez merged commit 5ae9a34 into thlorenz:master Jun 1, 2016

jez deleted the fix/headers-in-codeblocks branch June 1, 2016 16:57

This was referenced Jun 1, 2016

skip contents in <pre></pre> #38

Closed

ignore headers within code blocks #84

Closed

Do not parse code #88

Closed

doctoc failing with https://github.com/iogf/untwisted/blob/master/BOOK.md #99

Closed

RyanZim mentioned this pull request Aug 30, 2016

Can't regex process <hX> when X is an integer #36

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Use markdown + html parsers to detect codeblocks #107

Use markdown + html parsers to detect codeblocks #107

Uh oh!

jez commented May 30, 2016 •

edited

Loading

Uh oh!

jez commented May 30, 2016

Uh oh!

thlorenz commented Jun 1, 2016 •

edited

Loading

Uh oh!

jez commented Jun 1, 2016 •

edited

Loading

Uh oh!

thlorenz commented Jun 1, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Use markdown + html parsers to detect codeblocks #107

Use markdown + html parsers to detect codeblocks #107

Uh oh!

Conversation

jez commented May 30, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jez commented May 30, 2016

Uh oh!

thlorenz commented Jun 1, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jez commented Jun 1, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thlorenz commented Jun 1, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jez commented May 30, 2016 •

edited

Loading

thlorenz commented Jun 1, 2016 •

edited

Loading

jez commented Jun 1, 2016 •

edited

Loading