

benjaminma
Contributor

Lowercase element names as they are encountered, so that the tag switch block in walk() also matches <A>, <P>, <TaBLe>, etc.
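To illustrate the idea, here is a minimal sketch in Python, not the project's actual code. The `Node` class, the `BLOCK_TAGS` set, and this `walk()` signature are all hypothetical stand-ins; the only point is that the element name is normalized once, before the dispatch.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical minimal DOM node, assumed for illustration only.
    tag: str = ""    # element name as written in the source, e.g. "TaBLe"; "" for text nodes
    text: str = ""   # text content, used only by text nodes
    children: list = field(default_factory=list)

# Assumed set of tags that end with a line break; names are stored lowercase.
BLOCK_TAGS = {"p", "table", "div"}

def walk(node, out):
    """Append the text found under `node` to `out` and return `out`."""
    if not node.tag:
        out.append(node.text)
        return out
    name = node.tag.lower()  # normalize once, so "P", "p" and "TaBLe" all dispatch correctly
    for child in node.children:
        walk(child, out)
    if name in BLOCK_TAGS:
        out.append("\n")
    return out

# Usage: mixed-case tags are treated the same as their lowercase forms.
tree = Node(tag="P", children=[
    Node(text="Hello "),
    Node(tag="A", children=[Node(text="world")]),
])
print(repr("".join(walk(tree, []))))  # prints 'Hello world\n'
```

Without the `lower()` call, the `"P"` element would not match `"p"` in `BLOCK_TAGS` and the trailing line break would be lost.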

@mlegenhausen
Member

Thanks for the pull request. I will take a look.

@benjaminma
Contributor Author

Ah, I see. What do you think about terminating at mailto: anchors and ignoring their inner text, while walking the children of any non-mailto: anchor? Is it necessary to output the href in those cases?

e.g. <a href="http://www.google.com">Google</a> or
<a href="#more-something"><img src="something-thumb.jpg"><div class="caption"><span>Something caption...</span></div></a>

My use case is to extract as much of the available text as possible from an HTML doc.
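A rough sketch of the proposed anchor handling, in Python rather than the project's own code. The `Node` class and `extract_text()` are hypothetical names. What a mailto: anchor should emit is not stated above, so printing the address from the href is purely an assumption here.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical minimal DOM node, assumed for illustration only.
    tag: str = ""    # element name; "" for text nodes
    text: str = ""   # text content, used only by text nodes
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

def extract_text(node, out):
    """Append the text found under `node` to `out` and return `out`."""
    if not node.tag:
        out.append(node.text)
        return out
    name = node.tag.lower()
    href = node.attrs.get("href", "")
    if name == "a" and href.lower().startswith("mailto:"):
        # Assumption: emit the address and stop, ignoring the anchor's inner text.
        out.append(href[len("mailto:"):])
        return out
    # Any other element, including non-mailto anchors: walk the children, drop the href.
    for child in node.children:
        extract_text(child, out)
    return out

# Usage with the second example above: only the caption text survives.
thumb = Node(tag="a", attrs={"href": "#more-something"}, children=[
    Node(tag="img", attrs={"src": "something-thumb.jpg"}),
    Node(tag="div", attrs={"class": "caption"}, children=[
        Node(tag="span", children=[Node(text="Something caption...")]),
    ]),
])
print("".join(extract_text(thumb, [])))  # prints Something caption...
```

In this sketch the href of a non-mailto anchor is never printed, which is one possible answer to the question above rather than a settled design.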

@benjaminma
Contributor Author

I think I intended to send the PR from a feature branch. Let me see if I can split up these commits into a separate issue.

@benjaminma
Contributor Author

Split request. See #7 and #8.

@benjaminma benjaminma closed this Jul 11, 2013


2 participants