This is a tool for performing recursive regular expression content searches of a webpage. Recursive here means exploring pages linked to in the content of a page.
usage: webgrep.rkt [ <option> ... ]
where <option> is one of
*-c <regexp>, --content-pat <regexp> : Patterns to find in page contents.
-u <regexp>, --url-pat <regexp> : Pattern of links to find in page contents.
Default: any link
-r <url>, --root <url> : Url at which to root search.
-d <natural>, --depth <natural> : Depth of linked pages to traverse.
Default: 2
-l, --links-only : Only show links of pages with matching content,
not content matches as well.
-j <natural>, --thread-limit <natural> : Maximum number of threads to search in parallel.
Default: 50
--help, -h : Show this help
-- : Do not treat any remaining argument as a switch (at this level)
* Asterisks indicate options allowed multiple times.
Multiple single-letter switches can be combined after one `-'; for
example: `-h-' is the same as `-h --'
usage: wgrep.py [-h] [-l LINK] [-u URLS] [-c CONTENT] [-d DEPTH] [-n]
optional arguments:
-h, --help show this help message and exit
-l LINK, --link LINK The link to root the search. *This is a mandatory
argument.*
-u URLS, --urls URLS Regexp for urls to search. By default all urls are
searched.
-c CONTENT, --content CONTENT
Regexp for page content to find. *This is a mandatory
argument.*
-d DEPTH, --depth DEPTH
The page depth to search. Default: 2.
-n, --negate Negate content regexp: Print only page urls that do
*not* contain the content pattern.
Racket version:
Python version:
- Python 2.7
- BeautifulSoup
- Pretty printing with colors
- More machine-parsable output
Option for case sensitivityMultiple patternsImprove performance (?)Use the racket implementation!
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.