A simple and extensible web crawler written in Go.
This project is a basic web crawler designed to fetch, parse, and archive web pages into a MongoDB database. It demonstrates core crawling techniques such as queue-based URL management, polite crawling, parallel fetching, HTML parsing, and duplicate avoidance.
## Features

- Fetches and parses HTML from web pages
- Extracts page titles, main body content, and hyperlinks
- Enqueues new links for further crawling (breadth-first)
- Polite crawling with delays between requests
- Prevents duplicate crawling using hashed URL tracking (see the sketch after this list)
- Stores page data (URL, title, content) in MongoDB
- Modular structure (crawler, queue, db, main)
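
Duplicate avoidance via hashed URL tracking usually means hashing each URL and remembering which hashes have already been seen. The following is a minimal sketch of that idea in Go; the `Visited` type and `MarkSeen` method are illustrative names and may not match the crawler package's actual code:

```go
package crawler

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"
)

// Visited tracks URLs that have already been crawled by storing a
// SHA-256 hash of each URL instead of the full string.
type Visited struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

// NewVisited returns an empty tracker.
func NewVisited() *Visited {
	return &Visited{seen: make(map[string]struct{})}
}

// MarkSeen records the URL and reports whether it was new;
// it returns false for URLs that were already seen.
func (v *Visited) MarkSeen(rawURL string) bool {
	sum := sha256.Sum256([]byte(rawURL))
	key := hex.EncodeToString(sum[:])

	v.mu.Lock()
	defer v.mu.Unlock()
	if _, ok := v.seen[key]; ok {
		return false
	}
	v.seen[key] = struct{}{}
	return true
}
```

In this sketch, the crawler would call `MarkSeen` before enqueueing a link and skip the URL whenever it returns false.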
## Prerequisites

- Go 1.24+
- A running MongoDB instance (local or remote)
- git for cloning the repository
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/sahitya-chandra/web-crawler.git
   cd web-crawler
   ```
2. Install dependencies:

   ```bash
   go mod download
   ```
3. Set up environment variables by creating a `.env` file in the root directory (a sketch of how this value might be loaded appears after this list):

   ```env
   MONGODB_URI=mongodb://localhost:27017
   ```
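
How `main.go` picks up `MONGODB_URI` is not shown here; a common pattern, assumed rather than confirmed from this repository, is to load the `.env` file with `github.com/joho/godotenv` and read the value through `os.Getenv`:

```go
package main

import (
	"log"
	"os"

	"github.com/joho/godotenv"
)

// loadMongoURI loads .env (if present) and returns MONGODB_URI,
// exiting with an error when the variable is missing.
func loadMongoURI() string {
	// godotenv.Load reads key=value pairs from .env into the process environment.
	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found, relying on the existing environment")
	}

	uri := os.Getenv("MONGODB_URI")
	if uri == "" {
		log.Fatal("MONGODB_URI is not set")
	}
	return uri
}

func main() {
	log.Println("using MongoDB at", loadMongoURI())
}
```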
## Usage

1. Run the crawler:

   ```bash
   go run main.go
   ```

   By default, the crawler starts at `https://example.com/` and archives up to 500 pages (both can be changed in `main.go`).
2. Database output:

   Crawled web pages are stored in the `crawlerArchive.webpages` collection in MongoDB. Each document contains the following fields (a sketch of the corresponding Go struct follows this list):

   - `url`: the crawled URL
   - `title`: the page title
   - `content`: the main body content (first 500 words)
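
With the official MongoDB Go driver (`go.mongodb.org/mongo-driver`), a document of that shape could be written roughly as follows; the `Page` struct and `SavePage` helper are illustrative names, not necessarily what the `db/` package exposes:

```go
package db

import (
	"context"

	"go.mongodb.org/mongo-driver/mongo"
)

// Page mirrors the fields stored for each crawled page.
type Page struct {
	URL     string `bson:"url"`
	Title   string `bson:"title"`
	Content string `bson:"content"`
}

// SavePage inserts one crawled page into the crawlerArchive.webpages collection.
func SavePage(ctx context.Context, client *mongo.Client, p Page) error {
	coll := client.Database("crawlerArchive").Collection("webpages")
	_, err := coll.InsertOne(ctx, p)
	return err
}
```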
## Project Structure

- `main.go` - entry point; orchestrates queueing, crawling, and database storage
- `crawler/` - fetches and parses HTML, extracts links and content
- `queue/` - thread-safe queue implementation for URLs (see the sketch after this list)
- `db/` - MongoDB connection and basic storage helpers
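
A minimal thread-safe FIFO along the lines of what `queue/` is described as providing might look like this; the API shown is a sketch, not the package's confirmed interface:

```go
package queue

import "sync"

// Queue is a simple thread-safe FIFO of URLs.
type Queue struct {
	mu    sync.Mutex
	items []string
}

// Enqueue appends a URL to the back of the queue.
func (q *Queue) Enqueue(url string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, url)
}

// Dequeue removes and returns the URL at the front of the queue.
// The second return value is false when the queue is empty.
func (q *Queue) Dequeue() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return "", false
	}
	url := q.items[0]
	q.items = q.items[1:]
	return url, true
}
```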
## Example Output

```text
Crawled: https://example.com/, Title: Example Domain
```
## Notes

- Adjust the starting URL, maximum pages, or crawling logic in `main.go` as needed (a rough sketch of that loop follows this list).
- The MongoDB URI and other secrets are managed via the `.env` file.
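
For orientation, the overall crawl loop in `main.go` is probably some variation of the breadth-first sketch below. The one-second delay and the `fetchPage` helper are assumptions for illustration; the start URL and 500-page limit are the defaults mentioned above:

```go
package main

import (
	"log"
	"time"
)

func main() {
	const (
		startURL = "https://example.com/" // default start page
		maxPages = 500                    // default archive limit
	)

	// A plain slice stands in for the queue package, and a map of
	// URLs stands in for the hashed-URL tracker.
	toVisit := []string{startURL}
	visited := make(map[string]bool)
	crawled := 0

	for len(toVisit) > 0 && crawled < maxPages {
		url := toVisit[0]
		toVisit = toVisit[1:]
		if visited[url] {
			continue
		}
		visited[url] = true

		title, links := fetchPage(url)
		log.Printf("Crawled: %s, Title: %s", url, title)
		toVisit = append(toVisit, links...)
		crawled++

		// Polite crawling: pause between requests (delay length is an assumption).
		time.Sleep(1 * time.Second)
	}
}

// fetchPage is a stub so the sketch compiles; the real crawler package
// performs the HTTP request, HTML parsing, and link extraction.
func fetchPage(url string) (title string, links []string) {
	return "Example Domain", nil
}
```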
## License

This project is for educational and demo purposes; no license is specified.

## Contributing

Pull requests and suggestions are welcome!