A simple and extensible web crawler written in Go.
This project is a basic web crawler designed to fetch, parse, and archive web pages into a MongoDB database. It demonstrates core crawling techniques such as queue-based URL management, polite crawling, parallel fetching, HTML parsing, and duplicate avoidance.
## Features

- Fetches and parses HTML from web pages
- Extracts page titles, main body content, and hyperlinks
- Enqueues new links for further crawling (breadth-first)
- Polite crawling with delays between requests
- Prevents duplicate crawling using hashed URL tracking (see the sketch after this list)
- Stores page data (URL, title, content) in MongoDB
- Modular structure (crawler, queue, db, main)
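
Duplicate avoidance via hashed URL tracking usually means hashing each URL and remembering which hashes have already been seen. The following is a minimal sketch of that idea in Go; the `Visited` type and `MarkSeen` method are illustrative names and may not match the crawler package's actual code:

```go
package crawler

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"
)

// Visited tracks URLs that have already been crawled by storing a
// SHA-256 hash of each URL instead of the full string.
type Visited struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

// NewVisited returns an empty tracker.
func NewVisited() *Visited {
	return &Visited{seen: make(map[string]struct{})}
}

// MarkSeen records the URL and reports whether it was new;
// it returns false for URLs that were already seen.
func (v *Visited) MarkSeen(rawURL string) bool {
	sum := sha256.Sum256([]byte(rawURL))
	key := hex.EncodeToString(sum[:])

	v.mu.Lock()
	defer v.mu.Unlock()
	if _, ok := v.seen[key]; ok {
		return false
	}
	v.seen[key] = struct{}{}
	return true
}
```

In this sketch, the crawler would call `MarkSeen` before enqueueing a link and skip the URL whenever it returns false.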
## Prerequisites

- Go 1.24+
- A running MongoDB instance (local or remote)
- git for cloning the repository
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/sahitya-chandra/web-crawler.git
   cd web-crawler
   ```
2. Install dependencies:

   ```bash
   go mod download
   ```
3. Set up environment variables by creating a `.env` file in the root directory (a sketch of how this value might be loaded appears after this list):

   ```env
   MONGODB_URI=mongodb://localhost:27017
   ```
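
How `main.go` picks up `MONGODB_URI` is not shown here; a common pattern, assumed rather than confirmed from this repository, is to load the `.env` file with `github.com/joho/godotenv` and read the value through `os.Getenv`:

```go
package main

import (
	"log"
	"os"

	"github.com/joho/godotenv"
)

// loadMongoURI loads .env (if present) and returns MONGODB_URI,
// exiting with an error when the variable is missing.
func loadMongoURI() string {
	// godotenv.Load reads key=value pairs from .env into the process environment.
	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found, relying on the existing environment")
	}

	uri := os.Getenv("MONGODB_URI")
	if uri == "" {
		log.Fatal("MONGODB_URI is not set")
	}
	return uri
}

func main() {
	log.Println("using MongoDB at", loadMongoURI())
}
```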
## Usage

1. Run the crawler:

   ```bash
   go run main.go
   ```

   By default, the crawler starts at `https://example.com/` and archives up to 500 pages (both can be changed in `main.go`).
2. Database output:

   Crawled web pages are stored in the `crawlerArchive.webpages` collection in MongoDB. Each document contains the following fields (a sketch of the corresponding Go struct follows this list):

   - `url`: the crawled URL
   - `title`: the page title
   - `content`: the main body content (first 500 words)
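
With the official MongoDB Go driver (`go.mongodb.org/mongo-driver`), a document of that shape could be written roughly as follows; the `Page` struct and `SavePage` helper are illustrative names, not necessarily what the `db/` package exposes:

```go
package db

import (
	"context"

	"go.mongodb.org/mongo-driver/mongo"
)

// Page mirrors the fields stored for each crawled page.
type Page struct {
	URL     string `bson:"url"`
	Title   string `bson:"title"`
	Content string `bson:"content"`
}

// SavePage inserts one crawled page into the crawlerArchive.webpages collection.
func SavePage(ctx context.Context, client *mongo.Client, p Page) error {
	coll := client.Database("crawlerArchive").Collection("webpages")
	_, err := coll.InsertOne(ctx, p)
	return err
}
```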
## Project Structure

- `main.go` - entry point; orchestrates queueing, crawling, and database storage
- `crawler/` - fetches and parses HTML, extracts links and content
- `queue/` - thread-safe queue implementation for URLs (see the sketch after this list)
- `db/` - MongoDB connection and basic storage helpers
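
A minimal thread-safe FIFO along the lines of what `queue/` is described as providing might look like this; the API shown is a sketch, not the package's confirmed interface:

```go
package queue

import "sync"

// Queue is a simple thread-safe FIFO of URLs.
type Queue struct {
	mu    sync.Mutex
	items []string
}

// Enqueue appends a URL to the back of the queue.
func (q *Queue) Enqueue(url string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, url)
}

// Dequeue removes and returns the URL at the front of the queue.
// The second return value is false when the queue is empty.
func (q *Queue) Dequeue() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return "", false
	}
	url := q.items[0]
	q.items = q.items[1:]
	return url, true
}
```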
## Example Output

```text
Crawled: https://example.com/, Title: Example Domain
```
## Notes

- Adjust the starting URL, maximum pages, or crawling logic in `main.go` as needed (a rough sketch of that loop follows this list).
- The MongoDB URI and other secrets are managed via the `.env` file.
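
For orientation, the overall crawl loop in `main.go` is probably some variation of the breadth-first sketch below. The one-second delay and the `fetchPage` helper are assumptions for illustration; the start URL and 500-page limit are the defaults mentioned above:

```go
package main

import (
	"log"
	"time"
)

func main() {
	const (
		startURL = "https://example.com/" // default start page
		maxPages = 500                    // default archive limit
	)

	// A plain slice stands in for the queue package, and a map of
	// URLs stands in for the hashed-URL tracker.
	toVisit := []string{startURL}
	visited := make(map[string]bool)
	crawled := 0

	for len(toVisit) > 0 && crawled < maxPages {
		url := toVisit[0]
		toVisit = toVisit[1:]
		if visited[url] {
			continue
		}
		visited[url] = true

		title, links := fetchPage(url)
		log.Printf("Crawled: %s, Title: %s", url, title)
		toVisit = append(toVisit, links...)
		crawled++

		// Polite crawling: pause between requests (delay length is an assumption).
		time.Sleep(1 * time.Second)
	}
}

// fetchPage is a stub so the sketch compiles; the real crawler package
// performs the HTTP request, HTML parsing, and link extraction.
func fetchPage(url string) (title string, links []string) {
	return "Example Domain", nil
}
```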
## License

This project is for educational and demo purposes; no license is specified.

## Contributing

Pull requests and suggestions are welcome!