Bridging Ancient Wisdom with Modern Technology
A comprehensive digital platform for accessing, analyzing, and exploring Vedic literature through web scraping, database management, NLP analysis, and knowledge graph visualization.
- Overview
- Project Genesis
- Key Features
- Architecture
- Data Model
- Technology Stack
- Project Objectives
- Text Sources
- Installation
- Usage
- Project Structure
- Roadmap
- Contributing
- Risks & Mitigations
- References
- License
- Contact
Code4Ved is a self-directed learning project that combines programming expertise with the study of ancient Indian texts (Vedas, Puranas, Upanishads, Samhitas, Epic Poems). The platform aims to:
- Centralize Access: Aggregate texts from 15+ Sanskrit repositories into a unified database
- Enable Analysis: Apply NLP techniques to extract philosophical concepts and themes
- Visualize Relationships: Create knowledge graphs connecting ancient wisdom with modern science
- Build Learning Paths: Generate interactive roadmaps for systematic study
- Develop Skills: Gain hands-on experience with Python, SQL, web scraping, NLP, and graph databases
Make 1000+ years of ancient wisdom accessible through modern computational tools while creating reusable open-source infrastructure for the Sanskrit studies community.
This project was initiated in April 2025 following extensive research consultations with multiple AI assistants (Gemini, Copilot, Claude, Perplexity, DeepSeek, Grok, Krutrim, and others) exploring ways to combine programming skills with the study of Vedic literature.
- 2025-04-18: Initial question posed: "How can I leverage programming skills while studying Vedas, Puranas, and Upanishads?"
- 2025-04-19: Identified 5 core project ideas and compiled initial list of text sources
- 2025-04-26: Refined technical approach across multiple AI consultations
- 2025-04-27: Consolidated resources and implementation strategies
- 2025-05-19: Generated Mermaid classification charts using Python
- 2025-10-04: Formalized project charter and documentation structure
- Automated text extraction from 15+ Sanskrit repositories
- Support for PDF, HTML, and plain text formats
- Ethical scraping with rate limiting and robots.txt compliance
- Comprehensive scraping logs for audit trails
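The robots.txt compliance and rate limiting described above can be sketched with the standard library alone. This is an illustrative example, not the project's actual scraper; the class and function names are invented:

```python
import time
import urllib.robotparser

# Hypothetical per-source rate limiter: enforces a minimum delay
# between requests so scraping stays polite.
class RateLimiter:
    def __init__(self, min_delay_seconds: float = 2.0):
        self.min_delay = min_delay_seconds
        self._last_request = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self._last_request = time.monotonic()

def allowed_by_robots(robots_txt: str, url_path: str,
                      agent: str = "Code4VedBot") -> bool:
    # Parse an already-fetched robots.txt body and check a path.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url_path)

robots = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""
print(allowed_by_robots(robots, "/texts/rigveda.html"))  # True
print(allowed_by_robots(robots, "/private/admin"))       # False

limiter = RateLimiter(min_delay_seconds=0.1)
limiter.wait()  # first call returns immediately
limiter.wait()  # second call sleeps so requests stay >= 0.1 s apart
```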
Texts classified across 5 dimensions:
- Language: English, Hindi, Sanskrit, Mixed
- Format: PDF, HTML, Plain Text
- Category: Vedas, Puranas, Upanishads, Samhitas, Epic Poems
- Philosophical Concepts: Atman, Brahman, Karma, Dharma, Moksha, Maya, Yoga, Bhakti
- Themes: Spirituality, Philosophy, War, Medicine, Mathematics, Art, Duty, Ritual, Cosmology
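A text record carrying the five dimensions might look like the stdlib sketch below. The enum and dataclass names are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum

# Illustrative enumerations for three of the five classification
# dimensions; concepts and themes are kept as free-form lists.
class Language(Enum):
    ENGLISH = "English"
    HINDI = "Hindi"
    SANSKRIT = "Sanskrit"
    MIXED = "Mixed"

class Format(Enum):
    PDF = "PDF"
    HTML = "HTML"
    PLAIN_TEXT = "Plain Text"

class Category(Enum):
    VEDAS = "Vedas"
    PURANAS = "Puranas"
    UPANISHADS = "Upanishads"
    SAMHITAS = "Samhitas"
    EPIC_POEMS = "Epic Poems"

@dataclass
class ClassifiedText:
    title: str
    language: Language
    format: Format
    category: Category
    concepts: list = field(default_factory=list)  # e.g. ["Atman", "Karma"]
    themes: list = field(default_factory=list)    # e.g. ["Philosophy"]

gita = ClassifiedText(
    title="Bhagavad Gita",
    language=Language.SANSKRIT,
    format=Format.HTML,
    category=Category.EPIC_POEMS,
    concepts=["Dharma", "Karma", "Yoga"],
    themes=["Duty", "Philosophy"],
)
print(gita.category.value)  # Epic Poems
```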
- PostgreSQL: Structured metadata and relationships
- MongoDB: Unstructured annotations and commentary
- Neo4j: Graph database for concept relationships
- SQLite3: Lightweight development and prototyping
- Keyword extraction using NLTK and spaCy
- Topic modeling with LDA
- Concept co-occurrence analysis
- Automated concept tagging with confidence scores
- Manual validation workflow
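The keyword-extraction idea can be illustrated with a minimal TF-IDF scorer in pure stdlib. This is a stand-in for the NLTK/spaCy pipeline (and the toy corpus is invented), showing the scoring principle rather than the project's code:

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list:
    return re.findall(r"[a-z]+", text.lower())

def top_keywords(docs: dict, doc_id: str, k: int = 3) -> list:
    """Rank terms in one document by TF-IDF against the whole corpus."""
    n_docs = len(docs)
    tokens = tokenize(docs[doc_id])
    tf = Counter(tokens)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

corpus = {
    "gita": "karma yoga and dharma guide action; karma binds and karma frees",
    "isha": "the self atman pervades all; renounce and enjoy",
    "katha": "the self atman rides the chariot of the body",
}
print(top_keywords(corpus, "gita"))  # 'karma' should rank first
```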
- Visual representation of relationships between:
- Philosophical concepts (Atman, Brahman, Karma)
- Scientific topics (Quantum Mechanics, Cosmology, Mathematics)
- Texts and their interconnections
- Graph traversal queries for exploration
- Target: >500 nodes and >1000 relationships
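Graph traversal for exploration can be sketched without Neo4j using a plain adjacency map. The node names come from the project's concept list, but the edges here are invented for illustration:

```python
from collections import deque

# Toy in-memory concept graph standing in for the Neo4j store.
graph = {
    "Atman": {"Brahman", "Moksha"},
    "Brahman": {"Atman", "Maya", "Cosmology"},
    "Karma": {"Dharma", "Moksha"},
    "Dharma": {"Karma"},
    "Moksha": {"Atman", "Karma"},
    "Maya": {"Brahman"},
    "Cosmology": {"Brahman"},
}

def concepts_within(start: str, max_hops: int) -> set:
    """Breadth-first traversal: concepts reachable in <= max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue
        for nbr in graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen - {start}

print(sorted(concepts_within("Karma", 2)))  # ['Atman', 'Dharma', 'Moksha']
```

In Neo4j the equivalent exploration would be a variable-length Cypher match such as `MATCH (c {name: 'Karma'})-[*1..2]-(other) RETURN DISTINCT other`.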
- Interactive flowcharts following GitHub roadmap patterns
- Recommended reading order
- Prerequisite mapping
- Exportable as SVG/PNG
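Generating a Mermaid flowchart from a prerequisite map is straightforward; the sketch below shows the idea with an invented (not scholarly) reading order:

```python
# Each entry maps a text to the texts that should be read first.
prerequisites = {
    "Upanishads": ["Vedas"],
    "Bhagavad Gita": ["Upanishads"],
    "Puranas": ["Vedas"],
}

def to_mermaid(prereqs: dict) -> str:
    """Emit Mermaid flowchart source: prerequisite --> dependent edges."""
    lines = ["flowchart TD"]
    for text, deps in prereqs.items():
        for dep in deps:
            # Mermaid node ids cannot contain spaces.
            lines.append(f'    {dep.replace(" ", "_")} --> {text.replace(" ", "_")}')
    return "\n".join(lines)

print(to_mermaid(prerequisites))
```

The resulting text can be pasted into any Mermaid renderer (GitHub renders it natively) and exported to SVG/PNG.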
┌─────────────────────┐
│    Web Scrapers     │ ← Python/GoLang/Rust
│  (BeautifulSoup,    │
│   Scrapy, reqwest)  │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Data Processing   │ ← PyPDF2, text cleaning
│    & Extraction     │   pdfminer.six
└──────────┬──────────┘
           │
           ▼
┌─────────────────────────────────────┐
│            Database Layer           │
│  ┌──────────┬──────────┬─────────┐  │
│  │PostgreSQL│ MongoDB  │  Neo4j  │  │
│  │(metadata)│(comments)│ (graph) │  │
│  └──────────┴──────────┴─────────┘  │
└──────────┬──────────────────────────┘
           │
           ▼
┌─────────────────────┐
│    NLP Pipeline     │ ← NLTK, spaCy
│     (Analysis)      │   scikit-learn
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│    Visualization    │ ← Mermaid, NetworkX
│ & Query Interface   │   Matplotlib, D3.js
└─────────────────────┘
Texts (1) ──── (M) Text_Concepts ──── (M) Concepts
    │                                     │
    │                                     │
    └──── (M) Text_Themes ──── (M) Themes │
                                          │
                         (Graph) Concept_Relationships
                                          │
                         (Graph) Scientific_Topics
Core Tables:
- texts: Title, language, format, category, source URL, content, content_hash
- philosophical_concepts: Concept name, Sanskrit term, description, category
- themes: Theme name, description
- text_concepts: Junction table with occurrence count, context, confidence
- text_themes: Junction table with relevance scores
- scraping_log: Audit trail of all scraping activities
Key Features:
- Full-text search indexes on content and titles
- Content hash for duplicate detection
- Foreign key relationships with CASCADE delete
- CHECK constraints on enumerations
- Confidence scores (0.0-1.0) for automated extractions
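These constraints can be sketched with stdlib sqlite3 (the project's development database). The column set mirrors the description above but is not the project's exact schema:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# CHECK constraints enforce the enumerations; UNIQUE content_hash
# gives cheap duplicate detection.
conn.execute("""
    CREATE TABLE texts (
        text_id      INTEGER PRIMARY KEY,
        title        TEXT NOT NULL,
        language     TEXT CHECK (language IN
                     ('English','Hindi','Sanskrit','Mixed')),
        category     TEXT CHECK (category IN
                     ('Vedas','Puranas','Upanishads','Samhitas','Epic Poems')),
        content      TEXT,
        content_hash TEXT UNIQUE
    )
""")

def insert_text(title, language, category, content):
    digest = hashlib.sha256(content.encode()).hexdigest()
    try:
        conn.execute(
            "INSERT INTO texts (title, language, category, content, content_hash)"
            " VALUES (?, ?, ?, ?, ?)",
            (title, language, category, content, digest),
        )
        return True
    except sqlite3.IntegrityError:  # duplicate content or invalid enum value
        return False

print(insert_text("Isha Upanishad", "Sanskrit", "Upanishads",
                  "isavasyam idam sarvam"))  # True
print(insert_text("Isha (copy)", "Sanskrit", "Upanishads",
                  "isavasyam idam sarvam"))  # False: same content hash
```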
Collection: annotations
{
"text_id": 123,
"annotations": [{
"annotation_id": "uuid",
"type": "commentary|note|translation|question",
"content": "Annotation text",
"tags": ["atman", "metaphysics"],
"verse_reference": "Chapter 2, Verse 15"
}]
}
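A small helper can build documents matching this collection shape. This is a sketch (the function name is invented); with pymongo the resulting dict could be passed to an `insert_one` call:

```python
import uuid

VALID_TYPES = {"commentary", "note", "translation", "question"}

def make_annotation(text_id, ann_type, content, tags=(), verse_reference=None):
    """Build one annotation document in the collection's shape."""
    if ann_type not in VALID_TYPES:
        raise ValueError(f"unknown annotation type: {ann_type}")
    return {
        "text_id": text_id,
        "annotations": [{
            "annotation_id": str(uuid.uuid4()),
            "type": ann_type,
            "content": content,
            "tags": list(tags),
            "verse_reference": verse_reference,
        }],
    }

doc = make_annotation(123, "note", "Atman as the inner self",
                      tags=["atman", "metaphysics"],
                      verse_reference="Chapter 2, Verse 15")
print(doc["annotations"][0]["type"])  # note
```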
Node Types:
- PhilosophicalConcept: {concept_id, name, sanskrit_term, category}
- ScientificTopic: {topic_id, name, field}
- Text: {text_id, title, category}

Relationship Types:
- RELATES_TO: Concept ↔ Concept (with strength, cooccurrence_count)
- CONNECTS_TO: Concept ↔ Scientific Topic (with connection_strength, evidence)
- MENTIONED_IN: Concept → Text (with occurrence_count, prominence)
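The property-graph model can be prototyped in plain Python before moving to Neo4j: each edge carries its relationship type and properties. The data values below are invented for illustration:

```python
# (source, relationship_type, target, properties) tuples standing in
# for Neo4j relationships.
edges = [
    ("Atman", "RELATES_TO", "Brahman",
     {"strength": 0.9, "cooccurrence_count": 42}),
    ("Brahman", "CONNECTS_TO", "Cosmology",
     {"connection_strength": 0.6}),
    ("Karma", "MENTIONED_IN", "Bhagavad Gita",
     {"occurrence_count": 31}),
]

def neighbors(node, rel_type):
    """All targets reachable from node via edges of the given type."""
    return [(dst, props) for src, rel, dst, props in edges
            if src == node and rel == rel_type]

for dst, props in neighbors("Atman", "RELATES_TO"):
    print(dst, props["strength"])  # Brahman 0.9
```

In Cypher the same lookup would be roughly `MATCH (c:PhilosophicalConcept {name: 'Atman'})-[r:RELATES_TO]->(d) RETURN d, r.strength`.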
Component | Technologies | Purpose |
---|---|---|
Web Scraping | Python (requests, BeautifulSoup, Scrapy), Bash (wget, curl), Rust (reqwest, scraper), Go (colly) | Text extraction from websites |
PDF Processing | PyPDF2, pdfminer.six, pdfplumber | Extract text from PDF files |
Databases | PostgreSQL (structured data), MongoDB (unstructured data), Neo4j (graph), SQLite3 (development) | Multi-database architecture |
NLP | NLTK, spaCy, scikit-learn (LDA) | Text analysis, concept extraction, topic modeling |
Visualization | Mermaid.js, Graphviz, NetworkX, Matplotlib, D3.js | Flowcharts, graphs, roadmaps |
Languages | Python (primary), GoLang (performance), Rust (safety), C++ (optimization) | Progressive skill development |
Version Control | Git, GitHub | Code management and collaboration |
Automation | Ansible, Bash scripts | Deployment and task automation |
- Create comprehensive database of 100+ texts from 10+ major repositories
- Build automated web scraping tools for text extraction
- Develop NLP pipeline for Sanskrit text analysis
- Implement knowledge graph connecting philosophy with modern science
- Design interactive learning path visualization
- Extract and classify 100+ texts from at least 10 websites
- Database contains multi-dimensional classifications
- NLP tools achieve >80% accuracy on manual validation
- Knowledge graph contains >500 nodes and >1000 relationships
- Interactive learning roadmap created and accessible
- Well-documented codebase for future enhancements
- β Well-documented codebase for future enhancements
Source | URL | Type | Notes |
---|---|---|---|
Vedic Heritage Portal | https://vedicheritage.gov.in/ | Official govt repository | Comprehensive, high reliability |
GRETIL | http://gretil.sub.uni-goettingen.de/ | Academic | University of Göttingen, peer-reviewed |
Ambuda.org | https://ambuda.org/ | Open Source | Active development, tools available |
GITA Supersite | https://www.gitasupersite.iitk.ac.in/ | Academic | IIT Kanpur, comprehensive |
Sanskrit Library | https://sanskritlibrary.org/ | Academic | High academic standards |
TITUS | https://titus.fkidg1.uni-frankfurt.de/ | Academic | University Frankfurt linguistic database |
IGNCA | https://ignca.gov.in/divisionss/asi-books/ | Government | Indira Gandhi National Centre for the Arts |
Sanskrit DCS | http://www.sanskrit-linguistics.org/dcs/ | Academic | Digital Corpus with linguistic tools |
Source | URL | Type | Notes |
---|---|---|---|
Sanskrit Documents | https://sanskritdocuments.org/ | Community | Large collection, varying quality |
Sacred Texts | https://www.sacred-texts.com/hin/ | Archive | Historical archive, public domain |
Ved Puran | https://vedpuran.net/ | Community | PDF downloads available |
Sanskrit Books | https://www.sanskritebooks.org/ | Digital Library | Wide collection |
Veducation World | https://www.veducation.world/library | Educational | Learning-focused content |
Adhyeta.org | https://www.adhyeta.org.in/ | Learning Platform | Educational resources |
- Ambuda Repository: https://github.com/ambuda-org - Sanskrit tools reference
- Awesome Roadmaps: https://github.com/liuchong/awesome-roadmaps - Learning path patterns
# Python 3.8+
python --version
# PostgreSQL
psql --version
# MongoDB
mongod --version
# Neo4j (optional, for graph database)
neo4j version
# Clone the repository
git clone https://github.com/RustyNails8/Code4Ved.git
cd Code4Ved
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
# Set up databases
# PostgreSQL
createdb code4ved
psql code4ved < schema/postgres_schema.sql
# MongoDB (ensure mongod is running)
mongosh
use vedic_texts
db.createCollection("annotations")
# Neo4j (optional)
# Start Neo4j and create database via web interface
# Copy example config
cp config.example.yml config.yml
# Edit config.yml with your database credentials
nano config.yml
# Scrape a single website
python src/code4ved/scrapers/scrape_text.py --url https://vedicheritage.gov.in/
# Batch scraping from source list
python src/code4ved/scrapers/batch_scrape.py --sources config/sources.yml
# Classify extracted texts
python src/code4ved/classify/classify_texts.py --input data/raw/ --output data/processed/
# Manually review classifications
python src/code4ved/classify/review_tool.py
# Extract keywords
python src/code4ved/nlp/extract_keywords.py --text-id 123
# Topic modeling
python src/code4ved/nlp/topic_model.py --corpus data/processed/
# Concept extraction
python src/code4ved/nlp/extract_concepts.py --text-id 123
# Insert text into database
python src/code4ved/db/insert_text.py --file data/processed/rigveda.txt
# Query database
python src/code4ved/db/query.py --category Vedas --language Sanskrit
# Export to JSON
python src/code4ved/db/export.py --format json --output exports/
# Generate learning roadmap
python src/code4ved/viz/generate_roadmap.py --output assets/roadmap.svg
# Create concept graph
python src/code4ved/viz/concept_graph.py --output assets/concept_graph.html
Code4Ved/
├── README.md                      # This file
├── LICENSE                        # MIT License
├── requirements.txt               # Python dependencies
├── requirements-dev.txt           # Development dependencies
├── pyproject.toml                 # Project configuration
├── Makefile                       # Build automation
│
├── src/                           # Source code
│   └── code4ved/
│       ├── __init__.py
│       ├── scrapers/              # Web scraping modules
│       ├── db/                    # Database operations
│       ├── nlp/                   # NLP pipeline
│       ├── viz/                   # Visualization tools
│       └── utils/                 # Utility functions
│
├── data/                          # Data directory
│   ├── external/                  # Research notes and external data
│   ├── raw/                       # Raw scraped data
│   └── processed/                 # Processed and classified data
│
├── docs/                          # Documentation
│   ├── agents/                    # Agent-based workflow docs
│   ├── guides/                    # User guides
│   └── source/                    # Source documentation
│
├── project-management/            # Project management artifacts
│   ├── 00_inbox/                  # Ideas and triage
│   ├── 01_project-charter/        # Charter, scope, stakeholders
│   ├── 02_research/               # Research notes and references
│   ├── 03_specifications/         # FSD, TSD, NFR, data model
│   ├── 04_planning/               # Roadmap, milestones, WBS
│   ├── 05_design/                 # Architecture, design docs
│   ├── 06_implementation/         # Development plans
│   ├── 07_testing/                # Test strategy and cases
│   ├── 08_release/                # Release plans
│   ├── 09_operations/             # SLA, monitoring, runbooks
│   ├── 10_documentation/          # User/admin guides
│   ├── 11_retrospective/          # Lessons learned
│   └── 12_risks/                  # Risk register, decisions
│
├── tests/                         # Test suite
│   ├── 01_FUT_functional_unit_tests/
│   ├── 02_SIT_system_integration_tests/
│   └── 03_UAT_E2E_end-to-end_tests/
│
├── scripts/                       # Utility scripts
├── examples/                      # Example usage
├── assets/                        # Images, diagrams, charts
├── logs/                          # Application logs
└── prompts/                       # AI agent prompts
    ├── Manager_Agent/
    ├── Implementation_Agent/
    ├── Setup_Agent/
    ├── ad-hoc/
    ├── guides/
    └── schemas/
- Research AI consultations (2025-04-18 to 2025-04-27)
- Identify text sources (15+ websites)
- Build web scraping module (Python)
- Extract texts from 10 primary sources
- Set up PostgreSQL and MongoDB databases
- Implement text classification system
- Populate databases with extracted texts
- Manual validation of 100 sample texts
- Build query interface
- Create database backup procedures
- Develop keyword extraction pipeline
- Implement topic modeling (LDA)
- Build concept extraction tools
- Achieve >80% accuracy on validation set
- Create annotation workflow
- Set up Neo4j graph database
- Define node and relationship types
- Populate graph with philosophical concepts
- Create connections to scientific topics
- Build graph visualization tools
- Generate learning roadmap (Mermaid)
- Create concept relationship graphs
- Build web interface (optional)
- Complete documentation
- Publish to GitHub
- Explore GoLang/Rust implementations for performance
- Add Sanskrit-specific NLP tools
- Implement collaborative annotation features
- Develop mobile application
- Create REST API for external access
- Build recommendation system for reading paths
Contributions are welcome! This is an open-source project aimed at making Vedic literature more accessible.
- Fork the repository
- Create a feature branch:
git checkout -b feature/YourFeature
- Commit changes:
git commit -m 'Add YourFeature'
- Push to branch:
git push origin feature/YourFeature
- Open a Pull Request
- Text Sources: Identify additional reliable Sanskrit repositories
- NLP Improvements: Enhance Sanskrit text processing
- Classification: Help tag and validate text classifications
- Visualization: Improve roadmap and graph visualizations
- Documentation: Improve guides and tutorials
- Testing: Add unit tests and integration tests
- Bug Fixes: Report and fix bugs
- Follow PEP 8 for Python code
- Write docstrings for all functions
- Add unit tests for new features
- Update documentation as needed
Risk | Probability | Impact | Mitigation Strategy |
---|---|---|---|
Copyright/Licensing Violations | Medium | High | - Use only verified public domain sources - Check website ToS before scraping - Maintain clear attribution - Prioritize government/academic sources |
Sanskrit NLP Accuracy Limitations | High | Medium | - Start with English translations - Set confidence thresholds - Implement manual validation - Accept higher curation burden |
Risk | Probability | Impact | Mitigation Strategy |
---|---|---|---|
Website Structure Changes | Medium | Medium | - Build flexible scrapers - Regular testing - Maintain test suite - Have backup sources |
Scope Creep | High | Medium | - Strict feature prioritization - MVP mindset - Track ideas in backlog - Regular scope reviews |
Time Commitment Conflicts | Medium | Medium | - Realistic scheduling - Modular development - Accept longer timeline - Focus on valuable components |
Data Quality Issues | Medium | Medium | - Prioritize high-quality sources - Data validation rules - Standardized preprocessing - Manual quality checks |
Data/Code Loss | Low | High | - Git version control - GitHub remote repository - Regular database backups - Test restore procedures |
Risk | Probability | Impact | Mitigation Strategy |
---|---|---|---|
Technical Complexity Exceeds Skills | Low | Medium | - Start with known technologies - Progressive learning - Extensive tutorials - Simplify if needed |
Database Performance Issues | Low | Low | - Proper indexing - Query optimization - Monitor performance - Implement caching |
AI Consultations (2025-04-18 to 2025-04-27):
- Gemini (Google AI): Project approach, technical architecture
- Copilot (Microsoft/GitHub): Implementation strategies, NLP pipelines
- Claude (Anthropic): Systematic approach, learning applications
- Perplexity AI: Academic connections, research papers
- DeepSeek AI: Code examples, implementations
- Grok (xAI): Comprehensive methodology
- Meta AI: Technology options, knowledge graphs
- Krutrim AI: Indian context, Sanskrit NLP tools
- Khoj AI: Text analysis, database creation
- DuckDuckGo AI: Computational approaches
- Poe Platform: Tool development
Programming & Tools:
- Python Software Foundation - https://python.org
- PostgreSQL Documentation - https://postgresql.org
- MongoDB Documentation - https://mongodb.com
- Neo4j Documentation - https://neo4j.com
- NLTK Project - https://www.nltk.org
- spaCy - https://spacy.io
- BeautifulSoup & Scrapy - Web scraping libraries
Recommended Books:
- "The Upanishads" by Eknath Easwaran (Nilgiri Press)
- "The Bhagavata Purana" by Bibek Debroy (Penguin)
- "The Rust Programming Language" by Steve Klabnik, Carol Nichols
- "The Go Programming Language" by Alan Donovan, Brian Kernighan
Academic Resources:
- Digital Humanities best practices
- Sanskrit NLP research papers
- Graph database optimization guides
- Web scraping ethics guidelines
- All Sanskrit texts from public domain (>70 years old) or openly licensed sources
- Proper attribution maintained for all source repositories
- Compliance with robots.txt for web scraping
- Creative Commons and GPL licenses respected
This project is licensed under the MIT License - see the LICENSE file for details.
This project is committed to open-source principles:
- Free to use, modify, and distribute
- Contributions welcome from the community
- Tools designed for reusability
- Documentation provided for transparency
Project Maintainer: Sumit Das
Project Repository: https://github.com/RustyNails8/Code4Ved
Issues & Support: Please use GitHub Issues for bug reports, feature requests, and questions.
Discussions: Join the discussion on GitHub Discussions for general questions and community interaction.
- All AI assistants consulted during research phase (April 2025)
- Sanskrit text repository maintainers for preserving ancient wisdom
- Open-source community for tools and libraries
- Academic institutions maintaining high-quality text databases
- Digital Humanities community for methodological guidance
- Started: April 2025
- Status: Active Development
- Primary Language: Python
- Database Systems: 3 (PostgreSQL, MongoDB, Neo4j)
- Text Sources: 15+ websites
- Target Texts: 100+ documents
- Target Concepts: 100-200 unique concepts
- Target Graph Relationships: 1000-5000 edges
This project serves as:
- Learning Platform: Hands-on experience with modern data engineering
- Skill Development: Python, SQL, NLP, web scraping, graph databases
- Cultural Bridge: Connecting ancient philosophy with modern technology
- Portfolio Project: Demonstrable work for career advancement
- Community Contribution: Open-source tools for Sanskrit studies
- Interdisciplinary Study: Philosophy, spirituality, computer science, linguistics
Made with ❤️ for ancient wisdom and modern technology
Last Updated: October 4, 2025