code-sauce · Copilot · Oct 10, 2025 · Oct 10, 2025 · Oct 10, 2025
diff --git a/README.md b/README.md
@@ -2,26 +2,49 @@
 
 Privacy-first document tagger using local AI. Automatically tags your files with up to 10 relevant tags.
 
-## Quick Start
-
-```bash
-python setup.py        # Interactive setup
-python docsifter.py    # Run the app
-```
-
 ## Features
 
 - 🔒 **100% Private** - All AI processing happens locally
 - 🏷️ **Smart Tagging** - Up to 10 tags per file
 - 🔍 **Search** - Find files by name or tags
 - 💼 **Financial Docs** - Detects I797, bank statements, tax forms, etc.
 - 💾 **Local Storage** - Tags saved in JSON database
+- 🌐 **Modern Web UI** - Clean, responsive interface using HTML/CSS/JS
+- 🎯 **Intelligent Filtering** - Only processes text-based documents with AI, uses heuristics for binary files
+
+## Two UIs Available
+
+### Web UI (Recommended)
+Modern, responsive web interface with real-time updates:
+```bash
+python web_ui.py
+# Then open http://localhost:5000 in your browser
+```
+
+### Desktop UI (Legacy)
+Traditional tkinter-based desktop application:
+```bash
+python docsifter.py
+```
+
+## Quick Start
+
+```bash
+python setup.py        # Interactive setup
+python web_ui.py       # Run the web app
+```
 
 ## How It Works
 
-1. Select folder (Documents, Desktop, Downloads)
-2. Click "Scan & Tag Files"
-3. Search your tagged files
+1. Choose between Web UI (recommended) or Desktop UI
+2. Select folder (Documents, Desktop, Downloads)
+3. Click "Scan & Tag Files"
+4. Search your tagged files
+
+**Important Notes:**
+- **Text-based documents** (PDF, Word, Excel, etc.) are analyzed by AI for intelligent tagging
+- **Binary files** (images, videos, audio, archives) get basic heuristic tags only - AI doesn't guess content it can't see
+- AI is conservative and only tags based on clear evidence in filenames and extensions
 
 Tags include:
 - Document type (pdf, image, spreadsheet)

diff --git a/USAGE.md b/USAGE.md
@@ -0,0 +1,163 @@
+# DocSifter Usage Guide
+
+## Quick Start
+
+### Option 1: Web UI (Recommended)
+
+```bash
+# Start the web server
+python web_ui.py
+
+# Open in browser
+http://localhost:5000
+```
+
+### Option 2: Desktop UI (Legacy)
+
+```bash
+# Start the tkinter app
+python docsifter.py
+```
+
+## Key Features
+
+### 1. Intelligent File Processing
+
+**Text-based documents** (AI-enabled when Ollama is running):
+- PDF, Word, Excel, PowerPoint documents
+- Code files (Python, JavaScript, Java, etc.)
+- Text files, JSON, YAML, Markdown
+- HTML, CSS, configuration files
+
+**Binary files** (Heuristic-only, no AI guessing):
+- Images: JPG, PNG, GIF
+- Videos: MP4, MOV
+- Audio: MP3, WAV
+- Archives: ZIP, RAR
+
+### 2. Conservative AI Tagging
+
+When Ollama is available, the AI:
+- Only tags based on clear evidence in filename and extension
+- Does NOT guess or hallucinate content
+- Returns minimal tags for generic filenames (e.g., "document.pdf" → ["document", "pdf"])
+- Returns detailed tags for specific filenames (e.g., "invoice_jan2024.pdf" → ["document", "pdf", "financial", "invoice", "2024"])
+
+### 3. Web UI Features
+
+- **Scan & Tag**: Select a folder and scan up to 20 files
+- **Real-time Progress**: Visual progress bar with status messages
+- **Search**: Type to filter by filename or tag (300ms debounce)
+- **Status Display**: Shows AI availability and file count
+- **Responsive Design**: Works on desktop and mobile
+
+## API Endpoints
+
+The web UI exposes these REST endpoints:
+
+- `GET /api/status` - System status and configuration
+- `POST /api/scan` - Start scanning a folder
+- `GET /api/progress` - Get scan progress
+- `GET /api/search?q=query` - Search tagged files
+- `GET /api/files` - Get all tagged files
+
+## Examples
+
+### Example 1: Financial Documents
+
+```
+invoice_jan2024.pdf → ["document", "pdf", "financial", "invoice", "billing", "2024", "recent"]
+bank_statement_2024.pdf → ["document", "pdf", "financial", "bank-statement", "banking", "2024", "recent"]
+```
+
+### Example 2: Immigration Documents
+
+```
+I797_approval.pdf → ["document", "pdf", "immigration", "i797", "uscis", "approval"]
+passport_copy.pdf → ["document", "pdf", "immigration", "passport", "travel", "id"]
+```
+
+### Example 3: Binary Files (No AI Guessing)
+
+```
+photo.jpg → ["image", "photo"]
+video.mp4 → ["video", "media"]
+screenshot.png → ["image", "graphic", "screenshot"]
+```
+
+## Configuration
+
+### Environment Variables
+
+Create a `.env` file:
+
+```bash
+OLLAMA_MODEL=llama3.2
+```
+
+### Supported Folders
+
+Default folders can be modified in the code:
+- Documents
+- Desktop
+- Downloads
+
+## Troubleshooting
+
+### Ollama Not Available
+
+If you see "⚠️ Heuristic Mode", Ollama is not running:
+
+```bash
+# Start Ollama (if installed)
+ollama serve
+
+# Or install Ollama
+brew install ollama  # macOS
+# Visit https://ollama.ai for other platforms
+```
+
+### Search Not Working
+
+The search feature:
+- Has a 300ms debounce (wait after typing)
+- Searches both filename and tags
+- Is case-insensitive
+- Returns empty results if no matches
+
+### Tags Not Saved
+
+Tags are saved to `tags_database.json` in the project directory. Make sure you have write permissions.
+
+## Development
+
+### Running Tests
+
+```bash
+# Test original UI
+python test_docsifter.py
+
+# Test web UI
+python test_web_ui.py
+```
+
+### Adding New File Types
+
+Edit the `is_text_based_file()` method in `web_ui.py` or `docsifter.py`:
+
+```python
+text_extensions = {
+    '.pdf', '.doc', '.docx', '.txt',
+    # Add your extension here
+    '.your_extension'
+}
+```
+
+### Customizing Tags
+
+Edit `generate_tags_heuristic()` to add custom tagging rules:
+
+```python
+if 'your_keyword' in name_lower:
+    tags.extend(['your', 'custom', 'tags'])
+```
diff --git a/docsifter.py b/docsifter.py
@@ -316,12 +316,33 @@ def scan_and_tag_files(self, folder_path: str, folder_name: str):
         finally:
             self.progress.stop()
 
+    def is_text_based_file(self, file_ext: str) -> bool:
+        """Check if file is text-based (not binary like images, videos, audio)"""
+        text_extensions = {
+            '.pdf', '.doc', '.docx', '.txt', '.rtf', '.odt',
+            '.xlsx', '.xls', '.csv', '.ods',
+            '.pptx', '.ppt', '.odp',
+            '.html', '.htm', '.xml', '.json', '.yaml', '.yml',
+            '.md', '.markdown', '.rst',
+            '.py', '.js', '.java', '.c', '.cpp', '.h', '.cs',
+            '.go', '.rs', '.rb', '.php', '.swift', '.kt',
+            '.css', '.scss', '.sass', '.less',
+            '.sql', '.sh', '.bash', '.ps1',
+            '.log', '.cfg', '.conf', '.ini', '.env'
+        }
+        return file_ext in text_extensions
+
     def generate_tags(self, file_path: str) -> List[str]:
         """Generate tags for a file using AI or fallback heuristics"""
         file_name = os.path.basename(file_path)
         file_ext = pathlib.Path(file_path).suffix.lower()
 
-        # Try AI-based tagging if available
+        # For binary files (images, videos, audio), only use basic heuristic tagging
+        # Don't use AI as it can't read the content and would just guess
+        if not self.is_text_based_file(file_ext):
+            return self.generate_tags_heuristic(file_name, file_ext)[:10]
+
+        # Try AI-based tagging for text-based documents
         if self.client:
             try:
                 tags = self.generate_tags_ai(file_path, file_name, file_ext)
@@ -343,21 +364,31 @@ def generate_tags_ai(self, file_path: str, file_name: str, file_ext: str) -> Lis
 - Extension: {file_ext}
 - Size: {file_size} bytes
 
-Generate up to 10 relevant tags. For financial/visa documents, include specific types:
-- Immigration docs: I797, I20, visa, passport, greencard, EAD, etc.
-- Financial docs: bank statement, tax return, W2, 1099, invoice, receipt, etc.
-- General: document type, category, year if present
+Generate tags based ONLY on the filename and extension. Do NOT guess or hallucinate content you cannot see.
+
+Rules:
+- If the filename clearly indicates document type (e.g., "invoice", "resume", "I797"), use that
+- For financial/immigration docs with specific patterns, include appropriate tags
+- If filename is generic (e.g., "document.pdf", "file.txt"), return basic tags only
+- Do NOT invent or guess specific details not evident in the filename
+- Maximum 10 tags, minimum 2 tags
+
+For specific document types:
+- Immigration: I797, I20, visa, passport, greencard, EAD (only if in filename)
+- Financial: bank statement, tax return, W2, 1099, invoice, receipt (only if in filename)
+- General: document type from extension, year if present in filename
 
 Return ONLY a JSON array of strings.
-Example: ["document", "financial", "bank-statement", "2024", "chase", "checking"]"""
+Example for "invoice_jan2024.pdf": ["document", "pdf", "financial", "invoice", "2024"]
+Example for "document.pdf": ["document", "pdf"]"""
 
         response = self.client.chat(
             model=self.ollama_model,
             messages=[
-                {"role": "system", "content": "You are a file tagger. Return only JSON arrays."},
+                {"role": "system", "content": "You are a conservative file tagger. Only tag based on clear evidence in filename and extension. Never guess or hallucinate. Return only JSON arrays."},
                 {"role": "user", "content": prompt}
             ],
-            options={"temperature": 0.7, "num_predict": 150}
+            options={"temperature": 0.3, "num_predict": 150}
         )
 
         tags_json = response['message']['content'].strip()
@@ -367,8 +398,17 @@ def generate_tags_ai(self, file_path: str, file_name: str, file_ext: str) -> Lis
             if json_match:
                 tags_json = json_match.group(0)
 
-        tags = json.loads(tags_json)
-        return tags
+        try:
+            tags = json.loads(tags_json)
+            # Validate tags are reasonable
+            if not isinstance(tags, list) or len(tags) == 0:
+                return []
+            # Filter out any non-string tags
+            tags = [str(tag) for tag in tags if isinstance(tag, str) and tag.strip()]
+            return tags
+        except (json.JSONDecodeError, ValueError):
+            print(f"Failed to parse AI response: {tags_json}")
+            return []
 
     def generate_tags_heuristic(self, file_name: str, file_ext: str) -> List[str]:
         """Generate tags using heuristic rules (fallback)"""

diff --git a/requirements.txt b/requirements.txt
@@ -1,2 +1,3 @@
 ollama>=0.1.0
 python-dotenv>=1.0.0
+flask>=2.3.0