43 changes: 33 additions & 10 deletions README.md
@@ -2,26 +2,49 @@

Privacy-first document tagger using local AI. Automatically tags your files with up to 10 relevant tags.

## Quick Start

```bash
python setup.py # Interactive setup
python docsifter.py # Run the app
```

## Features

- 🔒 **100% Private** - All AI processing happens locally
- 🏷️ **Smart Tagging** - Up to 10 tags per file
- 🔍 **Search** - Find files by name or tags
- 💼 **Financial Docs** - Detects I797, bank statements, tax forms, etc.
- 💾 **Local Storage** - Tags saved in JSON database
- 🌐 **Modern Web UI** - Clean, responsive interface using HTML/CSS/JS
- 🎯 **Intelligent Filtering** - Processes only text-based documents with AI; uses heuristics for binary files

## Two UIs Available

### Web UI (Recommended)
Modern, responsive web interface with real-time updates:
```bash
python web_ui.py
# Then open http://localhost:5000 in your browser
```

### Desktop UI (Legacy)
Traditional tkinter-based desktop application:
```bash
python docsifter.py
```

## Quick Start

```bash
python setup.py # Interactive setup
python web_ui.py # Run the web app
```

## How It Works

1. Select folder (Documents, Desktop, Downloads)
2. Click "Scan & Tag Files"
3. Search your tagged files
1. Choose between Web UI (recommended) or Desktop UI
2. Select folder (Documents, Desktop, Downloads)
3. Click "Scan & Tag Files"
4. Search your tagged files

**Important Notes:**
- **Text-based documents** (PDF, Word, Excel, etc.) are tagged by the AI when Ollama is running
- **Binary files** (images, videos, audio, archives) get basic heuristic tags only, because the AI can't see their content
- The AI is conservative: it tags only on clear evidence in the filename and extension, not on the file's contents

Tags include:
- Document type (pdf, image, spreadsheet)
163 changes: 163 additions & 0 deletions USAGE.md
@@ -0,0 +1,163 @@
# DocSifter Usage Guide

## Quick Start

### Option 1: Web UI (Recommended)

```bash
# Start the web server
python web_ui.py

# Then open http://localhost:5000 in your browser
```

### Option 2: Desktop UI (Legacy)

```bash
# Start the tkinter app
python docsifter.py
```

## Key Features

### 1. Intelligent File Processing

**Text-based documents** (AI-enabled when Ollama is running):
- PDF, Word, Excel, PowerPoint documents
- Code files (Python, JavaScript, Java, etc.)
- Text files, JSON, YAML, Markdown
- HTML, CSS, configuration files

**Binary files** (Heuristic-only, no AI guessing):
- Images: JPG, PNG, GIF
- Videos: MP4, MOV
- Audio: MP3, WAV
- Archives: ZIP, RAR

### 2. Conservative AI Tagging

When Ollama is available, the AI:
- Only tags based on clear evidence in filename and extension
- Does NOT guess or hallucinate content
- Returns minimal tags for generic filenames (e.g., "document.pdf" → ["document", "pdf"])
- Returns detailed tags for specific filenames (e.g., "invoice_jan2024.pdf" → ["document", "pdf", "financial", "invoice", "2024"])

### 3. Web UI Features

- **Scan & Tag**: Select a folder and scan up to 20 files
- **Real-time Progress**: Visual progress bar with status messages
- **Search**: Type to filter by filename or tag (300ms debounce)
- **Status Display**: Shows AI availability and file count
- **Responsive Design**: Works on desktop and mobile

## API Endpoints

The web UI exposes these REST endpoints:

- `GET /api/status` - System status and configuration
- `POST /api/scan` - Start scanning a folder
- `GET /api/progress` - Get scan progress
- `GET /api/search?q=query` - Search tagged files
- `GET /api/files` - Get all tagged files
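
As a rough illustration of calling these endpoints from a script, here is a minimal sketch using only the standard library. It assumes the server is reachable at `http://localhost:5000` and that each endpoint returns JSON; the response shapes are not documented here, so the helper names `build_search_url` and `get_json` are hypothetical and the code makes no claims about the fields returned.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: default address from the Quick Start


def build_search_url(query: str, base_url: str = BASE_URL) -> str:
    """Build the /api/search URL with the query safely percent-encoded."""
    return f"{base_url}/api/search?{urllib.parse.urlencode({'q': query})}"


def get_json(url: str, timeout: float = 5.0):
    """GET a URL and decode the body as JSON (assumes the endpoint returns JSON)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the web UI running, `get_json(f"{BASE_URL}/api/status")` or `get_json(build_search_url("bank statement"))` should return whatever the server sends back for those routes.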

## Examples

### Example 1: Financial Documents

```
invoice_jan2024.pdf → ["document", "pdf", "financial", "invoice", "billing", "2024", "recent"]
bank_statement_2024.pdf → ["document", "pdf", "financial", "bank-statement", "banking", "2024", "recent"]
```

### Example 2: Immigration Documents

```
I797_approval.pdf → ["document", "pdf", "immigration", "i797", "uscis", "approval"]
passport_copy.pdf → ["document", "pdf", "immigration", "passport", "travel", "id"]
```

### Example 3: Binary Files (No AI Guessing)

```
photo.jpg → ["image", "photo"]
video.mp4 → ["video", "media"]
screenshot.png → ["image", "graphic", "screenshot"]
```

## Configuration

### Environment Variables

Create a `.env` file:

```bash
OLLAMA_MODEL=llama3.2
```
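
For reference, a setting like this is typically read from the process environment. The sketch below is an assumption about how that lookup might look, not the project's actual code: `get_ollama_model` and `DEFAULT_MODEL` are hypothetical names, and the fallback value simply mirrors the example above. In a real run, `load_dotenv()` from `python-dotenv` (listed in `requirements.txt`) would normally be called first so the `.env` values land in `os.environ`.

```python
import os

DEFAULT_MODEL = "llama3.2"  # assumption: fallback mirrors the example value above


def get_ollama_model() -> str:
    """Return the model name from OLLAMA_MODEL, or the fallback if unset or blank."""
    value = os.environ.get("OLLAMA_MODEL", "").strip()
    return value or DEFAULT_MODEL
```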

### Supported Folders

Default folders can be modified in the code:
- Documents
- Desktop
- Downloads

## Troubleshooting

### Ollama Not Available

If you see "⚠️ Heuristic Mode", Ollama is not running:

```bash
# Start Ollama (if installed)
ollama serve

# Or install Ollama
brew install ollama # macOS
# Visit https://ollama.ai for other platforms
```

### Search Not Working

The search feature:
- Has a 300ms debounce (wait after typing)
- Searches both filename and tags
- Is case-insensitive
- Returns empty results if no matches
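
The matching behaviour described above can be sketched as follows. This is an illustration only: it assumes the tag database maps each file name to a list of tag strings, and `search_tags` is a hypothetical helper rather than the function used by the web UI. The debounce happens in the browser and is not modelled here.

```python
from typing import Dict, List


def search_tags(database: Dict[str, List[str]], query: str) -> List[str]:
    """Return file names whose name or any tag contains the query, ignoring case."""
    needle = query.strip().lower()
    if not needle:
        return []  # assumption: a blank query matches nothing
    return [
        name
        for name, tags in database.items()
        if needle in name.lower() or any(needle in tag.lower() for tag in tags)
    ]
```

If a query you expect to match returns nothing, checking it against the stored tags with a helper like this can show whether the problem is in the data or in the UI.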

### Tags Not Saved

Tags are saved to `tags_database.json` in the project directory. Make sure you have write permissions.
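
A quick way to check write access is sketched below. `can_write_database` is a hypothetical helper, and the default path assumes you run it from the project directory.

```python
import os


def can_write_database(path: str = "tags_database.json") -> bool:
    """Return True if the file at path can be written or created by the current user."""
    if os.path.exists(path):
        # The file already exists: it must itself be writable.
        return os.access(path, os.W_OK)
    # The file does not exist yet: its parent directory must exist and be writable.
    parent = os.path.dirname(os.path.abspath(path))
    return os.path.isdir(parent) and os.access(parent, os.W_OK)
```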

## Development

### Running Tests

```bash
# Test original UI
python test_docsifter.py

# Test web UI
python test_web_ui.py
```

### Adding New File Types

Edit the `is_text_based_file()` method in `web_ui.py` or `docsifter.py`:

```python
text_extensions = {
    '.pdf', '.doc', '.docx', '.txt',
    # Add your extension here
    '.your_extension'
}
```

### Customizing Tags

Edit `generate_tags_heuristic()` to add custom tagging rules:

```python
if 'your_keyword' in name_lower:
    tags.extend(['your', 'custom', 'tags'])
```
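
If you end up with several custom rules, a small lookup table may be easier to maintain than a chain of `if` statements. The sketch below is one possible approach, not part of DocSifter: `CUSTOM_RULES` and `apply_custom_rules` are hypothetical names, and the keywords are placeholders. It keeps the 10-tag cap used elsewhere in the project.

```python
from typing import Dict, List

# Hypothetical rule table: keyword found in the lowercased file name -> tags to add.
CUSTOM_RULES: Dict[str, List[str]] = {
    "invoice": ["financial", "invoice"],
    "passport": ["immigration", "passport"],
}


def apply_custom_rules(file_name: str, tags: List[str], limit: int = 10) -> List[str]:
    """Append tags for each matching keyword, skipping duplicates, capped at limit."""
    name_lower = file_name.lower()
    result = list(tags)  # copy so the caller's list is not modified
    for keyword, extra in CUSTOM_RULES.items():
        if keyword in name_lower:
            for tag in extra:
                if tag not in result:
                    result.append(tag)
    return result[:limit]
```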
60 changes: 50 additions & 10 deletions docsifter.py
@@ -316,12 +316,33 @@ def scan_and_tag_files(self, folder_path: str, folder_name: str):
        finally:
            self.progress.stop()

    def is_text_based_file(self, file_ext: str) -> bool:
        """Check if file is text-based (not binary like images, videos, audio)"""
        text_extensions = {
            '.pdf', '.doc', '.docx', '.txt', '.rtf', '.odt',
            '.xlsx', '.xls', '.csv', '.ods',
            '.pptx', '.ppt', '.odp',
            '.html', '.htm', '.xml', '.json', '.yaml', '.yml',
            '.md', '.markdown', '.rst',
            '.py', '.js', '.java', '.c', '.cpp', '.h', '.cs',
            '.go', '.rs', '.rb', '.php', '.swift', '.kt',
            '.css', '.scss', '.sass', '.less',
            '.sql', '.sh', '.bash', '.ps1',
            '.log', '.cfg', '.conf', '.ini', '.env'
        }
        return file_ext in text_extensions

    def generate_tags(self, file_path: str) -> List[str]:
        """Generate tags for a file using AI or fallback heuristics"""
        file_name = os.path.basename(file_path)
        file_ext = pathlib.Path(file_path).suffix.lower()

        # Try AI-based tagging if available
        # For binary files (images, videos, audio), only use basic heuristic tagging
        # Don't use AI as it can't read the content and would just guess
        if not self.is_text_based_file(file_ext):
            return self.generate_tags_heuristic(file_name, file_ext)[:10]

        # Try AI-based tagging for text-based documents
        if self.client:
            try:
                tags = self.generate_tags_ai(file_path, file_name, file_ext)
@@ -343,21 +364,31 @@ def generate_tags_ai(self, file_path: str, file_name: str, file_ext: str) -> Lis
- Extension: {file_ext}
- Size: {file_size} bytes

Generate up to 10 relevant tags. For financial/visa documents, include specific types:
- Immigration docs: I797, I20, visa, passport, greencard, EAD, etc.
- Financial docs: bank statement, tax return, W2, 1099, invoice, receipt, etc.
- General: document type, category, year if present
Generate tags based ONLY on the filename and extension. Do NOT guess or hallucinate content you cannot see.

Rules:
- If the filename clearly indicates document type (e.g., "invoice", "resume", "I797"), use that
- For financial/immigration docs with specific patterns, include appropriate tags
- If filename is generic (e.g., "document.pdf", "file.txt"), return basic tags only
- Do NOT invent or guess specific details not evident in the filename
- Maximum 10 tags, minimum 2 tags

For specific document types:
- Immigration: I797, I20, visa, passport, greencard, EAD (only if in filename)
- Financial: bank statement, tax return, W2, 1099, invoice, receipt (only if in filename)
- General: document type from extension, year if present in filename

Return ONLY a JSON array of strings.
Example: ["document", "financial", "bank-statement", "2024", "chase", "checking"]"""
Example for "invoice_jan2024.pdf": ["document", "pdf", "financial", "invoice", "2024"]
Example for "document.pdf": ["document", "pdf"]"""

        response = self.client.chat(
            model=self.ollama_model,
            messages=[
                {"role": "system", "content": "You are a file tagger. Return only JSON arrays."},
                {"role": "system", "content": "You are a conservative file tagger. Only tag based on clear evidence in filename and extension. Never guess or hallucinate. Return only JSON arrays."},
                {"role": "user", "content": prompt}
            ],
            options={"temperature": 0.7, "num_predict": 150}
            options={"temperature": 0.3, "num_predict": 150}
        )

        tags_json = response['message']['content'].strip()
@@ -367,8 +398,17 @@ def generate_tags_ai(self, file_path: str, file_name: str, file_ext: str) -> Lis
        if json_match:
            tags_json = json_match.group(0)

        tags = json.loads(tags_json)
        return tags
        try:
            tags = json.loads(tags_json)
            # Validate tags are reasonable
            if not isinstance(tags, list) or len(tags) == 0:
                return []
            # Filter out any non-string tags
            tags = [str(tag) for tag in tags if isinstance(tag, str) and tag.strip()]
            return tags
        except (json.JSONDecodeError, ValueError):
            print(f"Failed to parse AI response: {tags_json}")
            return []

    def generate_tags_heuristic(self, file_name: str, file_ext: str) -> List[str]:
        """Generate tags using heuristic rules (fallback)"""
1 change: 1 addition & 0 deletions requirements.txt
@@ -1,2 +1,3 @@
ollama>=0.1.0
python-dotenv>=1.0.0
flask>=2.3.0