Semantic Search CLI for PDF Files

This Rust-based command-line tool allows you to locally index your private PDF files and perform semantic searches on their content.

⚠️ Work in Progress

Please note that this CLI is still under active development and is not yet available as a pre-built binary. Users will need to build and run the project using Cargo commands.

Features

Index multiple PDF files for fast searching
Perform semantic searches on indexed content
Extract text from both PDF pages and embedded images
Cache index data for improved performance

Building and Running

To use this CLI in its current state:

Ensure you have Rust and Cargo installed on your system.
Install the tesseract package, which is required for parsing PDFs (the command below uses Homebrew):
```
brew install tesseract
```

Clone this repository:

git clone https://github.com/breakpointninja/semantic_search_cli.git
cd semantic_search_cli

To build the project:
```
cargo build
```
To run the CLI:
```
cargo run -- [arguments]
```

Usage

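The CLI offers two main commands: index and search. Until a pre-built binary is available, run them through Cargo by passing the arguments shown in the commands below after `cargo run --`.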

Indexing PDF Files

To index PDF files, use the following command:

semantic_search_cli index <FILES>...

Replace <FILES>... with the paths to the PDF files you want to index.

Searching Indexed PDFs

To search the indexed PDF files, use the following command:

semantic_search_cli search <QUERY>

Replace <QUERY> with your search query.

Technical Details

The tool caches the index in the user's local data directory for faster subsequent searches.
It runs on the ort crate (Rust bindings for ONNX Runtime) for fast vector embedding generation.
It uses the usearch crate to perform efficient vector search with an HNSW index.
Text extraction from PDF pages is handled by the pdfium-render crate.
The image crate is used to decode images embedded in PDFs so that their text can be extracted with Tesseract.
Search results and PDF file details are stored using SQLite via the rusqlite crate.
Embeddings for PDF content are generated using the fastembed crate.
The BAAI/bge-base-en-v1.5 embedding model is used to generate embeddings for search queries.
Chunking is implemented using a naive, brute-force approach with windowed embedding of overlapping chunks.
Search results are sorted by their distance from the query embedding.
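
The last point, sorting results by distance from the query embedding, can be illustrated with a small brute-force sketch. This is not the tool's actual code: the tool delegates the search to a usearch HNSW index, and the README does not say which distance metric is configured. Cosine distance is assumed here, and the names `cosine_distance` and `rank_by_distance` are hypothetical.

```rust
/// Cosine distance between two vectors: 1 - cos(angle between them).
/// Smaller values mean the vectors point in more similar directions.
/// ASSUMPTION: cosine is used for illustration; the README does not name the metric.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 1.0; // treat a zero vector as unrelated to everything
    }
    1.0 - dot / (norm_a * norm_b)
}

/// Scores every chunk embedding against the query and returns
/// (chunk index, distance) pairs ordered nearest first.
/// This is an exhaustive scan, unlike the approximate HNSW search the tool uses.
fn rank_by_distance(query: &[f32], chunks: &[Vec<f32>]) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = chunks
        .iter()
        .enumerate()
        .map(|(i, embedding)| (i, cosine_distance(query, embedding)))
        .collect();
    // total_cmp gives a total order on f32, so sorting never panics on NaN.
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored
}
```

Calling `rank_by_distance(&query_embedding, &chunk_embeddings)` yields the chunk indices in the order they would be shown to the user, with the closest match first.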

Limitations and Future Improvements

This tool is a first draft and is not optimized or production-ready.
Indexing is currently single-threaded due to thread safety limitations in the fastembed and pdfium-render crates.
The chunking algorithm is basic and could be improved for better performance and accuracy.
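
To make the chunking limitation concrete, here is a minimal sketch of what naive overlapping-window chunking can look like. It is an illustration under assumptions, not the project's implementation: splitting on whitespace-separated words, the function name `chunk_words`, and the `window` and `overlap` parameters are all hypothetical choices.

```rust
/// Splits `text` into chunks of at most `window` words, where each chunk
/// starts `window - overlap` words after the previous one, so consecutive
/// chunks share `overlap` words.
/// ASSUMPTION: word-based windows; the real tool may window on characters or tokens.
fn chunk_words(text: &str, window: usize, overlap: usize) -> Vec<String> {
    assert!(window > 0 && overlap < window, "overlap must be smaller than window");
    let words: Vec<&str> = text.split_whitespace().collect();
    let step = window - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + window).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // this window already reached the end of the text
        }
        start += step;
    }
    chunks
}
```

Because the window boundaries ignore sentence and paragraph structure, a chunk can begin or end mid-sentence, which is one reason a smarter chunking strategy could improve accuracy.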
