This Rust-based command-line tool allows you to locally index your private PDF files and perform semantic searches on their content.
Please note that this CLI is still under active development and is not yet available as a pre-built binary. Users will need to build and run the project using Cargo commands.
- Index multiple PDF files for fast searching
- Perform semantic searches on indexed content
- Extract text from both PDF pages and embedded images
- Cache index data for improved performance
To use this CLI in its current state:
- Ensure you have Rust and Cargo installed on your system.
- Install the tesseract package for parsing PDFs (shown here with Homebrew):
    brew install tesseract
- Clone this repository:
    git clone https://github.com/breakpointninja/semantic_search_cli.git
    cd semantic_search_cli
- Build the project:
    cargo build
- Run the CLI:
    cargo run -- [arguments]
The CLI offers two main commands: index and search.
To index PDF files, use the following command:
    semantic_search_cli index <FILES>...
Replace <FILES>... with the paths to the PDF files you want to index.
To search the indexed PDF files, use the following command:
    semantic_search_cli search <QUERY>
Replace <QUERY> with your search query.
- The tool caches the index in the user's local data directory for faster subsequent searches.
- It runs on ONNX Runtime, through the ort crate, for fast vector embedding generation.
- It uses the usearch crate to perform efficient vector search operations using an HNSW index.
- Text extraction from PDF pages is handled by the pdfium-render crate.
- The image crate is used to extract text from images embedded in PDFs.
- Search results and PDF file details are stored using SQLite via the rusqlite crate.
- Embeddings for PDF content are generated using the fastembed crate.
- The BAAI/bge-base-en-v1.5 embedding model is used to generate embeddings for search queries.
- Chunking is implemented using a naive, brute-force approach with windowed embedding of overlapping chunks.
- Search results are sorted by their distance from the query embedding.
- This tool is a first draft and is not optimized or production-ready.
- Indexing is currently single-threaded due to thread safety limitations in the fastembed and pdfium-render crates.
- The chunking algorithm is basic and could be improved for better performance and accuracy.
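
To make the windowed chunking and distance-sorted ranking described above more concrete, here is a minimal sketch using only the Rust standard library. It is not the code from this repository: the function names (chunk_words, cosine_distance, rank_by_distance), the word-based window, the window and overlap sizes, and the choice of cosine distance are all assumptions made for illustration. The real tool generates embeddings with fastembed and searches them with a usearch HNSW index, whereas this sketch uses toy 2D vectors and a brute-force linear scan.

```rust
/// Split `text` into overlapping windows of words.
/// `window` is the chunk size in words; `overlap` is the number of words
/// shared by consecutive chunks. Both parameters are hypothetical.
fn chunk_words(text: &str, window: usize, overlap: usize) -> Vec<String> {
    assert!(window > 0 && overlap < window, "overlap must be smaller than window");
    let words: Vec<&str> = text.split_whitespace().collect();
    let step = window - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + window).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // the last window reached the end of the text
        }
        start += step;
    }
    chunks
}

/// Cosine distance (1 - cosine similarity). Returns 1.0 if either vector is all zeros.
/// The metric is an assumption; the tool's actual metric is not stated here.
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 1.0;
    }
    1.0 - dot / (norm_a * norm_b)
}

/// Score every embedding against the query and sort nearest first.
/// This is a linear scan, standing in for an approximate HNSW lookup.
fn rank_by_distance(query: &[f32], embeddings: &[Vec<f32>]) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = embeddings
        .iter()
        .enumerate()
        .map(|(i, e)| (i, cosine_distance(query, e)))
        .collect();
    scored.sort_by(|a, b| a.1.total_cmp(&b.1));
    scored
}

fn main() {
    // Windows of 4 words, each sharing 2 words with the previous one.
    for chunk in chunk_words("one two three four five six seven", 4, 2) {
        println!("{chunk}");
    }

    // Toy vectors standing in for real chunk embeddings.
    let embeddings = vec![vec![0.0, 1.0], vec![1.0, 0.0], vec![1.0, 1.0]];
    for (index, distance) in rank_by_distance(&[1.0, 0.0], &embeddings) {
        println!("chunk {index}: distance {distance:.3}");
    }
}
```

Overlapping windows mean a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text more than once. A smarter chunker could split on sentence or paragraph boundaries instead.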