High-performance Rust implementation of nupunkt, a modern reimplementation of the Punkt sentence tokenizer optimized for high-precision legal and financial text processing. This project provides the same accurate sentence segmentation as the original Python nupunkt library, but runs roughly 3x faster thanks to Rust's efficiency.
Based on the research paper "Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary" (Bommarito et al., 2025).
- 🚀 High Performance: 30M+ characters/second (3x faster than Python nupunkt)
- 🎯 High Precision: 91.1% precision on legal text benchmarks
- ⚡ Runtime Adjustable: Tune precision/recall balance at inference time without retraining
- 📚 Legal-Optimized: Pre-trained model handles complex legal abbreviations and citations
- 🐍 Python API: Drop-in replacement for Python nupunkt with PyO3 bindings
- 🧵 Thread-Safe: Safe for parallel processing
```bash
# pip
pip install nupunkt-rs

# uv
uv pip install nupunkt-rs
```
Prerequisites:

- Python 3.11+
- Rust toolchain (install from [rustup.rs](https://rustup.rs))
- maturin (`pip install maturin`)
Clone and Install:

```bash
git clone https://github.com/alea-institute/nupunkt-rs.git
cd nupunkt-rs

# pip
pip install maturin
maturin develop --release

# uv
uvx maturin develop --release --uv
```
Most tokenizers fail on legal and financial text, breaking incorrectly at abbreviations like "v.", "U.S.", "Inc.", "Id.", and "Fed." This library is specifically optimized for high-precision tokenization of complex professional documents.
```python
import nupunkt_rs

# Real Supreme Court text with complex citations and abbreviations
legal_text = """As we explained in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 597 (1993), Rule 702's requirement that an expert's testimony pertain to "scientific knowledge" establishes a standard of evidentiary reliability. This Court addressed the application of this standard to technical, as opposed to scientific, expert testimony in Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999). There, we explained that the gatekeeping inquiry must be tied to the facts of a particular case. Id. at 150."""

# Most tokenizers would incorrectly break at "v.", "Inc.", "U.S.", "Co.", and "Id."
# nupunkt-rs handles all of these correctly:
sentences = nupunkt_rs.sent_tokenize(legal_text)

print(f"Correctly identified {len(sentences)} sentences:")
for i, sent in enumerate(sentences, 1):
    print(f"\n{i}. {sent}")

# Output:
# Correctly identified 3 sentences:
#
# 1. As we explained in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 597 (1993), Rule 702's requirement that an expert's testimony pertain to "scientific knowledge" establishes a standard of evidentiary reliability.
#
# 2. This Court addressed the application of this standard to technical, as opposed to scientific, expert testimony in Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999).
#
# 3. There, we explained that the gatekeeping inquiry must be tied to the facts of a particular case. Id. at 150.
```
The `precision_recall` parameter (0.0-1.0) gives you exact control over the precision/recall trade-off. For legal and financial documents, you typically want higher precision (0.3-0.5) to avoid breaking at abbreviations.
```python
# Longer legal text to show the impact
long_legal_text = """As we explained in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 597 (1993), Rule 702's requirement that an expert's testimony pertain to "scientific knowledge" establishes a standard of evidentiary reliability. This Court addressed the application of this standard to technical, as opposed to scientific, expert testimony in Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999). There, we explained that the gatekeeping inquiry must be tied to the facts of a particular case. Id. at 150. This Court further noted that Rule 702 was amended in response to Daubert and this Court's subsequent cases. See Fed. Rule Evid. 702, Advisory Committee Notes to 2000 Amendments. The amendment affirms the trial court's role as gatekeeper but provides that "all types of expert testimony present questions of admissibility for the trial court." Ibid. Consequently, whether the specific expert testimony on the question at issue focuses on specialized observations, the specialized translation of those observations into theory, a specialized theory itself, or the application of such a theory in a particular case, the expert's testimony often will rest "upon an experience confessedly foreign in kind to [the jury's] own." Hand, Historical and Practical Considerations Regarding Expert Testimony, 15 Harv. L. Rev. 40, 54 (1901). For this reason, the trial judge, in all cases of proffered expert testimony, must find that it is properly grounded, well-reasoned, and not speculative before it can be admitted. The trial judge must determine whether the testimony has "a reliable basis in the knowledge and experience of [the relevant] discipline." Daubert, 509 U. S., at 592."""

# Compare different precision levels
print(f"High recall (PR=0.1): {len(nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.1))} sentences")
print(f"Balanced (PR=0.5): {len(nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.5))} sentences")
print(f"High precision (PR=0.9): {len(nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.9))} sentences")

# Output:
# High recall (PR=0.1): 8 sentences
# Balanced (PR=0.5): 7 sentences
# High precision (PR=0.9): 5 sentences

# Show the actual sentences at balanced setting (recommended for legal text)
sentences = nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.5)
print("\nBalanced output (PR=0.5) - Recommended for legal documents:")
for i, sent in enumerate(sentences, 1):
    # Show that abbreviations are correctly preserved
    if "v." in sent or "U.S." in sent or "Id." in sent or "Fed." in sent:
        print(f"\n{i}. ✓ Correctly preserves legal abbreviations:")
        print(f"   {sent[:100]}...")
```
Recommended `precision_recall` settings (a small helper applying these presets is sketched after the list):
- Legal documents: 0.3-0.5 (preserves "v.", "Id.", "Fed.", "U.S.", "Inc.")
- Financial reports: 0.4-0.6 (preserves "Inc.", "Ltd.", "Q1", monetary abbreviations)
- Scientific papers: 0.4-0.6 (preserves "et al.", "e.g.", "i.e.", technical terms)
- General text: 0.5 (default, balanced)
- Social media: 0.1-0.3 (more aggressive breaking for informal text)
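If you handle mixed document types, one convenient pattern is to keep these recommendations in a small lookup table. A minimal sketch; the `PR_PRESETS` table and `tokenize_for_domain` helper are illustrative names, not part of the library:

```python
import nupunkt_rs

# Illustrative presets drawn from the recommendations above (hypothetical, not part of the library)
PR_PRESETS = {
    "legal": 0.4,       # midpoint of the 0.3-0.5 range
    "financial": 0.5,   # midpoint of the 0.4-0.6 range
    "scientific": 0.5,  # midpoint of the 0.4-0.6 range
    "general": 0.5,     # library default
    "social": 0.2,      # midpoint of the 0.1-0.3 range
}

def tokenize_for_domain(text: str, domain: str = "general") -> list[str]:
    """Tokenize text with the precision/recall preset for its domain."""
    return nupunkt_rs.sent_tokenize(text, precision_recall=PR_PRESETS[domain])

print(tokenize_for_domain("See Smith v. Jones, 123 U.S. 456 (2020).", "legal"))
```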
For documents with multiple paragraphs, you can tokenize at both paragraph and sentence levels:
```python
import nupunkt_rs

text = """First paragraph with legal citations.
See Smith v. Jones, 123 U.S. 456 (2020).

Second paragraph with more detail.
The court in Id. at 457 stated clearly."""

# Get paragraphs as lists of sentences
paragraphs = nupunkt_rs.para_tokenize(text)
print(f"Found {len(paragraphs)} paragraphs")
# Each paragraph is a list of properly segmented sentences

# Or get paragraphs as joined strings
paragraphs_joined = nupunkt_rs.para_tokenize_joined(text)
# Each paragraph is a single string with sentences joined
```
```python
import nupunkt_rs

# Create a tokenizer with the default model
tokenizer = nupunkt_rs.create_default_tokenizer()

# Default (0.5) - balanced mode
text = "The meeting is at 5 p.m. tomorrow. We'll discuss Q4."
print(tokenizer.tokenize(text))
# Output: ['The meeting is at 5 p.m. tomorrow.', "We'll discuss Q4."]

# High recall (0.1) - more breaks, may split at abbreviations
tokenizer.set_precision_recall_balance(0.1)
print(tokenizer.tokenize(text))
# May split after "p.m."

# High precision (0.9) - fewer breaks, preserves abbreviations
tokenizer.set_precision_recall_balance(0.9)
print(tokenizer.tokenize(text))
# Won't split after "p.m."
```
```python
import nupunkt_rs

# Process multiple documents efficiently
documents = [
    "First doc. Two sentences.",
    "Second document here.",
    "Third doc. Also two sentences.",
]

# Use list comprehension for batch processing
all_sentences = [nupunkt_rs.sent_tokenize(doc) for doc in documents]
print(all_sentences)
# Output: [['First doc.', 'Two sentences.'], ['Second document here.'], ['Third doc.', 'Also two sentences.']]
```
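Because the tokenizer is thread-safe, large batches can also be fanned out across worker threads. A minimal sketch using the standard library (threads only help if the Rust core releases the GIL during tokenization; if throughput does not improve on your build, `ProcessPoolExecutor` is the drop-in alternative):

```python
from concurrent.futures import ThreadPoolExecutor

import nupunkt_rs

documents = [
    "First doc. Two sentences.",
    "Second document here.",
    "Third doc. Also two sentences.",
]

# Tokenize documents in parallel; results come back in input order
with ThreadPoolExecutor(max_workers=4) as pool:
    all_sentences = list(pool.map(nupunkt_rs.sent_tokenize, documents))
```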
```python
import nupunkt_rs

# Get sentence boundaries as character positions
tokenizer = nupunkt_rs.create_default_tokenizer()
text = "First sentence. Second sentence."
spans = tokenizer.tokenize_spans(text)
print(spans)
# Output: [(0, 15), (16, 32)]

# Extract sentences using spans
for start, end in spans:
    print(f"'{text[start:end]}'")
# Output: 'First sentence.' 'Second sentence.'
```
```bash
# Quick tokenization with default model
echo "Dr. Smith arrived. He was late." | nupunkt tokenize

# Adjust precision/recall from command line
nupunkt tokenize --pr-balance 0.8 "Your text here."

# Process a file
nupunkt tokenize --input document.txt --output sentences.txt
```
Get detailed insights into why breaks occur or don't occur:
```python
# Get detailed analysis of each token
analysis = tokenizer.analyze_tokens(text)
for token in analysis.tokens:
    if token.has_period:
        print(f"Token: {token.text}")
        print(f"  Break decision: {token.decision}")
        print(f"  Confidence: {token.confidence:.2f}")

# Explain a specific position
explanation = tokenizer.explain_decision(text, 28)  # Position of period after "Dr."
print(explanation)
```
```python
# Get character positions instead of text
spans = tokenizer.tokenize_spans(text)
# Returns: [(start1, end1), (start2, end2), ...]
for start, end in spans:
    print(f"Sentence: {text[start:end]}")
```
For domain-specific text, you can train your own model:
```python
import nupunkt_rs

trainer = nupunkt_rs.Trainer()

# Optional: Load domain-specific abbreviations
trainer.load_abbreviations_from_json("legal_abbreviations.json")

# Train on your corpus
params = trainer.train(your_text_corpus, verbose=True)

# Save model for reuse
params.save("my_model.npkt.gz")

# Load and use later
params = nupunkt_rs.Parameters.load("my_model.npkt.gz")
tokenizer = nupunkt_rs.SentenceTokenizer(params)
```
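After training, it is worth sanity-checking the custom model against the default on a few representative snippets. A minimal sketch, assuming `my_model.npkt.gz` was saved as above (the sample text is illustrative):

```python
import nupunkt_rs

sample = "See Fed. R. Civ. P. 12(b)(6). The motion was denied."

default_tok = nupunkt_rs.create_default_tokenizer()
custom_tok = nupunkt_rs.SentenceTokenizer(nupunkt_rs.Parameters.load("my_model.npkt.gz"))

# Compare how the two models segment the same snippet
print("default:", default_tok.tokenize(sample))
print("custom: ", custom_tok.tokenize(sample))
```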
Benchmarks on commodity hardware (Linux, Intel x86_64):
| Text Size | Processing Time | Speed |
|---|---|---|
| 1 KB | < 0.1 ms | ~10 MB/s |
| 100 KB | ~3 ms | ~30 MB/s |
| 1 MB | ~33 ms | ~30 MB/s |
| 10 MB | ~330 ms | ~30 MB/s |
Throughput stays roughly constant once inputs exceed a few kilobytes, at approximately 30 million characters per second.
Memory usage is minimal: the default model uses about 12 MB of RAM, compared to 85+ MB for NLTK's Punkt implementation.
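To reproduce rough throughput figures on your own hardware, a simple timing harness like the following works (the corpus construction is illustrative; numbers will vary by CPU and text):

```python
import time

import nupunkt_rs

# Build a ~10 MB corpus by repeating a sample sentence
sample = "The court in Smith v. Jones, 123 U.S. 456 (2020), held otherwise. " * 150_000

tokenizer = nupunkt_rs.create_default_tokenizer()
tokenizer.tokenize(sample[:100_000])  # warm-up run

start = time.perf_counter()
sentences = tokenizer.tokenize(sample)
elapsed = time.perf_counter() - start
print(f"{len(sample) / elapsed / 1e6:.1f}M chars/sec across {len(sentences)} sentences")
```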
- `sent_tokenize(text, model_params=None, precision_recall=None)` → List of sentences
  - `text`: The text to tokenize
  - `model_params`: Optional custom model parameters
  - `precision_recall`: Optional PR balance (0.0=recall, 1.0=precision, default=0.5)
- `para_tokenize(text, model_params=None, precision_recall=None)` → List of paragraphs (each as a list of sentences)
  - Same parameters as `sent_tokenize`
- `para_tokenize_joined(text, model_params=None, precision_recall=None)` → List of paragraphs (each as a single string)
  - Same parameters as `sent_tokenize`
- `create_default_tokenizer()` → Returns a `SentenceTokenizer` with the default model
- `load_default_model()` → Returns the default `Parameters`
- `train_model(text, verbose=False)` → Train a new model on text
- `SentenceTokenizer`: The main class for tokenizing text
  - `tokenize(text)` → List of sentences
  - `tokenize_spans(text)` → List of (start, end) positions
  - `tokenize_paragraphs(text)` → List of paragraphs (each as a list of sentences)
  - `tokenize_paragraphs_flat(text)` → List of paragraphs (each as a single string)
  - `set_precision_recall_balance(0.0-1.0)` → Adjust behavior
  - `analyze_tokens(text)` → Detailed token analysis
  - `explain_decision(text, position)` → Explain the break decision at a position
- `Parameters`: Model parameters
  - `save(path)` → Save model to disk (compressed)
  - `load(path)` → Load model from disk
- `Trainer`: For training custom models (advanced users only)
  - `train(text, verbose=False)` → Train on a text corpus
  - `load_abbreviations_from_json(path)` → Load custom abbreviations
```bash
# Rust tests
cargo test

# Python tests
pytest python/tests/

# With coverage
cargo tarpaulin
pytest --cov=nupunkt_rs

# Format code
cargo fmt
black python/

# Lint
cargo clippy -- -D warnings
ruff check python/

# Type checking
mypy python/

# Rust docs
cargo doc --open

# Python docs
cd docs && make html
```
We welcome contributions! Please see CONTRIBUTING.md for guidelines. Areas where help is especially welcome:
- Additional language support
- Performance optimizations
- More abbreviation lists
- Documentation improvements
- Test coverage expansion
MIT License - see LICENSE for details.
If you use nupunkt-rs in your research, please cite the original nupunkt paper:
```bibtex
@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}
```
For the Rust implementation specifically:
```bibtex
@software{nupunkt-rs,
  title = {nupunkt-rs: High-performance Rust implementation of nupunkt},
  author = {ALEA Institute},
  year = {2025},
  url = {https://github.com/alea-institute/nupunkt-rs}
}
```
- Original Punkt algorithm by Kiss & Strunk (2006)
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]