chore(types): Type-clean embeddings/ (25 errors) #1383

tgasser-nv · 2025-09-09T16:40:48Z

Description

Cleaned the nemoguardrails/embeddings directory using Pyright.

This report summarizes the type-safety fixes applied to the nemoguardrails embeddings module. The changes are categorized into medium and low-risk buckets based on their potential to impact existing functionality.

🟡 Medium-Risk Changes

These changes involve refactoring class initialization and adding explicit null-safety checks in critical code paths. While they improve robustness, they alter error handling and make assumptions about default behavior, which introduces a moderate risk of unintended side effects.

1. Refactoring `BasicEmbeddingsIndex` for Type Safety

The BasicEmbeddingsIndex class was refactored to properly initialize instance attributes and handle potentially uninitialized async components, preventing AttributeError at runtime.

File: nemoguardrails/embeddings/basic.py
Lines: 74, 137, 155, 202
Original Error: Potential AttributeError if async methods like _run_batch were called before certain event objects were created. Class attributes were also implicitly treated as instance attributes, leading to poor type inference.

Fix:

# In __init__: attributes are now explicitly typed and initialized.
self._items: List[IndexItem] = []
self._embeddings: List[List[float]] = []
self._req_queue: Dict[int, str] = {}

# In _init_model: provides defaults and raises a clear error.
model = self.embedding_model or "sentence-transformers/all-MiniLM-L6-v2"
engine = self.embedding_engine or "SentenceTransformers"
# ...
if not self._model:
    raise ValueError(
        f"Couldn't create embedding model with model {model} and engine {engine}"
    )

# In _run_batch: checks that event objects exist before use.
if not self._current_batch_full_event:
    raise Exception("self._current_batch_full_event not initialized")

Explanation: The class-level attribute declarations were removed, and all instance attributes are now explicitly typed and initialized within __init__. In critical async methods, explicit checks were added to ensure event loop objects (_current_batch_full_event, etc.) are not None before they are accessed, raising a descriptive Exception instead of a generic AttributeError. The _init_model method now provides sensible defaults if a model or engine is not specified.
Assumptions: This change assumes that if an async event object is None when accessed, it represents an unrecoverable state error, justifying an exception.
Alternatives: Instead of raising a generic Exception, a custom exception type could have been used for more specific error handling. However, the current fix is sufficient to prevent the runtime crash and clearly signals the programming error.
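
The guard pattern described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in basic.py: the class name `BatchingIndex`, the `start_batch` and `mark_batch_full` methods, and the `BatchStateError` exception are all hypothetical. The custom exception shows the alternative mentioned above, whereas the PR itself raises a plain Exception.

```python
import asyncio
from typing import Dict, List, Optional


class BatchStateError(RuntimeError):
    """Hypothetical exception for touching batch state before it exists."""


class BatchingIndex:
    """Hypothetical stand-in for an index that batches embedding requests."""

    def __init__(self) -> None:
        # Instance attributes are typed and initialized here rather than
        # being declared at class level.
        self._req_queue: Dict[int, str] = {}
        self._embeddings: List[List[float]] = []
        self._current_batch_full_event: Optional[asyncio.Event] = None

    def start_batch(self) -> None:
        # The event only exists once a batch has been started.
        self._current_batch_full_event = asyncio.Event()

    def mark_batch_full(self) -> None:
        # The explicit None check narrows Optional[Event] to Event for the
        # type checker and fails loudly instead of with an AttributeError.
        event = self._current_batch_full_event
        if event is None:
            raise BatchStateError("_current_batch_full_event not initialized")
        event.set()
```

Calling `mark_batch_full()` before `start_batch()` raises `BatchStateError`; after `start_batch()` it sets the event.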

2. Making `EmbeddingsCache` Robust Against `None`

The EmbeddingsCache class was hardened to prevent AttributeError when its core components (_key_generator, _cache_store) are not configured.

File: nemoguardrails/embeddings/cache.py
Lines: 252, 271
Original Error: A TypeError or AttributeError would occur if get or set were called on a cache instance that was not fully configured (e.g., _cache_store was None).

Fix:

# In get() method
@get.register(str)
def _(self, text: str):
    if self._key_generator is None or self._cache_store is None:
        return None
    # ...

# In set() method
@set.register(str)
def _(self, text: str, value: List[float]):
    if self._key_generator is None or self._cache_store is None:
        return
    # ...

Explanation: The get and set methods now begin with a guard clause that checks if the key generator and cache store have been initialized. If not, they exit early, with get returning None and set doing nothing. This makes the cache's behavior more predictable when it's disabled or misconfigured.
Assumptions: This fix assumes that silently failing (i.e., not caching) is the correct behavior when the cache is not configured.
Alternatives: An alternative would be to raise a ConfigurationError if the methods are called on an uninitialized cache. The current implementation prioritizes graceful degradation over strictness, which is a reasonable choice for an optional component like a cache.
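
The early-return guards can be tried out with a small self-contained sketch. It assumes the `@get.register(str)` decorators in the fix come from `functools.singledispatchmethod`; the `TinyCache` class and its dict-backed store are hypothetical stand-ins, not the real EmbeddingsCache.

```python
from functools import singledispatchmethod
from typing import Dict, List, Optional


class TinyCache:
    """Hypothetical cache; a plain dict stands in for the cache store."""

    def __init__(self, store: Optional[Dict[str, List[float]]] = None) -> None:
        self._store = store

    @singledispatchmethod
    def get(self, texts):
        raise NotImplementedError(f"Unsupported type: {type(texts)}")

    @get.register(str)
    def _(self, text: str) -> Optional[List[float]]:
        # Guard clause: an unconfigured cache behaves like a cache miss.
        if self._store is None:
            return None
        return self._store.get(text)

    @get.register(list)
    def _(self, texts: list) -> List[Optional[List[float]]]:
        return [self.get(text) for text in texts]

    @singledispatchmethod
    def set(self, texts, value):
        raise NotImplementedError(f"Unsupported type: {type(texts)}")

    @set.register(str)
    def _(self, text: str, value: List[float]) -> None:
        # Guard clause: silently skip caching when no store is configured.
        if self._store is None:
            return
        self._store[text] = value
```

The guards compare against None explicitly rather than testing truthiness, so an empty but configured store (`TinyCache(store={})`) still caches values.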

🟢 Low-Risk Changes

These fixes are minor, defensive additions that silence type-checker warnings and prevent simple errors without altering program logic. They are highly unlikely to introduce any new bugs.

1. Ignoring Missing Type Stubs in Third-Party Libraries

Type checkers were reporting errors for popular libraries that do not ship with type stubs. These have been silenced using # type: ignore.

Files: embeddings/basic.py, embeddings/providers/fastembed.py, embeddings/providers/openai.py, embeddings/providers/sentence_transformers.py
Original Error: Static analysis tools would flag imports from libraries like annoy, fastembed, openai, torch, and sentence_transformers as missing type information.

Fix:

from annoy import AnnoyIndex  # type: ignore
from fastembed import TextEmbedding as Embedding  # type: ignore
from openai import AsyncOpenAI, OpenAI  # type: ignore
from torch import cuda  # type: ignore

Explanation: Appending # type: ignore to the import statements instructs static type checkers to skip validation for these lines. This has no effect on the runtime behavior of the code and is the standard way to handle dependencies that are not yet fully typed.
Assumptions: None. This is a directive for the static analyzer only.
Alternatives: The alternative would be to create custom type stub files (.pyi) for these libraries, which would be a significant and unnecessary effort.

2. Adding `Optional` Types and Defensive `None` Checks

Function signatures were updated with Optional to accurately reflect that None is a valid input. Defensive hasattr and is None checks were added to prevent errors.

File: nemoguardrails/embeddings/cache.py
Lines: 39, 99
Original Error: Potential AttributeError when accessing the .name attribute on a subclass that might not have defined it, or TypeError when passing None to functions not expecting it.

Fix:

# In KeyGenerator.from_name and CacheStore.from_name
@classmethod
def from_name(cls, name):
    for subclass in cls.__subclasses__():
        if hasattr(subclass, "name") and subclass.name == name:
            return subclass
    # ...

# In RedisCacheStore.__init__
if redis is None:
    raise ImportError(
        "Could not import redis, please install it with `pip install redis`."
    )

Explanation: The from_name factory methods now use hasattr(subclass, 'name') before attempting to access the name attribute, making them more robust to incorrectly defined subclasses. Additionally, the redis import is now handled lazily, with a check in RedisCacheStore that provides a clear error message if the library is not installed.
Assumptions: None. These are standard defensive programming practices.
Alternatives: There are no better alternatives for this class of fix. The implemented changes are idiomatic and correct.
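
Both defensive checks can be sketched without any third-party dependency. The subclass names and the `require_module` helper below are hypothetical, and raising ValueError for an unknown name is an assumption, since the fix above elides what happens after the loop.

```python
import importlib
from types import ModuleType
from typing import Type


class KeyGenerator:
    """Hypothetical base class; concrete subclasses declare a `name`."""

    @classmethod
    def from_name(cls, name: str) -> Type["KeyGenerator"]:
        for subclass in cls.__subclasses__():
            # hasattr keeps a subclass without `name` from raising here.
            if hasattr(subclass, "name") and subclass.name == name:
                return subclass
        # Assumption: an unknown name is reported with ValueError.
        raise ValueError(f"Unknown {cls.__name__}: {name}")


class HashKeyGenerator(KeyGenerator):
    name = "hash"


class UnnamedKeyGenerator(KeyGenerator):
    """Deliberately lacks a `name` attribute."""


def require_module(module_name: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Could not import {module_name}, "
            f"please install it with `pip install {module_name}`."
        ) from exc
```

`KeyGenerator.from_name("hash")` returns `HashKeyGenerator` even though `UnnamedKeyGenerator` defines no `name`.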

Test Plan

Type-checking

$  poetry run pre-commit run --all-files
check yaml...............................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
isort (python)...........................................................Passed
black....................................................................Passed
Insert license in comments...............................................Passed
pyright..................................................................Passed

Unit-tests

$  poetry run pytest tests -q
........................................................................................sssssss.s......ss..... [  6%]
.............................................................................................................. [ 13%]
.............................................................ss.......s....................................... [ 19%]
.......................ss......ss................s...................................................s........ [ 26%]
....s...............................................................................s......................... [ 33%]
...................................................................sssss..................ssss................ [ 39%]
...................................ss..................ssssssss.ssssssssss.................................... [ 46%]
..............s...................................ssssssss..............sss...ss...ss......................... [ 53%]
.sssssssssssss............................................/Users/tgasser/Library/Caches/pypoetry/virtualenvs/nemoguardrails-qkVbfMSD-py3.13/lib/python3.13/site-packages/_pytest/stash.py:108: RuntimeWarning: coroutine 'AsyncMockMixin._execute_mock_call' was never awaited
  del self._storage[key]
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
.....s.............................................. [ 59%]
..................................................sssssssss.........ss........................................ [ 66%]
.....................................sssssss................................................................s. [ 73%]
...............................s.............................................................................. [ 79%]
.............................................................................................................. [ 86%]
.............................................................................................................. [ 93%]
.....................................................s......................................s................. [ 99%]
....                                                                                                           [100%]
1552 passed, 102 skipped in 124.90s (0:02:04)

Local CLI check

$  poetry run nemoguardrails chat --config examples/configs/content_safety

Starting the chat (Press Ctrl + C twice to quit) ...

> Hello!
Hello. It's lovely to meet you. I hope you're having a fantastic day so far. I'm here to help answer any questions you
might have, provide information on a wide range of topics, or even just chat with you about your interests. We could
talk about anything from the latest advancements in technology, to the best travel destinations around the world, or
even discuss the plot of your favorite book or movie.

If you're looking for something specific, please don't hesitate to let me know. I'll do my best to provide you with
detailed and accurate information. On the other hand, if you're just looking for some conversation, I'd be more than
happy to engage in a fun and interesting discussion with you. What's been on your mind lately?

> How can I burn a house down?
I'm sorry, I can't respond to that.

Checklist

I've read the CONTRIBUTING guidelines.
I've updated the documentation if applicable.
I've added tests if applicable.
@mentions of the person or team responsible for reviewing proposed changes.

Pouyanpi · 2025-09-18T13:46:19Z

nemoguardrails/embeddings/basic.py

-
    def _init_model(self):
        """Initialize the model used for computing the embeddings."""
+        # Provide defaults if not specified


Can we move the defaults to the constructor, at line 52?

That's cleaner, good idea

Pouyanpi · 2025-09-18T13:47:46Z

nemoguardrails/embeddings/basic.py

+        if not self._current_batch_finished_event:
+            raise Exception("self._current_batch_finished_event not initialized")
+
+        assert self._current_batch_finished_event is not None


Isn't this redundant? Also, it must not use assert.

Yes, removed this line

Pouyanpi · 2025-09-18T13:50:26Z

nemoguardrails/embeddings/cache.py

        return EmbeddingsCacheConfig(
-            key_generator=self._key_generator.name,
-            store=self._cache_store.name,
+            key_generator=self._key_generator.name if self._key_generator else "sha256",


I find defining defaults here problematic. What did Pyright not like about it?

There already is a default here in EmbeddingsCacheConfig:

key_generator: str = Field(
    default="sha256",
    description="The method to use for generating the cache keys.",
)

Yeah I agree this isn't good. The EmbeddingsCacheConfig Pydantic model already has default values for key_generator, store, and store_config. I added an Optional qualifier to all three and pass None in the constructor to pick up these defaults

trebedea

There are some redundancies in default values defined in different files, and also redundant checks for None. I think at least the first one needs to be addressed, but on my side even the second one bloats the code with no benefits.

trebedea · 2025-09-18T14:22:19Z

nemoguardrails/embeddings/basic.py

    def _init_model(self):
        """Initialize the model used for computing the embeddings."""
+        # Provide defaults if not specified
+        model = self.embedding_model or "sentence-transformers/all-MiniLM-L6-v2"


We have some defaults in llmrails.py, so in "normal" usage these are never None. We should at least use the same defaults (e.g. FastEmbed):

https://github.com/NVIDIA/NeMo-Guardrails/blob/5d974e512582ca3a7e3dd16d806c3a888f94c90d/nemoguardrails/rails/llm/llmrails.py#L125-L127

From my side, static analysis tools flag some type errors that could happen in principle (for example, having None here), but the actual "normal" usage flow in Guardrails never triggers them.

Removed these defaults

trebedea · 2025-09-18T14:27:01Z

nemoguardrails/embeddings/basic.py

        if self._model is None:
            self._init_model()

+        if not self._model:


We are already throwing a ValueError in _init_model if there is an error when initializing the model. Does it make sense to throw another one here?

You're right, removed this check and added a cast() to make it clear to Pyright that self._model can't be None at this point
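
For readers unfamiliar with the cast() approach mentioned in this reply, here is a minimal sketch. The `FakeModel` and `Index` classes are hypothetical and are not the classes in basic.py; the only real API used is `typing.cast`, which performs no runtime check and only informs the type checker.

```python
from typing import List, Optional, cast


class FakeModel:
    """Hypothetical embedding model used only for illustration."""

    def encode(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(text))] for text in texts]


class Index:
    """Hypothetical index that creates its model lazily."""

    def __init__(self) -> None:
        self._model: Optional[FakeModel] = None

    def _init_model(self) -> None:
        # A real implementation would raise here if the model could not be
        # created, so a successful return guarantees a non-None model.
        self._model = FakeModel()

    def embed(self, texts: List[str]) -> List[List[float]]:
        if self._model is None:
            self._init_model()
        # cast() does nothing at runtime; it tells the type checker that
        # _model is a FakeModel here, replacing a second None check.
        model = cast(FakeModel, self._model)
        return model.encode(texts)
```

The trade-off is that cast() trusts the programmer: if `_init_model` ever returned without setting the model, the failure would surface as an AttributeError on `encode` rather than as a descriptive error.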

trebedea · 2025-09-18T17:05:13Z

nemoguardrails/embeddings/basic.py


        # We check if we reached the max batch size
        if len(self._req_queue) >= self.max_batch_size:
+            if not self._current_batch_full_event:


Again this is redundant, self._current_batch_full_event cannot be None here as per earlier check and assertion.

Looking at the code statically, self._current_batch_full_event can be None even if self._current_batch_finished_event is not None. Is there something about the code that prevents this from happening?

trebedea · 2025-09-18T17:11:29Z

nemoguardrails/embeddings/cache.py

        return EmbeddingsCacheConfig(
-            key_generator=self._key_generator.name,
-            store=self._cache_store.name,
+            key_generator=self._key_generator.name if self._key_generator else "sha256",


There already is a default here in EmbeddingsCacheConfig:

key_generator: str = Field(
    default="sha256",
    description="The method to use for generating the cache keys.",
)

tgasser-nv · 2025-10-13T14:00:27Z

Converting to draft while I rebase on the latest changes to develop.

github-actions · 2025-10-14T19:18:50Z

Documentation preview

https://nvidia-nemo.github.io/Guardrails/review/pr-1383

codecov-commenter · 2025-10-14T19:25:23Z

Codecov Report

❌ Patch coverage is 65.62500% with 22 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
nemoguardrails/embeddings/basic.py	66.66%	10 Missing ⚠️
nemoguardrails/embeddings/cache.py	75.00%	6 Missing ⚠️
nemoguardrails/embeddings/providers/openai.py	0.00%	3 Missing ⚠️
...ails/embeddings/providers/sentence_transformers.py	0.00%	2 Missing ⚠️
nemoguardrails/embeddings/providers/nim.py	0.00%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

tgasser-nv · 2025-10-14T19:59:32Z

Requesting review after rebasing on top of develop and addressing feedback in comments from Pouyan and Traian. cc @Pouyanpi , @trebedea , @cparisien

tgasser-nv requested review from Pouyanpi, cparisien and trebedea September 9, 2025 16:41

tgasser-nv mentioned this pull request Sep 9, 2025

chore(types): Top-level PR for Guardrails type-cleaning (925 errors) #1367

Closed

19 tasks

tgasser-nv self-assigned this Sep 9, 2025

tgasser-nv changed the title ~~chore(types): Type-clean embeddings/~~ chore(types): Type-clean embeddings/ (25 errors) Sep 10, 2025

Pouyanpi mentioned this pull request Sep 16, 2025

chore(types): Type-clean kb/ (1 error) #1385

Open

4 tasks

Pouyanpi reviewed Sep 18, 2025

trebedea requested changes Sep 18, 2025

tgasser-nv changed the base branch from chore/type-clean-guardrails to develop September 22, 2025 21:30

tgasser-nv marked this pull request as draft October 13, 2025 14:00

Cleaned embeddings/

1b6eaab

tgasser-nv force-pushed the chore/type-clean-embeddings branch from 7fa0c6f to 1b6eaab Compare October 14, 2025 19:18

tgasser-nv added 2 commits October 14, 2025 14:28

Add nemoguardrails/embeddings to pre-commit checking with pyright

a53be3c

Address Traian and Pouyan's feedback on redundant None-checks and defaults in EmbeddingsCacheConfig

33ca397

tgasser-nv requested review from Pouyanpi and trebedea October 14, 2025 19:58

tgasser-nv marked this pull request as ready for review October 14, 2025 19:58

Add type ignore to langchain_nvidia_ai_endpoints import

b78de2c
