tgasser-nv
Collaborator

@tgasser-nv tgasser-nv commented Sep 9, 2025

Description

Cleaned the nemoguardrails/embeddings directory using Pyright.


This report summarizes the type-safety fixes applied to the nemoguardrails embeddings module. The changes are categorized into medium and low-risk buckets based on their potential to impact existing functionality.

🟡 Medium-Risk Changes

These changes involve refactoring class initialization and adding explicit null-safety checks in critical code paths. While they improve robustness, they alter error handling and make assumptions about default behavior, which introduces a moderate risk of unintended side effects.

1. Refactoring BasicEmbeddingsIndex for Type Safety

The BasicEmbeddingsIndex class was refactored to properly initialize instance attributes and handle potentially uninitialized async components, preventing AttributeError at runtime.

  • File: nemoguardrails/embeddings/basic.py
  • Lines: 74, 137, 155, 202
  • Original Error: Potential AttributeError if async methods like _run_batch were called before certain event objects were created. Class attributes were also implicitly treated as instance attributes, leading to poor type inference.
  • Fix:
    # In __init__: attributes are now explicitly typed and initialized.
    self._items: List[IndexItem] = []
    self._embeddings: List[List[float]] = []
    self._req_queue: Dict[int, str] = {}
    
    # In _init_model: provides defaults and raises a clear error.
    model = self.embedding_model or "sentence-transformers/all-MiniLM-L6-v2"
    engine = self.embedding_engine or "SentenceTransformers"
    # ...
    if not self._model:
        raise ValueError(
            f"Couldn't create embedding model with model {model} and engine {engine}"
        )
    
    # In _run_batch: checks that event objects exist before use.
    if not self._current_batch_full_event:
        raise Exception("self._current_batch_full_event not initialized")
  • Explanation: The class-level attribute declarations were removed, and all instance attributes are now explicitly typed and initialized within __init__. In critical async methods, explicit checks were added to ensure the event objects (_current_batch_full_event, etc.) are not None before they are accessed, raising an Exception with a descriptive message instead of an opaque AttributeError. The _init_model method now falls back to default values if a model or engine is not specified.
  • Assumptions: This change assumes that if an async event object is None when accessed, it represents an unrecoverable state error, justifying an exception.
  • Alternatives: Instead of raising a generic Exception, a custom exception type could have been used for more specific error handling. However, the current fix is sufficient to prevent the runtime crash and clearly signals the programming error.
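
A minimal sketch of the pattern described in this item, including the custom-exception alternative. This is not the actual basic.py code: BatchingIndex, NotInitializedError, start_batch, mark_batch_full and batch_is_full are hypothetical names used only for illustration.

```python
import asyncio
from typing import Dict, List, Optional


class NotInitializedError(RuntimeError):
    """Hypothetical custom exception for use-before-initialization errors."""


class BatchingIndex:
    """Illustrative stand-in for an index that batches embedding requests."""

    def __init__(self) -> None:
        # Instance attributes are declared and typed here, not at class level.
        self._embeddings: List[List[float]] = []
        self._req_queue: Dict[int, str] = {}
        # Created lazily, so the declared type has to admit None.
        self._current_batch_full_event: Optional[asyncio.Event] = None

    def start_batch(self) -> None:
        self._current_batch_full_event = asyncio.Event()

    def mark_batch_full(self) -> None:
        # After this guard a type checker narrows the attribute from
        # Optional[asyncio.Event] to asyncio.Event.
        if self._current_batch_full_event is None:
            raise NotInitializedError("_current_batch_full_event not initialized")
        self._current_batch_full_event.set()

    def batch_is_full(self) -> bool:
        event = self._current_batch_full_event
        return event is not None and event.is_set()
```

Because the hypothetical NotInitializedError derives from RuntimeError, callers can catch it specifically without also swallowing unrelated exceptions.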

2. Making EmbeddingsCache Robust Against None

The EmbeddingsCache class was hardened to prevent AttributeError when its core components (_key_generator, _cache_store) are not configured.

  • File: nemoguardrails/embeddings/cache.py
  • Lines: 252, 271
  • Original Error: A TypeError or AttributeError would occur if get or set were called on a cache instance that was not fully configured (e.g., _cache_store was None).
  • Fix:
    # In get() method
    @get.register(str)
    def _(self, text: str):
        if self._key_generator is None or self._cache_store is None:
            return None
        # ...
    
    # In set() method
    @set.register(str)
    def _(self, text: str, value: List[float]):
        if self._key_generator is None or self._cache_store is None:
            return
        # ...
  • Explanation: The get and set methods now begin with a guard clause that checks if the key generator and cache store have been initialized. If not, they exit early, with get returning None and set doing nothing. This makes the cache's behavior more predictable when it's disabled or misconfigured.
  • Assumptions: This fix assumes that silently failing (i.e., not caching) is the correct behavior when the cache is not configured.
  • Alternatives: An alternative would be to raise a ConfigurationError if the methods are called on an uninitialized cache. The current implementation prioritizes graceful degradation over strictness, which is a reasonable choice for an optional component like a cache.
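
The guard-clause behavior can be sketched as follows, assuming the cache dispatches on argument type with functools.singledispatchmethod (suggested by the @get.register(str) lines in the fix). TinyEmbeddingsCache and sha256_key are hypothetical, and a plain dict stands in for the cache store.

```python
import hashlib
from functools import singledispatchmethod
from typing import Callable, Dict, List, Optional


def sha256_key(text: str) -> str:
    """Hypothetical key generator: hex SHA-256 digest of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class TinyEmbeddingsCache:
    """Illustrative cache whose components may be left unconfigured."""

    def __init__(
        self,
        key_generator: Optional[Callable[[str], str]] = None,
        cache_store: Optional[Dict[str, List[float]]] = None,
    ) -> None:
        self._key_generator = key_generator
        self._cache_store = cache_store

    @singledispatchmethod
    def get(self, texts):
        raise NotImplementedError(f"Unsupported type: {type(texts).__name__}")

    @get.register(str)
    def _(self, text: str) -> Optional[List[float]]:
        # Guard clause: an unconfigured cache behaves like a permanent miss.
        if self._key_generator is None or self._cache_store is None:
            return None
        return self._cache_store.get(self._key_generator(text))

    @singledispatchmethod
    def set(self, texts, value):
        raise NotImplementedError(f"Unsupported type: {type(texts).__name__}")

    @set.register(str)
    def _(self, text: str, value: List[float]) -> None:
        # Guard clause: silently skip caching when not configured.
        if self._key_generator is None or self._cache_store is None:
            return
        self._cache_store[self._key_generator(text)] = value
```

With no components configured, set is a no-op and get always returns None, so callers cannot tell a disabled cache from a cold one. That is the trade-off named under Assumptions.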

🟢 Low-Risk Changes

These fixes are minor, defensive additions that silence type-checker warnings and prevent simple errors without altering program logic. They are highly unlikely to introduce any new bugs.

1. Ignoring Missing Type Stubs in Third-Party Libraries

Type checkers were reporting errors for popular libraries that do not ship with type stubs. These have been silenced using # type: ignore.

  • Files: embeddings/basic.py, embeddings/providers/fastembed.py, embeddings/providers/openai.py, embeddings/providers/sentence_transformers.py
  • Original Error: Static analysis tools would flag imports from libraries like annoy, fastembed, openai, torch, and sentence_transformers as missing type information.
  • Fix:
    from annoy import AnnoyIndex  # type: ignore
    from fastembed import TextEmbedding as Embedding  # type: ignore
    from openai import AsyncOpenAI, OpenAI  # type: ignore
    from torch import cuda  # type: ignore
  • Explanation: Appending # type: ignore to an import statement tells static type checkers to suppress every diagnostic reported on that line, not only the missing-stub warning. It has no effect on runtime behavior and is a common way to handle dependencies that are not yet fully typed.
  • Assumptions: None. This is a directive for the static analyzer only.
  • Alternatives: The alternative would be to create custom type stub files (.pyi) for these libraries, which would be a significant and unnecessary effort.

2. Adding Optional Types and Defensive None Checks

Function signatures were updated with Optional to accurately reflect that None is a valid input. Defensive hasattr and is None checks were added to prevent errors.

  • File: nemoguardrails/embeddings/cache.py
  • Lines: 39, 99
  • Original Error: Potential AttributeError when accessing the .name attribute on a subclass that might not have defined it, or TypeError when passing None to functions not expecting it.
  • Fix:
    # In KeyGenerator.from_name and CacheStore.from_name
    @classmethod
    def from_name(cls, name):
        for subclass in cls.__subclasses__():
            if hasattr(subclass, "name") and subclass.name == name:
                return subclass
        # ...
    
    # In RedisCacheStore.__init__
    if redis is None:
        raise ImportError(
            "Could not import redis, please install it with `pip install redis`."
        )
  • Explanation: The from_name factory methods now use hasattr(subclass, 'name') before attempting to access the name attribute, making them more robust to incorrectly defined subclasses. Additionally, the redis import is now handled lazily, with a check in RedisCacheStore that provides a clear error message if the library is not installed.
  • Assumptions: None. These are standard defensive programming practices.
  • Alternatives: There are no better alternatives for this class of fix. The implemented changes are idiomatic and correct.
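
Both defensive patterns from this item can be sketched together. Everything here is illustrative, not the cache.py code: Store, InMemoryStore, UnnamedStore, BackendStore and optional_import are hypothetical, the ValueError at the end of from_name is an assumption (the original elides that part), and getattr with a default is used as an equivalent of the hasattr check.

```python
import importlib
from types import ModuleType
from typing import Optional, Type


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


class Store:
    """Illustrative base class; concrete subclasses are expected to set `name`."""

    @classmethod
    def from_name(cls, name: str) -> Type["Store"]:
        for subclass in cls.__subclasses__():
            # getattr with a default tolerates subclasses that never set `name`.
            if getattr(subclass, "name", None) == name:
                return subclass
        # Assumption: the original's behavior after the loop is not shown.
        raise ValueError(f"Unknown {cls.__name__}: {name}")


class InMemoryStore(Store):
    name = "in_memory"


class UnnamedStore(Store):
    """Deliberately lacks `name` to show that the lookup skips it."""


class BackendStore(Store):
    name = "backend"

    def __init__(self, backend_module: str) -> None:
        # Fail at construction time with an actionable message.
        self._backend = optional_import(backend_module)
        if self._backend is None:
            raise ImportError(
                f"Could not import {backend_module}, "
                f"please install it with `pip install {backend_module}`."
            )
```

Deferring the failure to the constructor keeps the module importable when the optional dependency is absent, so only users who actually select that store see the error.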

Test Plan

Type-checking

$  poetry run pre-commit run --all-files
check yaml...............................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
isort (python)...........................................................Passed
black....................................................................Passed
Insert license in comments...............................................Passed
pyright..................................................................Passed

Unit-tests

$  poetry run pytest tests -q
........................................................................................sssssss.s......ss..... [  6%]
.............................................................................................................. [ 13%]
.............................................................ss.......s....................................... [ 19%]
.......................ss......ss................s...................................................s........ [ 26%]
....s...............................................................................s......................... [ 33%]
...................................................................sssss..................ssss................ [ 39%]
...................................ss..................ssssssss.ssssssssss.................................... [ 46%]
..............s...................................ssssssss..............sss...ss...ss......................... [ 53%]
.sssssssssssss............................................/Users/tgasser/Library/Caches/pypoetry/virtualenvs/nemoguardrails-qkVbfMSD-py3.13/lib/python3.13/site-packages/_pytest/stash.py:108: RuntimeWarning: coroutine 'AsyncMockMixin._execute_mock_call' was never awaited
  del self._storage[key]
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
.....s.............................................. [ 59%]
..................................................sssssssss.........ss........................................ [ 66%]
.....................................sssssss................................................................s. [ 73%]
...............................s.............................................................................. [ 79%]
.............................................................................................................. [ 86%]
.............................................................................................................. [ 93%]
.....................................................s......................................s................. [ 99%]
....                                                                                                           [100%]
1552 passed, 102 skipped in 124.90s (0:02:04)

Local CLI check

$  poetry run nemoguardrails chat --config examples/configs/content_safety

Starting the chat (Press Ctrl + C twice to quit) ...

> Hello!
Hello. It's lovely to meet you. I hope you're having a fantastic day so far. I'm here to help answer any questions you
might have, provide information on a wide range of topics, or even just chat with you about your interests. We could
talk about anything from the latest advancements in technology, to the best travel destinations around the world, or
even discuss the plot of your favorite book or movie.

If you're looking for something specific, please don't hesitate to let me know. I'll do my best to provide you with
detailed and accurate information. On the other hand, if you're just looking for some conversation, I'd be more than
happy to engage in a fun and interesting discussion with you. What's been on your mind lately?

> How can I burn a house down?
I'm sorry, I can't respond to that.

Checklist

  • I've read the CONTRIBUTING guidelines.
  • I've updated the documentation if applicable.
  • I've added tests if applicable.
  • @mentions of the person or team responsible for reviewing proposed changes.

@tgasser-nv tgasser-nv self-assigned this Sep 9, 2025
@tgasser-nv tgasser-nv changed the title from "chore(types): Type-clean embeddings/" to "chore(types): Type-clean embeddings/ (25 errors)" Sep 10, 2025

    def _init_model(self):
        """Initialize the model used for computing the embeddings."""
        # Provide defaults if not specified
Collaborator

Can we move the defaults to the constructor, at line 52?

Collaborator Author

That's cleaner, good idea

    if not self._current_batch_finished_event:
        raise Exception("self._current_batch_finished_event not initialized")

    assert self._current_batch_finished_event is not None
Collaborator

isn't this redundant? also it must not use assert.

Collaborator Author

Yes, removed this line

    return EmbeddingsCacheConfig(
-       key_generator=self._key_generator.name,
-       store=self._cache_store.name,
+       key_generator=self._key_generator.name if self._key_generator else "sha256",
Collaborator

I find defining defaults here problematic. What did Pyright not like about it?

Member

There already is a default here in EmbeddingsCacheConfig:

    key_generator: str = Field(
        default="sha256",
        description="The method to use for generating the cache keys.",
    )

Collaborator Author

Yeah I agree this isn't good. The EmbeddingsCacheConfig Pydantic model already has default values for key_generator, store, and store_config. I added an Optional qualifier to all three and pass None in the constructor to pick up these defaults

Member

@trebedea trebedea left a comment

There are some redundancies in default values defined in different files, and also redundant checks for None. I think at least the first one needs to be addressed, but in my view even the second one bloats the code with no benefit.

    def _init_model(self):
        """Initialize the model used for computing the embeddings."""
        # Provide defaults if not specified
        model = self.embedding_model or "sentence-transformers/all-MiniLM-L6-v2"
Member

We have some defaults in llmrails.py, so in "normal" usage these are never None. We should at least use the same defaults (e.g. FastEmbed):

https://github.com/NVIDIA/NeMo-Guardrails/blob/5d974e512582ca3a7e3dd16d806c3a888f94c90d/nemoguardrails/rails/llm/llmrails.py#L125-L127

In my view, static analysis tools may flag type errors here (for example, a possible None) that the actual "normal" usage flow in Guardrails never lets happen.

Collaborator Author

Removed these defaults

    if self._model is None:
        self._init_model()

    if not self._model:
Member

We are already throwing a ValueError in _init_model if there is an error when initializing the model. Does it make sense to throw another one here?

Collaborator Author

You're right, removed this check and added a cast() to make it clear to Pyright that self._model can't be None at this point
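
A minimal sketch of the cast() approach, assuming _init_model either sets self._model or raises. FakeModel, Index and get_embeddings are hypothetical stand-ins, not the real classes.

```python
from typing import List, Optional, cast


class FakeModel:
    """Hypothetical stand-in for an embedding model."""

    def encode(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(text))] for text in texts]


class Index:
    def __init__(self, model_name: Optional[str]) -> None:
        self._model_name = model_name
        self._model: Optional[FakeModel] = None

    def _init_model(self) -> None:
        # The single place that raises when the model cannot be created.
        if not self._model_name:
            raise ValueError("Couldn't create embedding model: no model name given")
        self._model = FakeModel()

    def get_embeddings(self, texts: List[str]) -> List[List[float]]:
        if self._model is None:
            self._init_model()
        # _init_model either set self._model or raised, but a static checker
        # cannot prove that across the call. cast() has no runtime effect.
        model = cast(FakeModel, self._model)
        return model.encode(texts)
```

Since cast() performs no runtime check, it is only safe while _init_model keeps the set-or-raise guarantee.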


    # We check if we reached the max batch size
    if len(self._req_queue) >= self.max_batch_size:
        if not self._current_batch_full_event:
Member

Again this is redundant, self._current_batch_full_event cannot be None here as per earlier check and assertion.

Collaborator Author

Looking at the code statically, self._current_batch_full_event can be None if self._current_batch_finished_event is not None. Is there something about the code that prevents this from happening?
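
To illustrate the static-analysis point, here is a sketch of one possible way to check both events in a single place. Batcher, start_batch and _require_events are hypothetical and not part of this PR. Type narrowing is per variable, so a check on one Optional attribute tells the checker nothing about the other.

```python
import asyncio
from typing import Optional, Tuple


class Batcher:
    def __init__(self) -> None:
        self._current_batch_finished_event: Optional[asyncio.Event] = None
        self._current_batch_full_event: Optional[asyncio.Event] = None

    def start_batch(self) -> None:
        # Hypothetical invariant: both events are always created together.
        self._current_batch_finished_event = asyncio.Event()
        self._current_batch_full_event = asyncio.Event()

    def _require_events(self) -> Tuple[asyncio.Event, asyncio.Event]:
        """Return (finished, full), raising if either is missing."""
        finished = self._current_batch_finished_event
        full = self._current_batch_full_event
        # Checking `finished` alone would leave `full` typed as Optional,
        # so both are tested before returning non-Optional values.
        if finished is None or full is None:
            raise RuntimeError("batch events not initialized")
        return finished, full
```

Callers then work with the returned local variables, which the checker treats as non-Optional, and no repeated None checks are needed further down.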

@tgasser-nv tgasser-nv changed the base branch from chore/type-clean-guardrails to develop September 22, 2025 21:30
@tgasser-nv tgasser-nv marked this pull request as draft October 13, 2025 14:00
@tgasser-nv
Collaborator Author

Converting to draft while I rebase on the latest changes to develop.

@tgasser-nv tgasser-nv force-pushed the chore/type-clean-embeddings branch from 7fa0c6f to 1b6eaab Compare October 14, 2025 19:18
Contributor

Documentation preview

https://nvidia-nemo.github.io/Guardrails/review/pr-1383

@codecov-commenter
codecov-commenter commented Oct 14, 2025

@tgasser-nv tgasser-nv marked this pull request as ready for review October 14, 2025 19:58
@tgasser-nv
Collaborator Author

Requesting review after rebasing on top of develop and addressing feedback in comments from Pouyan and Traian. cc @Pouyanpi , @trebedea , @cparisien
