-
Notifications
You must be signed in to change notification settings - Fork 138
fix(schema): Use sample instead of find for schema sampling #580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 4 commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
39aa3ac
fix(schema): Use sample instead of find for schema sampling.
kmruiz 4219bd5
Merge branch 'main' into chore/use-sample-instead-of-find-for-schemas
kmruiz 8206acc
chore: use memory limits and support custom sample size
kmruiz 6b81bd3
chore: fix build
kmruiz e0324c2
chore: use custom isObjectEmpty that is O(1) instead of O(N). Importa…
kmruiz 8f31898
chore: Use a hardcoded constant for the maximum upper bounds instead …
kmruiz File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we want to limit the sample to
maxDocumentsPerQuery
- the way I interpreted this config option, it's dealing with the number of documents we'd be returning to the LLM, not necessarily the number of documents we're fetching internally - e.g. the LLM shouldn't care if we sample 50 or 1000 docs since it's only seeing the inferred schema anyway.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be another option, I just wanted to limit in case a model gets crazy and tries to query thousands and thousands of documents for sampling. $sample is a bit more expensive than just finding, so it's just for safety.
No strong opinion here by the way, we can have a specific hardcoded option for sample in a constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to a constant for the upper limit.