fix(schema): Use sample instead of find for schema sampling #580
Pull Request Overview
This PR improves the schema sampling mechanism in the collection schema tool by replacing the biased find operation with the statistical $sample aggregation operation.
Key changes:
- Replaces find with aggregate using $sample to get truly random documents
- Increases sample size from 5 to 50 documents for better schema inference
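As a rough illustration of the first change, the sketch below builds the aggregation pipeline that a $sample-based lookup would pass to aggregate. The helper name buildSamplePipeline and its validation are assumptions made for this example and are not taken from the PR; only the $sample stage shape and the default of 50 come from the description above.

```typescript
// Shape of a MongoDB $sample aggregation stage.
type SampleStage = { $sample: { size: number } };

// Hypothetical helper (not from the PR): returns a one-stage pipeline that
// asks the server for `sampleSize` randomly selected documents.
function buildSamplePipeline(sampleSize: number = 50): SampleStage[] {
    // $sample expects a positive integer size, so reject anything else early.
    if (!Number.isInteger(sampleSize) || sampleSize <= 0) {
        throw new RangeError("sampleSize must be a positive integer");
    }
    return [{ $sample: { size: sampleSize } }];
}

// With the official Node.js driver, the pipeline would be run roughly as:
//   const docs = await collection.aggregate(buildSamplePipeline(50)).toArray();
const defaultPipeline = buildSamplePipeline();
```

The previous behaviour corresponds to a plain find with a limit of 5, which always returns the same leading documents instead of a random subset.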
Looks good, but should we use collectCursorLogic here to not exceed the memory limits?
📊 Accuracy Test Results📈 Summary
📊 Baseline Comparison
📎 Download Full HTML Report - Look for the Report generated on: 9/22/2025, 10:28:50 AM |
I actually believe Copilot's idea and yours are both good, so I'll add a sampleSize parameter and also use the new cursor logic. This way we can upgrade this tool to be as reliable as the others.
Pull Request Test Coverage Report for Build 17914435472Details
💛 - Coveralls |
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (2)
src/tools/mongodb/metadata/collectionSchema.ts:1
- There are trailing spaces after the commas on these lines. Remove the extra whitespace to maintain consistent formatting.
import type { CallToolResult } from "@modelcontextprotocol/sdk/types.js";
src/tools/mongodb/metadata/collectionSchema.ts:1
- There are trailing spaces after the commas on these lines. Remove the extra whitespace to maintain consistent formatting.
import type { CallToolResult } from "@modelcontextprotocol/sdk/types.js";
One question, otherwise looks good (once prettier is made happy).
 ): Promise<CallToolResult> {
     const provider = await this.ensureConnected();
-    const documents = await provider.find(database, collection, {}, { limit: 5 }).toArray();
+    const cursor = provider.aggregate(database, collection, [{ $sample: { size: Math.min(sampleSize, this.config.maxDocumentsPerQuery) } }]);
I wonder if we want to limit the sample to maxDocumentsPerQuery. The way I interpreted this config option, it deals with the number of documents we'd be returning to the LLM, not necessarily the number of documents we're fetching internally. For example, the LLM shouldn't care if we sample 50 or 1000 docs, since it only sees the inferred schema anyway.
It could be another option, I just wanted to limit in case a model gets crazy and tries to query thousands and thousands of documents for sampling. $sample is a bit more expensive than just finding, so it's just for safety.
No strong opinion here by the way, we can have a specific hardcoded option for sample in a constant.
Changed to a constant for the upper limit.
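A minimal sketch of what capping the requested sample size with a constant could look like. The name MAX_SAMPLE_SIZE, its value of 1000, and the helper clampSampleSize are assumptions for illustration; the actual constant and value used in the PR are not shown in this conversation.

```typescript
// Hypothetical upper bound for the number of sampled documents.
// The real constant name and value in the PR may differ.
const MAX_SAMPLE_SIZE = 1000;

// Clamps a requested sample size into the range [1, max].
// Fractions are rounded down; NaN falls back to the minimum of 1.
function clampSampleSize(requested: number, max: number = MAX_SAMPLE_SIZE): number {
    if (Number.isNaN(requested)) {
        return 1;
    }
    return Math.max(1, Math.min(Math.floor(requested), max));
}

// A model asking for an excessive sample is quietly limited to the cap.
const effectiveSize = clampSampleSize(250000);
```

Keeping the cap separate from maxDocumentsPerQuery follows the point raised in the review: the sampled documents are only used internally for schema inference and are not returned to the model.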
…nt for large schemas
…of config.maxDocumentsPerQuery
6fe785f to 8f31898
Proposed changes
Using find without a sort returns documents in natural order, which is not reliable for sampling because it depends on when each document was last updated. With find, the result is biased towards the latest documents in the database, which might not even be up to date.
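To make the bias concrete, here is a small self-contained sketch. The inferSchema function is a simplified stand-in written for this example, not the schema inference the tool actually uses, and the in-memory array stands in for a collection read in natural order. Picking every other document is used as a deterministic stand-in for a random sample.

```typescript
type Doc = Record<string, unknown>;

// Simplified, hypothetical schema inference: maps each top-level field name
// to the list of value types observed for it, in order of first appearance.
function inferSchema(docs: Doc[]): Record<string, string[]> {
    const schema: Record<string, string[]> = {};
    for (const doc of docs) {
        for (const field of Object.keys(doc)) {
            const value = doc[field];
            const typeName =
                value === null ? "null" : Array.isArray(value) ? "array" : typeof value;
            const types = schema[field] ?? (schema[field] = []);
            if (types.indexOf(typeName) === -1) {
                types.push(typeName);
            }
        }
    }
    return schema;
}

// Documents as they might be returned in natural order.
const documentsInNaturalOrder: Doc[] = [
    { _id: 1, name: "a" },
    { _id: 2, name: "b" },
    { _id: 3, name: "c" },
    { _id: 4, name: "d", email: "d@example.com" },
    { _id: 5, name: "e", tags: ["x"] },
    { _id: 6, name: null, email: "f@example.com" },
];

// Equivalent of find with a small limit: only the leading documents are seen.
const schemaFromFirstThree = inferSchema(documentsInNaturalOrder.slice(0, 3));

// Documents drawn from across the collection (indices 1, 3 and 5).
const schemaFromSpread = inferSchema(
    documentsInNaturalOrder.filter((_, index) => index % 2 === 1),
);
```

In this toy data, the schema inferred from the first three documents has no email field at all, while the schema inferred from documents spread across the collection includes it and also records that name can be null.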
The $sample stage works differently: it takes a statistical sample by picking random documents from different places within the same collection. This is more reliable on collections where not all documents have the same set of fields.
Checklist