Skip to content

Conversation

kmruiz
Copy link
Collaborator

@kmruiz kmruiz commented Sep 22, 2025

Proposed changes

Using find without sorting will return documents in natural order, which is not reliable for sampling because it depends on the last "updated" time a document was updated. By using find, we are biased towards the latest documents in a database that might not even be up to date, introducing bias.

The $sample works differently: it actually does an statistical sample by getting random documents from different random places within the same collection. This is more reliable on collections where not all documents have the same amount of fields.

Checklist

@Copilot Copilot AI review requested due to automatic review settings September 22, 2025 10:24
@kmruiz kmruiz requested a review from a team as a code owner September 22, 2025 10:24
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves the schema sampling mechanism in the collection schema tool by replacing the biased find operation with the statistical $sample aggregation operation.

Key changes:

  • Replaces find with aggregate using $sample to get truly random documents
  • Increases sample size from 5 to 50 documents for better schema inference

@kmruiz kmruiz self-assigned this Sep 22, 2025
Copy link
Collaborator

@himanshusinghs himanshusinghs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but should we use collectCursorLogic here to not exceed the memory limits?

Copy link
Contributor

📊 Accuracy Test Results

📈 Summary

Metric Value
Commit SHA eee9568abe0a90423cafc446e8dd44387a8157f5
Run ID 7eae5508-bc1d-4693-bb59-1d6830529bd2
Status done
Total Prompts Evaluated 61
Models Tested 1
Average Accuracy 88.5%
Responses with 0% Accuracy 6
Responses with 75% Accuracy 4
Responses with 100% Accuracy 51

📊 Baseline Comparison

Metric Value
Baseline Commit 9f4c48b786d16093ae2936c2b8ddc270221eaaed
Baseline Run ID ca24c181-d9a9-4669-9982-4cdc1df5939f
Baseline Run Status done
Responses Improved 0
Responses Regressed 2

📎 Download Full HTML Report - Look for the accuracy-test-summary artifact for detailed results.

Report generated on: 9/22/2025, 10:28:50 AM

@kmruiz
Copy link
Collaborator Author

kmruiz commented Sep 22, 2025

I actually believe Copilots idea and yours are good, so I'll add both a sampleSize parameter and also use the new cursor logic. This way we can just upgrade this tool to be as reliable as others.

@coveralls
Copy link
Collaborator

coveralls commented Sep 22, 2025

Pull Request Test Coverage Report for Build 17914435472

Details

  • 47 of 48 (97.92%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.1%) to 82.472%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/tools/mongodb/metadata/collectionSchema.ts 36 37 97.3%
Totals Coverage Status
Change from base Build 17911130906: 0.1%
Covered Lines: 5276
Relevant Lines: 6286

💛 - Coveralls

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

src/tools/mongodb/metadata/collectionSchema.ts:1

  • There are trailing spaces after the commas on these lines. Remove the extra whitespace to maintain consistent formatting.
import type { CallToolResult } from "@modelcontextprotocol/sdk/types.js";

src/tools/mongodb/metadata/collectionSchema.ts:1

  • There are trailing spaces after the commas on these lines. Remove the extra whitespace to maintain consistent formatting.
import type { CallToolResult } from "@modelcontextprotocol/sdk/types.js";

Copy link
Collaborator

@nirinchev nirinchev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, otherwise looks good (once prettier is made happy).

): Promise<CallToolResult> {
const provider = await this.ensureConnected();
const documents = await provider.find(database, collection, {}, { limit: 5 }).toArray();
const cursor = provider.aggregate(database, collection, [{ $sample: { size: Math.min(sampleSize, this.config.maxDocumentsPerQuery) } }]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we want to limit the sample to maxDocumentsPerQuery - the way I interpreted this config option, it's dealing with the number of documents we'd be returning to the LLM, not necessarily the number of documents we're fetching internally - e.g. the LLM shouldn't care if we sample 50 or 1000 docs since it's only seeing the inferred schema anyway.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be another option, I just wanted to limit in case a model gets crazy and tries to query thousands and thousands of documents for sampling. $sample is a bit more expensive than just finding, so it's just for safety.

No strong opinion here by the way, we can have a specific hardcoded option for sample in a constant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to a constant for the upper limit.

@kmruiz kmruiz force-pushed the chore/use-sample-instead-of-find-for-schemas branch from 6fe785f to 8f31898 Compare September 22, 2025 11:54
@kmruiz kmruiz enabled auto-merge (squash) September 22, 2025 12:05
@kmruiz kmruiz disabled auto-merge September 22, 2025 12:14
@kmruiz kmruiz merged commit 6b8fbd1 into main Sep 22, 2025
23 of 27 checks passed
@kmruiz kmruiz deleted the chore/use-sample-instead-of-find-for-schemas branch September 22, 2025 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants