fix(schema): Use sample instead of find for schema sampling #580

kmruiz · 2025-09-22T10:24:40Z

Proposed changes

Using find without sorting returns documents in natural order, which is not reliable for sampling because it depends on when each document was last updated. By using find, we are biased towards the latest documents in a database, which might not even be up to date.

$sample works differently: it takes a statistical sample by picking random documents from different places within the same collection. This is more reliable on collections where not all documents have the same set of fields.
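
To make the bias concrete, here is a minimal in-memory sketch (not the tool's code, and not how the server implements $sample). It assumes a hypothetical collection where the first five documents in stored order lack an `email` field, and compares the fields inferred from "the first 5 documents" against those inferred from a random subset of 50. The helper names `inferFieldNames`, `firstN`, and `randomSample` are made up for illustration.

```typescript
type Doc = Record<string, unknown>;

// Union of top-level field names seen across the given documents, sorted.
function inferFieldNames(docs: Doc[]): string[] {
    const seen: Record<string, true> = {};
    for (const doc of docs) {
        for (const key of Object.keys(doc)) {
            seen[key] = true;
        }
    }
    return Object.keys(seen).sort();
}

// Mimics find(...) with a limit: the first n documents in stored order.
function firstN(docs: Doc[], n: number): Doc[] {
    return docs.slice(0, n);
}

// Mimics the effect of a random sample: n documents picked uniformly
// at random via a partial Fisher-Yates shuffle on a copy.
function randomSample(docs: Doc[], n: number): Doc[] {
    const pool = docs.slice();
    const count = Math.min(n, pool.length);
    for (let i = 0; i < count; i++) {
        const j = i + Math.floor(Math.random() * (pool.length - i));
        const tmp = pool[i] as Doc;
        pool[i] = pool[j] as Doc;
        pool[j] = tmp;
    }
    return pool.slice(0, count);
}

// Hypothetical collection: only the first 5 documents lack "email".
const docs: Doc[] = [];
for (let i = 0; i < 100; i++) {
    docs.push(
        i < 5
            ? { _id: i, name: `user${i}` }
            : { _id: i, name: `user${i}`, email: `user${i}@example.com` }
    );
}

const fromFind = inferFieldNames(firstN(docs, 5));
const fromSample = inferFieldNames(randomSample(docs, 50));
console.log(fromFind, fromSample);
```

With this data the first-five approach never sees `email`, while a sample of 50 out of 100 must include it, because only five documents lack the field.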

Checklist

I have signed the MongoDB CLA

Copilot

Pull Request Overview

This PR improves the schema sampling mechanism in the collection schema tool by replacing the biased find operation with the statistical $sample aggregation operation.

Key changes:

Replaces find with aggregate using $sample to get truly random documents
Increases sample size from 5 to 50 documents for better schema inference

src/tools/mongodb/metadata/collectionSchema.ts

himanshusinghs

Looks good, but should we use collectCursorLogic here to not exceed the memory limits?

github-actions · 2025-09-22T10:28:52Z

📊 Accuracy Test Results

📈 Summary

Metric	Value
Commit SHA	`eee9568abe0a90423cafc446e8dd44387a8157f5`
Run ID	`7eae5508-bc1d-4693-bb59-1d6830529bd2`
Status	done
Total Prompts Evaluated	61
Models Tested	1
Average Accuracy	88.5%
Responses with 0% Accuracy	6
Responses with 75% Accuracy	4
Responses with 100% Accuracy	51

📊 Baseline Comparison

Metric	Value
Baseline Commit	`9f4c48b786d16093ae2936c2b8ddc270221eaaed`
Baseline Run ID	`ca24c181-d9a9-4669-9982-4cdc1df5939f`
Baseline Run Status	`done`
Responses Improved	0
Responses Regressed	2

📎 Download Full HTML Report - Look for the accuracy-test-summary artifact for detailed results.

Report generated on: 9/22/2025, 10:28:50 AM

kmruiz · 2025-09-22T10:31:59Z

I actually believe Copilot's idea and yours are both good, so I'll add a sampleSize parameter and also use the new cursor logic. This way we can upgrade this tool to be as reliable as the others.
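
For readers unfamiliar with the idea, the memory-limit concern can be sketched roughly as follows. This is not the repository's cursor helper: `collectWithinBudget` is a hypothetical name, the size estimate uses the JSON string length rather than a real BSON size, and it iterates a synchronous iterable for simplicity (a real cursor would be consumed with `for await`).

```typescript
// Collects items until adding the next one would exceed maxBytes.
// Size is approximated by the JSON serialization length, which is an
// assumption for illustration only.
function collectWithinBudget<T extends object>(
    source: Iterable<T>,
    maxBytes: number
): { items: T[]; truncated: boolean } {
    const items: T[] = [];
    let used = 0;
    for (const item of source) {
        const size = JSON.stringify(item).length;
        if (used + size > maxBytes) {
            // Stop early and report that not everything was collected.
            return { items, truncated: true };
        }
        used += size;
        items.push(item);
    }
    return { items, truncated: false };
}

// Each of these documents serializes to 7 characters, e.g. {"a":1}.
const sampled = [{ a: 1 }, { a: 2 }, { a: 3 }];
const limited = collectWithinBudget(sampled, 15);
const unlimited = collectWithinBudget(sampled, 100);
console.log(limited.items.length, limited.truncated, unlimited.items.length);
```

The point of the design is that schema inference can still run on whatever was collected, while the `truncated` flag lets the caller tell the user that the sample was cut short.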

coveralls · 2025-09-22T10:36:36Z

Pull Request Test Coverage Report for Build 17914435472

Details

47 of 48 (97.92%) changed or added relevant lines in 2 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.1%) to 82.472%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
src/tools/mongodb/metadata/collectionSchema.ts	36	37	97.3%

Totals
Change from base Build 17911130906:	0.1%
Covered Lines:	5276
Relevant Lines:	6286

💛 - Coveralls

Copilot

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

src/tools/mongodb/metadata/collectionSchema.ts:1

There are trailing spaces after the commas on these lines. Remove the extra whitespace to maintain consistent formatting.

import type { CallToolResult } from "@modelcontextprotocol/sdk/types.js";

src/tools/mongodb/metadata/collectionSchema.ts:1

There are trailing spaces after the commas on these lines. Remove the extra whitespace to maintain consistent formatting.

import type { CallToolResult } from "@modelcontextprotocol/sdk/types.js";

src/tools/mongodb/metadata/collectionSchema.ts

nirinchev

One question, otherwise looks good (once prettier is made happy).

nirinchev · 2025-09-22T11:42:06Z

src/tools/mongodb/metadata/collectionSchema.ts

+    ): Promise<CallToolResult> {
        const provider = await this.ensureConnected();
-        const documents = await provider.find(database, collection, {}, { limit: 5 }).toArray();
+        const cursor = provider.aggregate(database, collection, [{ $sample: { size: Math.min(sampleSize, this.config.maxDocumentsPerQuery) } }]);


I wonder if we want to limit the sample to maxDocumentsPerQuery - the way I interpreted this config option, it's dealing with the number of documents we'd be returning to the LLM, not necessarily the number of documents we're fetching internally - e.g. the LLM shouldn't care if we sample 50 or 1000 docs since it's only seeing the inferred schema anyway.

It could be another option, I just wanted to limit in case a model gets crazy and tries to query thousands and thousands of documents for sampling. $sample is a bit more expensive than just finding, so it's just for safety.

No strong opinion here by the way, we can have a specific hardcoded option for sample in a constant.

Changed to a constant for the upper limit.
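
A minimal sketch of what clamping to a hardcoded constant could look like, assuming the pipeline is built in a small helper. `MAX_SAMPLE_SIZE`, its value, and `buildSamplePipeline` are hypothetical; the constant actually chosen in the PR is not shown in this thread.

```typescript
// Hypothetical upper bound; the real value used in the PR is not shown here.
const MAX_SAMPLE_SIZE = 1000;

type SampleStage = { $sample: { size: number } };

// Builds a single-stage pipeline whose sample size is clamped to
// [1, MAX_SAMPLE_SIZE]. Assumes sampleSize is a finite number.
function buildSamplePipeline(sampleSize: number): SampleStage[] {
    const size = Math.max(1, Math.min(Math.floor(sampleSize), MAX_SAMPLE_SIZE));
    return [{ $sample: { size } }];
}

console.log(JSON.stringify(buildSamplePipeline(50)));
console.log(JSON.stringify(buildSamplePipeline(1000000)));
```

The result would then be passed as the third argument to provider.aggregate(database, collection, ...), as in the diff above, so a model requesting a huge sample is capped without tying the cap to config.maxDocumentsPerQuery.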

fix(schema): Use sample instead of find for schema sampling.

39aa3ac

Copilot AI review requested due to automatic review settings September 22, 2025 10:24

kmruiz requested a review from a team as a code owner September 22, 2025 10:24

kmruiz added the accuracy-tests label Sep 22, 2025

Merge branch 'main' into chore/use-sample-instead-of-find-for-schemas

4219bd5

Copilot AI reviewed Sep 22, 2025

kmruiz self-assigned this Sep 22, 2025

himanshusinghs reviewed Sep 22, 2025

chore: use memory limits and support custom sample size

8206acc

kmruiz requested review from Copilot and himanshusinghs September 22, 2025 11:25

Copilot AI reviewed Sep 22, 2025

chore: fix build

6b81bd3

nirinchev approved these changes Sep 22, 2025

kmruiz added 2 commits September 22, 2025 13:43

chore: use custom isObjectEmpty that is O(1) instead of O(N). Importa…

e0324c2

…nt for large schemas

chore: Use a hardcoded constant for the maximum upper bounds instead …

8f31898

…of config.maxDocumentsPerQuery

kmruiz force-pushed the chore/use-sample-instead-of-find-for-schemas branch from 6fe785f to 8f31898 Compare September 22, 2025 11:54

kmruiz enabled auto-merge (squash) September 22, 2025 12:05

kmruiz disabled auto-merge September 22, 2025 12:14

kmruiz merged commit 6b8fbd1 into main Sep 22, 2025
23 of 27 checks passed

kmruiz deleted the chore/use-sample-instead-of-find-for-schemas branch September 22, 2025 12:16
