Data Ingestion #49051
Conversation
Pull Request Overview
This pull request introduces comprehensive documentation for data ingestion in AI applications, focusing on the Microsoft.Extensions.DataIngestion library and its integration with .NET AI workflows.
Key changes:
- Added a new conceptual article explaining data ingestion fundamentals, architecture, and building blocks
- Updated the table of contents to include the new data ingestion documentation
Reviewed Changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| docs/ai/toc.yml | Added "Data ingestion" entry to the conceptual documentation navigation |
| docs/ai/conceptual/data-ingestion.md | New comprehensive article covering data ingestion concepts, architecture, and the Microsoft.Extensions.DataIngestion library components |
This design means you can work with documents from different sources using the same consistent API, making your code more maintainable and flexible.

```csharp
// TODO: Add code snippet
```
Copilot (AI) commented on Oct 9, 2025:
The TODO comment indicates missing code examples. Consider adding a concrete code snippet showing how to use DocumentReader to load documents from different sources.
Suggested change:

```csharp
// Load a Markdown document from a local file
var markdownReader = new MarkdownDocumentReader();
Document markdownDoc = await markdownReader.ReadAsync("sample.md");

// Load a PDF document from Azure Document Intelligence
var azureReader = new AzureDocumentIntelligenceReader("<your-azure-endpoint>", "<your-api-key>");
Document pdfDoc = await azureReader.ReadAsync("https://contoso.com/files/sample.pdf");
```
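Note: the reader type names in this suggestion (`MarkdownDocumentReader`, `AzureDocumentIntelligenceReader`) are illustrative and may not match the shipped Microsoft.Extensions.DataIngestion API; verify them against the library before adopting the snippet.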
These chunking strategies build on the Microsoft.ML.Tokenizers library to intelligently split text into appropriately sized pieces that work well with large language models. The right chunking strategy depends on your document types and how you plan to retrieve information.

```csharp
// TODO: Add code snippet
```
Copilot (AI) commented on Oct 9, 2025:
The TODO comment indicates missing code examples. Consider adding a concrete code snippet demonstrating different chunking strategies in practice.
Suggested change:

```csharp
using System;
using System.Linq;
using Microsoft.ML.Tokenizers;

string document = "## Introduction\nThis is the first section. It contains some information.\n## Details\nThis section goes into more detail. It has more tokens and ideas.";

// Token-based chunking: encode the text to token IDs, group the IDs into
// fixed-size chunks, and decode each group back to text.
// (TiktokenTokenizer.CreateForModel("gpt-4") requires a reference to the
// Microsoft.ML.Tokenizers.Data.Cl100kBase package.)
var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
var ids = tokenizer.EncodeToIds(document);
int chunkSize = 20;
var tokenChunks = ids
    .Chunk(chunkSize)
    .Select(chunk => tokenizer.Decode(chunk));

// Section-based chunking (split on headers)
var sectionChunks = document.Split(new[] { "## " }, StringSplitOptions.RemoveEmptyEntries)
    .Select(section => "## " + section.Trim());

// Semantic-aware chunking (split on sentences, preserving context)
var sentences = document.Split(new[] { ". " }, StringSplitOptions.RemoveEmptyEntries);
int semanticChunkSize = 2;
var semanticChunks = sentences
    .Chunk(semanticChunkSize)
    .Select(group => string.Join(". ", group).TrimEnd('.') + ".");

// Output examples
Console.WriteLine("Token-based chunks:");
foreach (var chunk in tokenChunks) Console.WriteLine(chunk);

Console.WriteLine("\nSection-based chunks:");
foreach (var chunk in sectionChunks) Console.WriteLine(chunk);

Console.WriteLine("\nSemantic-aware chunks:");
foreach (var chunk in semanticChunks) Console.WriteLine(chunk);
```
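The token-based portion above assumes the `TiktokenTokenizer` API from Microsoft.ML.Tokenizers (which needs the matching data package, e.g. Microsoft.ML.Tokenizers.Data.Cl100kBase for gpt-4). Fixed-size token grouping is shown for illustration only; production chunkers usually add overlap between adjacent chunks to preserve context across boundaries.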
This pipeline approach reduces boilerplate code and makes it easy to build, test, and maintain complex data ingestion workflows.

```csharp
// TODO: Add code snippet
```
Copilot (AI) commented on Oct 9, 2025:
The TODO comment indicates missing code examples. Consider adding a concrete code snippet showing how to compose a complete DocumentPipeline with readers, processors, chunkers, and writers.
Suggested change:

```csharp
// Example: Compose a DocumentPipeline with a reader, processor, chunker, and writer
var reader = new FileDocumentReader("data/*.pdf");
var processor = new SummarizationProcessor();
var chunker = new ParagraphChunker();
var writer = new VectorStoreDocumentWriter("my-vector-store");

var pipeline = new DocumentPipeline()
    .WithReader(reader)
    .WithProcessor(processor)
    .WithChunker(chunker)
    .WithWriter(writer);

// Run the pipeline
await pipeline.RunAsync();
```
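The pipeline types here (`FileDocumentReader`, `SummarizationProcessor`, `ParagraphChunker`, `VectorStoreDocumentWriter`, `DocumentPipeline`) are likewise placeholders; confirm the concrete builder API that Microsoft.Extensions.DataIngestion actually exposes before using this in the article.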
This pull request introduces a new conceptual article on data ingestion for AI applications and updates the documentation table of contents to include it. The new article explains the fundamentals of data ingestion, its importance for AI and RAG scenarios, and details the architecture and building blocks provided by the Microsoft.Extensions.DataIngestion library.

Documentation Additions and Updates:

- Added a new conceptual article, data-ingestion.md, covering the definition, importance, and technical foundations of data ingestion, with a focus on .NET AI workflows and the Microsoft.Extensions.DataIngestion library.
- Updated the table of contents (toc.yml) to include the new "Data ingestion" article under the AI conceptual section.