Conversation

luisquintanilla (Contributor) commented Oct 9, 2025

This pull request introduces a new conceptual article on data ingestion for AI applications and updates the documentation table of contents to include it. The new article explains the fundamentals of data ingestion, its importance for AI and RAG scenarios, and details the architecture and building blocks provided by the Microsoft.Extensions.DataIngestion library.

Documentation Additions and Updates:

  • Added a comprehensive conceptual article, data-ingestion.md, covering the definition, importance, and technical foundations of data ingestion, with a focus on .NET AI workflows and the Microsoft.Extensions.DataIngestion library.
  • Updated the documentation table of contents (toc.yml) to include the new "Data ingestion" article under the AI conceptual section.
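
For readers unfamiliar with docfx navigation files, a `toc.yml` entry of this kind typically has the following shape. This is a sketch only; the actual `name` and `href` values and nesting come from the PR's diff, which isn't shown here:

```yaml
# Hypothetical shape of the new entry under the AI conceptual section
- name: Data ingestion
  href: conceptual/data-ingestion.md
```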

Internal previews

File | Preview link
docs/ai/conceptual/data-ingestion.md | Data Ingestion
docs/ai/toc.yml | docs/ai/toc

luisquintanilla requested review from a team and gewarren as code owners October 9, 2025 20:19
Copilot AI review requested due to automatic review settings October 9, 2025 20:19
dotnetrepoman bot added this to the October 2025 milestone Oct 9, 2025

Copilot AI left a comment


Pull Request Overview

This pull request introduces comprehensive documentation for data ingestion in AI applications, focusing on the Microsoft.Extensions.DataIngestion library and its integration with .NET AI workflows.

Key changes:

  • Added a new conceptual article explaining data ingestion fundamentals, architecture, and building blocks
  • Updated the table of contents to include the new data ingestion documentation

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.

File | Description
docs/ai/toc.yml | Added "Data ingestion" entry to the conceptual documentation navigation
docs/ai/conceptual/data-ingestion.md | New comprehensive article covering data ingestion concepts, architecture, and the Microsoft.Extensions.DataIngestion library components

This design means you can work with documents from different sources using the same consistent API, making your code more maintainable and flexible.

```csharp
// TODO: Add code snippet
```

Copilot AI Oct 9, 2025


The TODO comment indicates missing code examples. Consider adding a concrete code snippet showing how to use DocumentReader to load documents from different sources.

Suggested change

```csharp
// Load a Markdown document from a local file
var markdownReader = new MarkdownDocumentReader();
Document markdownDoc = await markdownReader.ReadAsync("sample.md");

// Load a PDF document from Azure Document Intelligence
var azureReader = new AzureDocumentIntelligenceReader("<your-azure-endpoint>", "<your-api-key>");
Document pdfDoc = await azureReader.ReadAsync("https://contoso.com/files/sample.pdf");
```


These chunking strategies build on the Microsoft.ML.Tokenizers library to intelligently split text into appropriately sized pieces that work well with large language models. The right chunking strategy depends on your document types and how you plan to retrieve information.

```csharp
// TODO: Add code snippet
```

Copilot AI Oct 9, 2025


The TODO comment indicates missing code examples. Consider adding a concrete code snippet demonstrating different chunking strategies in practice.

Suggested change

```csharp
using Microsoft.ML.Tokenizers;

string document = "## Introduction\nThis is the first section. It contains some information.\n## Details\nThis section goes into more detail. It has more tokens and ideas.";

// Token-based chunking
var tokenizer = new BertTokenizer();
var tokens = tokenizer.Tokenize(document);
int chunkSize = 20;
var tokenChunks = tokens
    .Select((token, idx) => new { token, idx })
    .GroupBy(x => x.idx / chunkSize)
    .Select(g => string.Join(" ", g.Select(x => x.token.Text)));

// Section-based chunking (split on headers)
var sectionChunks = document.Split(new[] { "## " }, StringSplitOptions.RemoveEmptyEntries)
    .Select(section => "## " + section.Trim());

// Semantic-aware chunking (split on sentences, preserving context)
var sentences = document.Split(new[] { ". " }, StringSplitOptions.RemoveEmptyEntries);
int semanticChunkSize = 2;
var semanticChunks = sentences
    .Select((sentence, idx) => new { sentence, idx })
    .GroupBy(x => x.idx / semanticChunkSize)
    .Select(g => string.Join(". ", g.Select(x => x.sentence)) + ".");

// Output examples
Console.WriteLine("Token-based chunks:");
foreach (var chunk in tokenChunks) Console.WriteLine(chunk);
Console.WriteLine("\nSection-based chunks:");
foreach (var chunk in sectionChunks) Console.WriteLine(chunk);
Console.WriteLine("\nSemantic-aware chunks:");
foreach (var chunk in semanticChunks) Console.WriteLine(chunk);
```


This pipeline approach reduces boilerplate code and makes it easy to build, test, and maintain complex data ingestion workflows.

```csharp
// TODO: Add code snippet
```

Copilot AI Oct 9, 2025


The TODO comment indicates missing code examples. Consider adding a concrete code snippet showing how to compose a complete DocumentPipeline with readers, processors, chunkers, and writers.

Suggested change

```csharp
// Example: Compose a DocumentPipeline with reader, processor, chunker, and writer
var reader = new FileDocumentReader("data/*.pdf");
var processor = new SummarizationProcessor();
var chunker = new ParagraphChunker();
var writer = new VectorStoreDocumentWriter("my-vector-store");

var pipeline = new DocumentPipeline()
    .WithReader(reader)
    .WithProcessor(processor)
    .WithChunker(chunker)
    .WithWriter(writer);

// Run the pipeline
await pipeline.RunAsync();
```


@luisquintanilla luisquintanilla marked this pull request as draft October 10, 2025 14:29