Data Ingestion #49051
Conversation
Pull Request Overview
This pull request introduces comprehensive documentation for data ingestion in AI applications, focusing on the Microsoft.Extensions.DataIngestion library and its integration with .NET AI workflows.
Key changes:
- Added a new conceptual article explaining data ingestion fundamentals, architecture, and building blocks
- Updated the table of contents to include the new data ingestion documentation
Reviewed Changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| docs/ai/toc.yml | Added "Data ingestion" entry to the conceptual documentation navigation |
| docs/ai/conceptual/data-ingestion.md | New comprehensive article covering data ingestion concepts, architecture, and the Microsoft.Extensions.DataIngestion library components |
This design means you can work with documents from different sources using the same consistent API, making your code more maintainable and flexible.

```csharp
// TODO: Add code snippet
```
Copilot (AI) commented on Oct 9, 2025:
The TODO comment indicates missing code examples. Consider adding a concrete code snippet showing how to use DocumentReader to load documents from different sources.
Suggested change:

```csharp
// Load a Markdown document from a local file
var markdownReader = new MarkdownDocumentReader();
Document markdownDoc = await markdownReader.ReadAsync("sample.md");

// Load a PDF document from Azure Document Intelligence
var azureReader = new AzureDocumentIntelligenceReader("<your-azure-endpoint>", "<your-api-key>");
Document pdfDoc = await azureReader.ReadAsync("https://contoso.com/files/sample.pdf");
```
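Note: the reader type names in this suggestion (`MarkdownDocumentReader`, `AzureDocumentIntelligenceReader`) are illustrative and may not match the shipped Microsoft.Extensions.DataIngestion API; verify them against the library before adopting the snippet.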
These chunking strategies build on the Microsoft.ML.Tokenizers library to intelligently split text into appropriately sized pieces that work well with large language models. The right chunking strategy depends on your document types and how you plan to retrieve information.

```csharp
// TODO: Add code snippet
```
Copilot (AI) commented on Oct 9, 2025:
The TODO comment indicates missing code examples. Consider adding a concrete code snippet demonstrating different chunking strategies in practice.
Suggested change:

```csharp
using System;
using System.Linq;
using Microsoft.ML.Tokenizers;

string document = "## Introduction\nThis is the first section. It contains some information.\n## Details\nThis section goes into more detail. It has more tokens and ideas.";

// Token-based chunking: encode the text to token IDs, group the IDs into
// fixed-size chunks, and decode each group back to text.
// (TiktokenTokenizer.CreateForModel("gpt-4") requires a reference to the
// Microsoft.ML.Tokenizers.Data.Cl100kBase package.)
var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
var ids = tokenizer.EncodeToIds(document);
int chunkSize = 20;
var tokenChunks = ids
    .Chunk(chunkSize)
    .Select(chunk => tokenizer.Decode(chunk));

// Section-based chunking (split on headers)
var sectionChunks = document.Split(new[] { "## " }, StringSplitOptions.RemoveEmptyEntries)
    .Select(section => "## " + section.Trim());

// Semantic-aware chunking (split on sentences, preserving context)
var sentences = document.Split(new[] { ". " }, StringSplitOptions.RemoveEmptyEntries);
int semanticChunkSize = 2;
var semanticChunks = sentences
    .Chunk(semanticChunkSize)
    .Select(group => string.Join(". ", group).TrimEnd('.') + ".");

// Output examples
Console.WriteLine("Token-based chunks:");
foreach (var chunk in tokenChunks) Console.WriteLine(chunk);

Console.WriteLine("\nSection-based chunks:");
foreach (var chunk in sectionChunks) Console.WriteLine(chunk);

Console.WriteLine("\nSemantic-aware chunks:");
foreach (var chunk in semanticChunks) Console.WriteLine(chunk);
```
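The token-based portion above assumes the `TiktokenTokenizer` API from Microsoft.ML.Tokenizers (which needs the matching data package, e.g. Microsoft.ML.Tokenizers.Data.Cl100kBase for gpt-4). Fixed-size token grouping is shown for illustration only; production chunkers usually add overlap between adjacent chunks to preserve context across boundaries.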
This pipeline approach reduces boilerplate code and makes it easy to build, test, and maintain complex data ingestion workflows.

```csharp
// TODO: Add code snippet
```
Copilot (AI) commented on Oct 9, 2025:
The TODO comment indicates missing code examples. Consider adding a concrete code snippet showing how to compose a complete DocumentPipeline with readers, processors, chunkers, and writers.
Suggested change:

```csharp
// Example: Compose a DocumentPipeline with a reader, processor, chunker, and writer
var reader = new FileDocumentReader("data/*.pdf");
var processor = new SummarizationProcessor();
var chunker = new ParagraphChunker();
var writer = new VectorStoreDocumentWriter("my-vector-store");

var pipeline = new DocumentPipeline()
    .WithReader(reader)
    .WithProcessor(processor)
    .WithChunker(chunker)
    .WithWriter(writer);

// Run the pipeline
await pipeline.RunAsync();
```
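The pipeline types here (`FileDocumentReader`, `SummarizationProcessor`, `ParagraphChunker`, `VectorStoreDocumentWriter`, `DocumentPipeline`) are likewise placeholders; confirm the concrete builder API that Microsoft.Extensions.DataIngestion actually exposes before using this in the article.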
This pull request introduces a new conceptual article on data ingestion for AI applications and updates the documentation table of contents to include it. The new article explains the fundamentals of data ingestion, its importance for AI and RAG scenarios, and details the architecture and building blocks provided by the Microsoft.Extensions.DataIngestion library.

Documentation Additions and Updates:

- Added a new conceptual article, data-ingestion.md, covering the definition, importance, and technical foundations of data ingestion, with a focus on .NET AI workflows and the Microsoft.Extensions.DataIngestion library.
- Updated the table of contents (toc.yml) to include the new "Data ingestion" article under the AI conceptual section.