[EPIC] Expand Knowledge Document Ingestion Pipeline #324

@aakankshaduggal

Description

Add support for ingesting and processing various document types (Markdown, PDF, DOCX, etc.) into formats compatible with SDG workflows.

Key Features:

  • InstructLab Schema: Define an InstructLab schema to standardize input formats for SDG and RAG.
  • Docling Integration: Use Docling to convert document formats (PDF, DOCX, HTML) into a JSON-compatible schema.
  • Document Chunking Command: Develop an ilab document format command to chunk and format documents according to the SDG schema.
  • Simplified Git Workflows: Introduce a script to handle Git repository setup, structure, and file organization for knowledge documents.
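
To make the chunking feature concrete, here is a minimal sketch of what splitting a Markdown document into JSON records might look like. It is an illustration only: the `Chunk` fields, the `chunk_markdown` function, and the `max_words` parameter are hypothetical names, since this issue does not define the InstructLab schema or the command's real behavior.

```python
import json
import re
from dataclasses import asdict, dataclass


@dataclass
class Chunk:
    # Hypothetical record layout; the real InstructLab schema is not defined in this issue.
    source: str
    heading: str
    text: str


def chunk_markdown(source: str, markdown: str, max_words: int = 200) -> list[Chunk]:
    """Split Markdown at headings, then cap each piece at max_words words."""
    chunks: list[Chunk] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered section text as one or more word-limited chunks.
        words = " ".join(buffer).split()
        for start in range(0, len(words), max_words):
            piece = " ".join(words[start:start + max_words])
            chunks.append(Chunk(source, heading, piece))
        buffer.clear()

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()  # close the section under the previous heading
            heading = match.group(1).strip()
        elif line.strip():
            buffer.append(line.strip())
    flush()  # emit whatever follows the last heading
    return chunks


if __name__ == "__main__":
    doc = "# Intro\nHello world.\n\n## Details\none two three four five"
    for chunk in chunk_markdown("example.md", doc, max_words=3):
        print(json.dumps(asdict(chunk)))
```

Splitting at headings first keeps each chunk tied to a single section, and the word cap is a stand-in for whatever size limit the SDG schema ends up requiring.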

Metadata

Labels

epic (Larger tracking issue encompassing multiple smaller issues), stale
