Replies: 3 comments
-
I am not familiar with your business scenario, so I do not yet understand the problem you need to solve. Could you provide some excerpts from the supply-chain data and the official USTR documents, along with the entities and relationships extracted from them, as well as the queries used in the process and the expected query results?
-
Thank you for your interest in our project! Here are the specific details you requested. Our goal is to analyze how trade policy announcements impact supply chain operations.

Sample Data
1. USTR Document Data
2. Supply Chain Data (SupplyGraph dataset - Bangladesh FMCG), e.g. Event (date: "2023-05-12")

Entity Extraction Results
Entity Types (from the USTR text above): Country, Location, Date, Policy, TariffMeasure, ProductCategory, Document, Action, etc.

We are linking USTR documents and supply chain data based on their associated dates.

Core Questions & Real Scenarios

Question 1: chunk_id vs document_id
Current: each chunk gets a unique source_id like "ustr_doc1_chunk1", "ustr_doc1_chunk2".
Should we use a document-level source_id instead?

Question 2: Entity deduplication across chunks
The same entity is extracted in multiple chunks with slightly different descriptions. Should we merge into the existing entity node, or create a new entity node per chunk? Which approach works better for cross-document reasoning?

Sample Business Question & Answer
Question: "How did our global supply chain perform during the week before and after the USTR China tariff exclusion extension announcement? Were there any disruptions in production or delivery due to the policy announcement?"
Query:
Answer: During the week before and after the policy announcement, all 40 products completed the full 4-stage process from production to customer delivery with a 100% success rate, and all 10 plants operated simultaneously without issues, achieving zero operational disruption despite policy uncertainty.
Response:
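Roughly, the query side would look like the sketch below. This is illustrative only: the working_dir, the choice of mode="hybrid", and the assumption that the knowledge graph is already populated with the USTR documents and SupplyGraph data are ours, not details from the thread.

```python
from lightrag import LightRAG, QueryParam

# Assumes a LightRAG instance whose knowledge graph already contains the USTR
# documents and the SupplyGraph data; working_dir is illustrative. Depending on
# your LightRAG version you may also need to pass llm_model_func / embedding_func.
rag = LightRAG(working_dir="./ustr_supplychain")

question = (
    "How did our global supply chain perform during the week before and after "
    "the USTR China tariff exclusion extension announcement? Were there any "
    "disruptions in production or delivery due to the policy announcement?"
)

# "hybrid" retrieval combines entity-level (local) and relationship-level
# (global) context, which suits questions spanning documents and events.
print(rag.query(question, param=QueryParam(mode="hybrid")))
```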
The example question above demonstrates why document-level reasoning is critical in our use case and how modeling choices (like source_id granularity and entity deduplication) directly impact query outcomes. Thank you for your time.
-
Really thoughtful problem framing; you're not wrong to question chunk_id vs document_id. The issue you're likely running into isn't just the metadata schema, it's semantic binding drift across graph layers.

TL;DR: What you're describing (the same entity reappearing across slightly different chunk views) is one of the classic signs of symbolic versioning drift: hard to catch, nasty to debug. We've been building infrastructure to handle this class of problem, using symbolic overlays backed by tesseract.js (for parsing stability), released fully under MIT. If you're at the point where reasoning spans documents and events, not just chunks, this is exactly where most RAG pipelines start leaking logic. Glad to see others exploring this edge.
-
Your Question
Hi LightRAG team,
We're working on an ontology-based RAG project using supply-chain data and official USTR documents. We are using LightRAG as a base and relying heavily on its graph engine to model relationships between documents and domain entities, connected by event nodes that hold date information.
We have some questions about best practices for modeling documents in the graph:
1. About using chunk_id as source_id instead of document_id
In LightRAG's example (example/insert_custom_kg.py), each chunk is assigned a unique source_id. We're currently following this pattern, but we wonder whether it is ideal for our case.
Since our downstream queries and reasoning are often document-level (e.g., linking supply-chain events to official documents), would it make more sense to assign the source_id based on the document instead of each chunk?
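For concreteness, this is roughly what we mean by a document-level source_id, following the structure used in example/insert_custom_kg.py. The field names may differ across LightRAG versions, and the entity and chunk values below are illustrative placeholders, not real data.

```python
from lightrag import LightRAG

# Illustrative setup; depending on the LightRAG version you may also need to
# pass llm_model_func / embedding_func explicitly.
rag = LightRAG(working_dir="./ustr_rag")

doc_id = "ustr_doc1"  # one id for the whole document, instead of "ustr_doc1_chunk1", ...

ustr_doc1_chunks = [
    "Chunk 1 text of the USTR notice ...",
    "Chunk 2 text of the USTR notice ...",
]

custom_kg = {
    "chunks": [
        {"content": text, "source_id": doc_id}  # every chunk points back to the document
        for text in ustr_doc1_chunks
    ],
    "entities": [
        {
            "entity_name": "USTR",
            "entity_type": "Organization",
            "description": "Office of the United States Trade Representative.",
            "source_id": doc_id,
        },
        {
            "entity_name": "China tariff exclusion extension",
            "entity_type": "Policy",
            "description": "Tariff exclusion extension announced by USTR.",
            "source_id": doc_id,  # provenance is the document, not a single chunk
        },
    ],
    "relationships": [
        {
            "src_id": "USTR",
            "tgt_id": "China tariff exclusion extension",
            "description": "USTR announced the tariff exclusion extension.",
            "keywords": "tariff, exclusion, announcement",
            "weight": 1.0,
            "source_id": doc_id,
        },
    ],
}

rag.insert_custom_kg(custom_kg)
```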
2. Use the existing entity node or create a new one?
Currently we extract entities and relationships from each chunk of a USTR document. Naturally, the same entity name is extracted in multiple chunks with slightly different descriptions.
We are considering the two approaches below (a small merge sketch follows the list) and would love your advice:
Approach 1: Use the existing entity node
Pros: avoids redundancy, easier to count nodes by filtering on entity name
Cons: hard to handle the entity's multiple descriptions, risk of losing context
Approach 2: Create a new entity node per chunk
Pros: keeps contextual info intact, no conflict in metadata
Cons: causes duplication, harder to analyze globally
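As a possible middle ground for Approach 1, here is a minimal pre-insertion merge sketch: one node per entity name, but every chunk's description and source_id is preserved. The "<SEP>" joiner is our assumption about LightRAG's multi-source separator and should be verified against your version; the merge_entities helper is our own, not a LightRAG API.

```python
SEP = "<SEP>"  # assumed separator for multi-source fields; verify against your LightRAG version

def merge_entities(per_chunk_entities):
    """Collapse per-chunk extractions of the same entity into one node,
    keeping every distinct description and chunk-level source_id."""
    merged = {}
    for ent in per_chunk_entities:
        name = ent["entity_name"]
        if name not in merged:
            merged[name] = dict(ent)
            continue
        node = merged[name]
        if ent["description"] not in node["description"]:
            node["description"] += SEP + ent["description"]  # keep all contexts
        if ent["source_id"] not in node["source_id"]:
            node["source_id"] += SEP + ent["source_id"]       # keep per-chunk provenance
    return list(merged.values())

# Usage: the same entity extracted from two chunks of one USTR document.
extracted = [
    {"entity_name": "USTR", "entity_type": "Organization",
     "description": "U.S. Trade Representative, issued the notice.",
     "source_id": "ustr_doc1_chunk1"},
    {"entity_name": "USTR", "entity_type": "Organization",
     "description": "Agency extending the China tariff exclusions.",
     "source_id": "ustr_doc1_chunk2"},
]
print(merge_entities(extracted))
```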
Thank you so much in advance :)
Your insights would be helpful as we scale our RAG project.
Best Regards,
Byeonggyu