Replies: 3 comments
-
I am not familiar with your business scenario, so I do not yet understand the problem you need to solve. Could you provide some excerpts from the supply-chain data and the official USTR documents, along with the entities and relationships extracted from them, as well as the queries used in the process and the expected query results?
-
Thank you for your interest in our project! Here are the specific details you requested. Our goal is to analyze how trade policy announcements impact supply chain operations.

Sample Data
1. USTR Document Data
2. Supply Chain Data (SupplyGraph dataset - Bangladesh FMCG), e.g. Event (date: "2023-05-12")

Entity Extraction Results
Entity Types (from the USTR text above): Country, Location, Date, Policy, TariffMeasure, ProductCategory, Document, Action, etc.

We are linking USTR documents and supply chain data based on their associated dates.

Core Questions & Real Scenarios

Question 1: chunk_id vs document_id
Current: each chunk gets a unique source_id like "ustr_doc1_chunk1", "ustr_doc1_chunk2".
Should we use a document-level source_id instead?

Question 2: Entity deduplication across chunks
The same entity is extracted in multiple chunks with slightly different descriptions. Should we merge into the existing entity node, or create a new entity node per chunk? Which approach works better for cross-document reasoning?

Sample Business Question & Answer
Question: "How did our global supply chain perform during the week before and after the USTR China tariff exclusion extension announcement? Were there any disruptions in production or delivery due to the policy announcement?"
Query:
Answer: During the week before and after the policy announcement, all 40 products completed the full 4-stage process from production to customer delivery with a 100% success rate, and all 10 plants operated simultaneously without issues, achieving zero operational disruption despite policy uncertainty.
Response:
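Roughly, the query side would look like the sketch below. This is illustrative only: the working_dir, the choice of mode="hybrid", and the assumption that the knowledge graph is already populated with the USTR documents and SupplyGraph data are ours, not details from the thread.

```python
from lightrag import LightRAG, QueryParam

# Assumes a LightRAG instance whose knowledge graph already contains the USTR
# documents and the SupplyGraph data; working_dir is illustrative. Depending on
# your LightRAG version you may also need to pass llm_model_func / embedding_func.
rag = LightRAG(working_dir="./ustr_supplychain")

question = (
    "How did our global supply chain perform during the week before and after "
    "the USTR China tariff exclusion extension announcement? Were there any "
    "disruptions in production or delivery due to the policy announcement?"
)

# "hybrid" retrieval combines entity-level (local) and relationship-level
# (global) context, which suits questions spanning documents and events.
print(rag.query(question, param=QueryParam(mode="hybrid")))
```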
The example question above demonstrates why document-level reasoning is critical in our use case and how modeling choices (like source_id granularity and entity deduplication) directly impact query outcomes. Thank you for your time.
-
Really thoughtful problem framing; you're not wrong to question chunk_id vs document_id. The issue you're likely running into isn't just the metadata schema, it's semantic binding drift across graph layers.

TL;DR: What you're describing (the same entity reappearing across slightly different chunk views) is one of the classic signs of symbolic versioning drift: hard to catch, nasty to debug. We've been building infrastructure to handle this class of problem, using symbolic overlays backed by tesseract.js (for parsing stability), released fully under MIT. If you're at the point where reasoning spans documents and events, not just chunks, this is exactly where most RAG pipelines start leaking logic. Glad to see others exploring this edge.
-
Your Question
Hi LightRAG team,
We're working on an ontology-based RAG project using supply-chain data and official USTR documents. We are using LightRAG as a base and relying heavily on its graph engine to model relationships between documents and domain entities, connected by event nodes that hold date information.
We have some questions about best practices for modeling documents in the graph:
1. About using chunk_id as source_id instead of document_id
In LightRAG's example (example/insert_custom_kg.py), each chunk is assigned a unique source_id. We're currently following this pattern, but we wonder whether it is ideal for our case.
Since our downstream queries and reasoning are often document-level (e.g., linking supply-chain events to official documents), would it make more sense to assign the source_id based on the document instead of each chunk?
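For concreteness, this is roughly what we mean by a document-level source_id, following the structure used in example/insert_custom_kg.py. The field names may differ across LightRAG versions, and the entity and chunk values below are illustrative placeholders, not real data.

```python
from lightrag import LightRAG

# Illustrative setup; depending on the LightRAG version you may also need to
# pass llm_model_func / embedding_func explicitly.
rag = LightRAG(working_dir="./ustr_rag")

doc_id = "ustr_doc1"  # one id for the whole document, instead of "ustr_doc1_chunk1", ...

ustr_doc1_chunks = [
    "Chunk 1 text of the USTR notice ...",
    "Chunk 2 text of the USTR notice ...",
]

custom_kg = {
    "chunks": [
        {"content": text, "source_id": doc_id}  # every chunk points back to the document
        for text in ustr_doc1_chunks
    ],
    "entities": [
        {
            "entity_name": "USTR",
            "entity_type": "Organization",
            "description": "Office of the United States Trade Representative.",
            "source_id": doc_id,
        },
        {
            "entity_name": "China tariff exclusion extension",
            "entity_type": "Policy",
            "description": "Tariff exclusion extension announced by USTR.",
            "source_id": doc_id,  # provenance is the document, not a single chunk
        },
    ],
    "relationships": [
        {
            "src_id": "USTR",
            "tgt_id": "China tariff exclusion extension",
            "description": "USTR announced the tariff exclusion extension.",
            "keywords": "tariff, exclusion, announcement",
            "weight": 1.0,
            "source_id": doc_id,
        },
    ],
}

rag.insert_custom_kg(custom_kg)
```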
2. Use the existing entity node or create a new one?
Currently we extract entities and relationships from each chunk of a USTR document. Naturally, the same entity name is extracted in multiple chunks with slightly different descriptions.
We are considering the two approaches below (a small merge sketch follows the list) and would love your advice:
Approach 1: Use the existing entity node
Pros: avoids redundancy, easier to count nodes by filtering on entity name
Cons: hard to handle the entity's multiple descriptions, risk of losing context
Approach 2: Create a new entity node per chunk
Pros: keeps contextual info intact, no conflict in metadata
Cons: causes duplication, harder to analyze globally
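As a possible middle ground for Approach 1, here is a minimal pre-insertion merge sketch: one node per entity name, but every chunk's description and source_id is preserved. The "<SEP>" joiner is our assumption about LightRAG's multi-source separator and should be verified against your version; the merge_entities helper is our own, not a LightRAG API.

```python
SEP = "<SEP>"  # assumed separator for multi-source fields; verify against your LightRAG version

def merge_entities(per_chunk_entities):
    """Collapse per-chunk extractions of the same entity into one node,
    keeping every distinct description and chunk-level source_id."""
    merged = {}
    for ent in per_chunk_entities:
        name = ent["entity_name"]
        if name not in merged:
            merged[name] = dict(ent)
            continue
        node = merged[name]
        if ent["description"] not in node["description"]:
            node["description"] += SEP + ent["description"]  # keep all contexts
        if ent["source_id"] not in node["source_id"]:
            node["source_id"] += SEP + ent["source_id"]       # keep per-chunk provenance
    return list(merged.values())

# Usage: the same entity extracted from two chunks of one USTR document.
extracted = [
    {"entity_name": "USTR", "entity_type": "Organization",
     "description": "U.S. Trade Representative, issued the notice.",
     "source_id": "ustr_doc1_chunk1"},
    {"entity_name": "USTR", "entity_type": "Organization",
     "description": "Agency extending the China tariff exclusions.",
     "source_id": "ustr_doc1_chunk2"},
]
print(merge_entities(extracted))
```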
Thank you so much in advance :)
Your insights would be helpful as we scale our RAG project.
Best Regards,
Byeonggyu