Blog: Embedding User-Defined Indexes in Apache Parquet Files #79

zhuqi-lucas · 2025-07-04T07:40:58Z

This PR adds a blog post about our work on the custom Parquet index example for DataFusion:

Add an example of embedding indexes inside a parquet file datafusion#16395
Closes Blog Post for Accelerating Query Processing with Specialized Indexes datafusion#16372

This is an initial draft that still needs polishing.

zhuqi-lucas · 2025-07-04T07:42:17Z

I am not an expert at writing blog posts, so everyone is welcome to help polish it. Thanks a lot! cc @alamb

2010YOUY01 · 2025-07-04T08:57:07Z

This post is great, I find the content easy to follow.

I have a suggestion for the first paragraph though: perhaps we should emphasize the motivation more clearly at the beginning. I think @alamb's point in the YouTube video is particularly compelling: we don’t need to invent a new file format to support additional indexing. Instead, we can extend Parquet with custom indexes without compromising the file format’s interchangeability.

zhuqi-lucas · 2025-07-04T09:40:00Z

> This post is great, I find the content easy to follow.
>
> I have a suggestion for the first paragraph though: perhaps we should emphasize the motivation more clearly at the beginning. I think @alamb's point in the YouTube video is particularly compelling: we don’t need to invent a new file format to support additional indexing. Instead, we can extend Parquet with custom indexes without compromising the file format’s interchangeability.

Thank you @2010YOUY01 for the review, good point. In the latest version I added the point that we don't need a new format, because Parquet itself is already very good.

alamb · 2025-07-04T09:54:56Z

This is amazing -- thank you @zhuqi-lucas and @2010YOUY01 -- I will review this asap, but as today is a holiday in the US I may not have a chance to do so until tomorrow.

zhuqi-lucas · 2025-07-04T09:58:48Z

Thank you @alamb, I will keep polishing it before you review!

alamb

Thank you so much @zhuqi-lucas

I left some "big picture" comments. The main one is to suggest we structure this post to follow the "low key technical evangelism" style:

1. Teach the readers something general (in this case, how Parquet files are laid out and what the standard index structures are)
2. Explain how to use DataFusion to do something cool with this tech (in this case, make a custom index)

I made some diagrams to go along with the background Parquet section. If you like them, I can push them into this PR and work on the background section.

https://docs.google.com/presentation/d/1aFjTLEDJyDqzFZHgcmRxecCvLKKXV2OvyEpTQFCNZPw/edit?slide=id.g33d7337a5a0_0_85 (Happy to give you edit access too -- just request it on the slides and I will do so)

alamb · 2025-07-04T18:01:43Z

content/blog/datafusion-custom-parquet-index.md

+* **Risks synchronization issues:** Removing or renaming one file breaks the index.
+* **Reduces portability:** Harder to share or move Parquet data when the index is external.
+
+Meanwhile, critics of Parquet’s extensibility point to the lack of a *standard* way to embed auxiliary data (see Amudai). But in practice, Parquet tolerates unknown content gracefully:


Here is a link to the amudai docs that might be good to include: https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md

Good suggestion @alamb !
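
One way to see why extra bytes are harmless: a Parquet reader locates the footer from the end of the file (a 4-byte little-endian footer length followed by the `PAR1` magic) and reaches data pages only through offsets recorded in that footer, so bytes that nothing points to are never read. The sketch below is a toy model of that layout in Python, not a real Parquet writer. The footer is a placeholder byte string rather than Thrift-encoded metadata, and `build_file` and `read_footer` are hypothetical helper names.

```python
import struct

MAGIC = b"PAR1"


def build_file(data: bytes, footer: bytes, extra: bytes = b"") -> bytes:
    """Toy layout: magic, data, optional unreferenced bytes, footer, footer length, magic."""
    return MAGIC + data + extra + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer(file_bytes: bytes) -> bytes:
    """Find the footer the way a reader does: start from the end of the file."""
    if file_bytes[-4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    (footer_len,) = struct.unpack("<I", file_bytes[-8:-4])
    footer_start = len(file_bytes) - 8 - footer_len
    return file_bytes[footer_start:footer_start + footer_len]


# Placeholder contents; a real footer is Thrift-encoded FileMetaData.
plain = build_file(b"column chunks", b"placeholder-footer")
with_index = build_file(b"column chunks", b"placeholder-footer", extra=b"custom index bytes")

# Both files yield the same footer even though one carries extra bytes.
same_footer = read_footer(plain) == read_footer(with_index)
```

The extra bytes sit between the data and the footer, so a reader that knows nothing about them sees an ordinary file.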

alamb · 2025-07-04T18:05:11Z

content/blog/datafusion-custom-parquet-index.md

@@ -0,0 +1,232 @@
+## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes
+
+It’s a common misconception that Parquet can only deliver basic Min/Max pruning and Bloom filters—and that adding anything “smarter” requires inventing a whole new file format. In fact, Parquet’s design already lets you embed custom indexing data *inside* the file (via unused footer metadata and byte regions) without breaking compatibility. In this post, we’ll show how DataFusion can leverage a **compact distinct‑value index** written directly into Parquet files—preserving complete interchangeability with other tools—while enabling ultra‑fast file‑level pruning.


This is a great introduction.

I suggest we also add some background on Parquet in general to make this post more self-contained (I can do this too).

Something like:

"In this post, we’ll briefly review the Apache Parquet file format, explain how arbitrary indexes can be stored in Parquet files, and then show how to use Apache DataFusion to store and use a custom index in Parquet files, all while preserving complete interchangeability with other tools."

Good suggestion @alamb !

alamb · 2025-07-04T18:05:31Z

content/blog/datafusion-custom-parquet-index.md

+
+It’s a common misconception that Parquet can only deliver basic Min/Max pruning and Bloom filters—and that adding anything “smarter” requires inventing a whole new file format. In fact, Parquet’s design already lets you embed custom indexing data *inside* the file (via unused footer metadata and byte regions) without breaking compatibility. In this post, we’ll show how DataFusion can leverage a **compact distinct‑value index** written directly into Parquet files—preserving complete interchangeability with other tools—while enabling ultra‑fast file‑level pruning.
+
+And besides the custom index, a straightforward rewritten parquet file can have good improvement also. For example, rewriting ClickBench partitioned dataset with better settings* (not resorting) improves


FYI @JigaoLuo and @XiangpengHao have been discussing this topic as well here:

XiangpengHao/liquid-cache#227 -- we could perhaps direct readers there for more information and insight

Good suggestion @alamb !

alamb · 2025-07-04T18:06:52Z

content/blog/datafusion-custom-parquet-index.md

+4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code from
+   [`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs).
+
+> **Prerequisite:** this example requires the new “buffered write” API in


I think we should phrase this as a release version that people can see (arrow-rs 55.2.0). Also, I think it should be put closer to the actual code example (it doesn't need to be in the introduction).

Good point @alamb !

alamb · 2025-07-04T18:16:42Z

content/blog/datafusion-custom-parquet-index.md

+
+---
+
+## High‑Level Design


I suggest we keep the post at a higher level here and omit the details of the special index's structure (and refer readers to the example instead).

In my mind the main points we are trying to get across in the article are:

Parquet is extensible with custom indexes

You can use DataFusion to write and read them

The idea is that readers will want to put their own special indexes in Parquet rather than using the particular implementation we have in the example. So I think focusing on the things they would have to do and de-emphasizing what is specific to the distinct values index would help get this point across better

Great suggestion @alamb !

alamb · 2025-07-04T18:19:27Z

content/blog/datafusion-custom-parquet-index.md

+1. Open the Parquet footer and extract `distinct_index_offset`.
+2. Seek to that offset in the file.
+3. Read and validate `IDX1` magic.
+4. Read the 8‑byte length and then the payload.
+5. Reconstruct `DistinctIndex` from newline‑delimited strings.


I suggest cutting out the details of reading the index, so the list becomes:

1. Open the Parquet footer and extract `distinct_index_offset`.

2. Read the `DistinctIndex` at that offset.
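
For readers following the thread, the five steps quoted in the hunk above can be sketched in a few lines. This is a minimal illustration in Python, not the Rust code from `parquet_embedded_index.rs`: the file is simulated with an in-memory buffer, the footer key-value metadata is a plain dict, the little-endian byte order of the 8-byte length is an assumption, and `write_index` and `read_index` are hypothetical helper names.

```python
import io
import struct

INDEX_MAGIC = b"IDX1"


def write_index(buf: io.BytesIO, values: set) -> int:
    """Append magic, 8-byte length, and newline-delimited payload; return the start offset."""
    offset = buf.tell()
    payload = "\n".join(sorted(values)).encode("utf-8")
    buf.write(INDEX_MAGIC)
    buf.write(struct.pack("<Q", len(payload)))  # assumed little-endian u64
    buf.write(payload)
    return offset


def read_index(buf: io.BytesIO, offset: int) -> set:
    """Seek to the offset, validate the magic, read the length, then the payload."""
    buf.seek(offset)
    if buf.read(4) != INDEX_MAGIC:
        raise ValueError("bad index magic")
    (length,) = struct.unpack("<Q", buf.read(8))
    payload = buf.read(length)
    if len(payload) != length:
        raise ValueError("truncated index payload")
    return set(payload.decode("utf-8").split("\n")) if payload else set()


buf = io.BytesIO()
buf.write(b"fake row group bytes")  # stands in for the real column data
offset = write_index(buf, {"foo", "bar", "baz"})

# Stand-in for the footer key-value metadata, which stores values as strings.
footer_metadata = {"distinct_index_offset": str(offset)}

index = read_index(buf, int(footer_metadata["distinct_index_offset"]))
```

Only the offset lives in the footer; the index bytes themselves can sit anywhere in the file that the footer does not otherwise reference.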

alamb · 2025-07-04T18:20:09Z

content/blog/datafusion-custom-parquet-index.md

+* **Full format compatibility** with standard tools.
+* **Minimal operational overhead**—no special catalog or sidecar management.
+
+This technique illustrates how Parquet’s extensibility can be harnessed for powerful, lightweight indexing, all within the existing format. Give it a try in your next DataFusion project!


This might be a good place to also mention that Parquet itself has many more optimization opportunities (sort order, row group size, etc.), which are currently mentioned in the introduction.

Thank you @alamb for this good suggestion!

zhuqi-lucas · 2025-07-05T03:48:15Z

> https://docs.google.com/presentation/d/1aFjTLEDJyDqzFZHgcmRxecCvLKKXV2OvyEpTQFCNZPw/edit?slide=id.g33d7337a5a0_0_85

Thank you @alamb for the review and the great suggestions! I will try to address them today. Feel free to edit this blog and correct me if I am missing anything, thanks!


zhuqi-lucas · 2025-07-05T05:15:33Z

content/blog/datafusion-custom-parquet-index.md

+---
+
+## 1. Parquet 101: File Anatomy & Native Pruning Hooks
+TODO add image here?


@alamb I tried to add the image, but it does not render well in my local preview and I am not sure why, so I added a TODO here.

zhuqi-lucas · 2025-07-05T05:16:44Z

Thank you @alamb! I addressed the first round of comments, but the image is still not added because it does not render well locally.

alamb · 2025-07-05T14:47:57Z

Thanks @zhuqi-lucas -- I will keep looking at this later today

JigaoLuo · 2025-07-05T15:07:59Z

Hi @zhuqi-lucas, I've gone through the blog twice, and it looks great overall. I just have one very small nitpick above.

Regarding the content: One suggestion would be to include a reference to the fact that there has been some criticism and attempts to incorporate HyperLogLog into Parquet, as mentioned here apache/datafusion#16374 (comment).

(I also have a few personal questions about the new index itself, but I'll post them on the Issue page instead of here.)

zhuqi-lucas · 2025-07-05T15:52:05Z

> Thanks @zhuqi-lucas -- I will keep looking at this later today

Thank you @alamb!

zhuqi-lucas · 2025-07-05T15:52:53Z

> Hi @zhuqi-lucas, I've gone through the blog twice, and it looks great overall. I just have one very small nitpick above.
>
> Regarding the content: One suggestion would be to include a reference to the fact that there has been some criticism and attempts to incorporate HyperLogLog into Parquet, as mentioned here apache/datafusion#16374 (comment).
>
> (I also have a few personal questions about the new index itself, but I'll post them on the Issue page instead of here.)

Thank you @JigaoLuo, good point. I am working on some urgent bug fixes and will try to add your suggestions soon. Thanks!

alamb · 2025-07-05T20:00:37Z

Thanks -- I am going to spend an hour or so taking a pass through this blog trying to get the formatting to work out

So exciting

JigaoLuo · 2025-07-05T20:37:09Z

Regarding my impression while reading: "the Embedded Index is just a hashset to speed up scans, which is an overhead to Parquet", as mentioned in a follow-up here: #79 (comment)

If other readers have the same impression, it might unintentionally limit how they perceive the potential of the Embedded Index. To address this, we could consider adding a short Outlook section (either at the beginning or the end of the blog) to explicitly highlight what the Embedded Index is capable of. It’s not just a hashset for pruning; in principle, it could support a wide range of use cases. Use cases are also discussed here: apache/datafusion#16374 (comment)

I’d be happy to help draft such an Outlook section, pending confirmation from your side.

alamb · 2025-07-05T21:10:49Z

I just pushed a commit that reworked the intro a bit and started filling out the background

@JigaoLuo the outlook section you describe sounds great. I envision it right after the `## 1. Parquet 101: File Anatomy & Standard Index Structures` section, perhaps as `## 2. Extending Parquet with Special Indexes` (this is where figure 2 goes and where we will explain how to embed a custom index).

So it makes a lot of sense to mention the potential use cases there (and that the index can be written after each row group or at the end of the file, and it can hold information for each row group, individual row groups, columns, etc., whatever you want).

I would also be interested to hear what @zhuqi-lucas thinks

comphead · 2025-07-07T16:17:57Z

Thanks @zhuqi-lucas @JigaoLuo @alamb
Added some possible minor improvements

djanderson · 2025-07-07T19:43:59Z

content/blog/2025-07-14-user-defined-parquet-indexes.md

+  </tr>
+</table>
+
+The distinct value index will contain the values `foo`, `bar`, and `baz`. Using traditional min/max statistics would store the minimum (`bar`) and maximum (`foo`) values, which would not allow quickly skipping this file for a query like `SELECT * FROM t WHERE Category = 'bas'` as `bas` is between `bar` and `foo`.


Just wanted to confirm you meant to use bas here and not baz.

I did (as that would result in skipping the file), but it is not clear from the text. I will clarify.
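
The distinction being discussed here is easy to check by hand. The sketch below, a small Python illustration rather than anything from the DataFusion example, compares the two pruning checks for the values in the quoted paragraph; the function names are hypothetical.

```python
values = ["foo", "bar", "baz"]


def minmax_may_contain(column_values, needle):
    """Min/max statistics can only say whether the needle falls inside the range."""
    return min(column_values) <= needle <= max(column_values)


def distinct_may_contain(column_values, needle):
    """A distinct-value index knows exactly which values are present."""
    return needle in set(column_values)


# 'bas' sorts between 'bar' (min) and 'foo' (max), so min/max cannot rule the file out.
minmax_keeps_file = minmax_may_contain(values, "bas")
# The distinct-value index shows 'bas' is absent, so the file can be skipped.
distinct_keeps_file = distinct_may_contain(values, "bas")
```

With `baz` instead of `bas`, both checks would keep the file, which is why the typo-looking value is the one that makes the point.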


zhuqi-lucas · 2025-07-08T06:14:51Z

I made some changes based on the latest comments from folks.

FYI @alamb, please correct me if I made any wrong changes, thanks a lot!

comphead · 2025-07-08T17:51:03Z

I would appreciate it if anyone could tell me whether it's possible to read the blog draft rendered with formatting.

kevinjqliu · 2025-07-08T18:14:46Z

I can render it locally. also #86 should make local dev easier

kevinjqliu

LGTM! Added a comment about one wrong hyperlink


alamb · 2025-07-09T00:46:40Z

> I made some changes based on the latest comments from folks.
>
> FYI @alamb, please correct me if I made any wrong changes, thanks a lot!

Thank you -- it is looking great. I spent some time obsessing over the wording some more (probably unnecessarily) but I am so stoked about this post I can't really help myself

zhuqi-lucas · 2025-07-09T13:40:07Z

> > I made some changes based on the latest comments from folks.
> > FYI @alamb, please correct me if I made any wrong changes, thanks a lot!
>
> Thank you -- it is looking great. I spent some time obsessing over the wording some more (probably unnecessarily) but I am so stoked about this post I can't really help myself

Thank you @alamb, it looks great!

alamb · 2025-07-14T13:22:39Z

Thanks again everyone -- now time to make some noise on the social medias

The blog is published here: https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/

Thanks again @JigaoLuo and @zhuqi-lucas -- I think this post will become an important part of the parquet conversation

JigaoLuo · 2025-07-14T17:14:14Z

@zhuqi-lucas @alamb Thanks. I’ll also try to share it on LinkedIn. Would it be okay if I make a copy of your post and include my affiliation (Systems Group @ TU Darmstadt)?

alamb · 2025-07-14T20:08:24Z

> @zhuqi-lucas @alamb Thanks. I’ll also try to share it on LinkedIn. Would it be okay if I make a copy of your post and include my affiliation (Systems Group @ TU Darmstadt)?

Yes, of course.

Perhaps you could make a PR to update the post itself. To do so, you could make a PR to modify https://github.com/apache/datafusion-site/blob/main/content/blog/2025-03-20-parquet-pruning.md and then tag me for a review

We could also add an "about the authors" section to the post itself. For example the "About the Authors" section from https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one/ is from

datafusion-site/content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md

Line 217 in 61aa76e

# About the Authors

zhuqi-lucas · 2025-07-15T03:32:23Z

> @zhuqi-lucas @alamb Thanks. I’ll also try to share it on LinkedIn. Would it be okay if I make a copy of your post and include my affiliation (Systems Group @ TU Darmstadt)?

Yes, of course, feel free to do it!

…arquet Files #79 (#89)

* update author info
  Signed-off-by: Jigao Luo <[email protected]>
* update lucas
  Signed-off-by: Jigao Luo <[email protected]>
* Reduce optimizer focus for Andrew, add affiliations to byline
* fix footnot
* update myself
  Signed-off-by: Jigao Luo <[email protected]>

Signed-off-by: Jigao Luo <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

zhuqi-lucas added 2 commits July 4, 2025 15:14

draft blog for datafusion custom parquet index

392e7cd

polish blog

7b65248

zhuqi-lucas mentioned this pull request Jul 4, 2025

Blog Post for Accelerating Query Processing with Specialized Indexes apache/datafusion#16372

Closed

polish to add why we don't need new format

c60bb82

Add clickbench rewriten case to improve parquet is good enough

a00985c

alamb reviewed Jul 4, 2025


zhuqi-lucas and others added 7 commits July 5, 2025 11:48

Update content/blog/datafusion-custom-parquet-index.md

1844099

Co-authored-by: Andrew Lamb <[email protected]>

start addressing comments

7b0a17f

continue addressing comments

8b121e2

continue polish doc

9723b1b

address comments

06e582c

address comments

1451714

Merge branch 'custom_index_blog' of github.com:zhuqi-lucas/datafusion…

a469a79

…-site into custom_index_blog

zhuqi-lucas commented Jul 5, 2025


alamb added 2 commits July 5, 2025 17:03

Add frontmatter and images, start filling out parquet background

052cbc2

tweak

e0d1e79

djanderson reviewed Jul 7, 2025


Update content/blog/2025-07-14-user-defined-parquet-indexes.md

63fdae3

Co-authored-by: Oleks V <[email protected]>

2010YOUY01 reviewed Jul 8, 2025


zhuqi-lucas added 3 commits July 8, 2025 13:59

Address comment

d1001a6

address comments

134207f

address comments

ae39412

kevinjqliu approved these changes Jul 8, 2025


Apply suggestions from code review

b278214

alamb approved these changes Jul 8, 2025


alamb added 5 commits July 8, 2025 20:30

Hone motivating example text

00beea9

Hone motivating example text

9798bae

hone

d61567f

Add note in example about files_to_scan

3ccfeb3

more tweaks

019d236

obsess

7add066

alamb mentioned this pull request Jul 11, 2025

DataFusion 48.0.0 blog post #84

Merged

alamb merged commit 61aa76e into apache:main Jul 14, 2025
1 check passed

JigaoLuo mentioned this pull request Jul 15, 2025

[Update author info] Blog: Embedding User-Defined Indexes in Apache Parquet Files #79 #89

Merged
