Blog: Embedding User-Defined Indexes in Apache Parquet Files #79
|
I am not an expert at blog writing, so folks are welcome to polish it together. Thanks a lot! cc @alamb |
|
This post is great, I find the content easy to follow. I have a suggestion for the first paragraph though: perhaps we should emphasize the motivation more clearly at the beginning. I think @alamb's point in the YouTube video is particularly compelling: we don’t need to invent a new file format to support additional indexing. Instead, we can extend Parquet with custom indexes without compromising the file format’s interchangeability. |
Thank you @2010YOUY01 for the review, good point. In the latest version I added the point that we don't need a new format; Parquet itself is very good. |
|
This is amazing -- thank you @zhuqi-lucas and @2010YOUY01 -- I will review this asap, but as today is a holiday in the US I may not have a chance to do so until tomorrow. |
|
Thank you @alamb, I will keep polishing it before you review! |
Thank you so much @zhuqi-lucas
I left some "big picture" comments - the main one is to suggest we structure this post to follow the "low key technical evangelism" style:
- Teach the readers something general (in this case how parquet files are laid out and what the standard index structures are)
- Explain how to use DataFusion to do something cool with this tech (in this case make a custom index)
I made some diagrams to go along with the background Parquet section. If you like them, I can push them into this PR and work on the background section.
https://docs.google.com/presentation/d/1aFjTLEDJyDqzFZHgcmRxecCvLKKXV2OvyEpTQFCNZPw/edit?slide=id.g33d7337a5a0_0_85 (Happy to give you edit access too -- just request it on the slides and I will do so)
| * **Risks synchronization issues:** Removing or renaming one file breaks the index. | ||
| * **Reduces portability:** Harder to share or move Parquet data when the index is external. | ||
|
|
||
| Meanwhile, critics of Parquet’s extensibility point to the lack of a *standard* way to embed auxiliary data (see Amudai). But in practice, Parquet tolerates unknown content gracefully: |
Here is a link to the amudai docs that might be good to include: https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md
Good suggestion @alamb !
| @@ -0,0 +1,232 @@ | |||
| ## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes | |||
|
|
|||
| It’s a common misconception that Parquet can only deliver basic Min/Max pruning and Bloom filters—and that adding anything “smarter” requires inventing a whole new file format. In fact, Parquet’s design already lets you embed custom indexing data *inside* the file (via unused footer metadata and byte regions) without breaking compatibility. In this post, we’ll show how DataFusion can leverage a **compact distinct‑value index** written directly into Parquet files—preserving complete interchangeability with other tools—while enabling ultra‑fast file‑level pruning. | |||
This is a great introduction.
I suggest we also add some background on Parquet in general to make this post more self-contained (I can do this too).
Something like:
"In this post, we’ll briefly review the Apache Parquet file format, explain how arbitrary indexes can be stored in Parquet files, and then show how to use Apache DataFusion to store and use a custom index in Parquet files, all while preserving complete interchangeability with other tools."
Good suggestion @alamb !
|
|
||
| It’s a common misconception that Parquet can only deliver basic Min/Max pruning and Bloom filters—and that adding anything “smarter” requires inventing a whole new file format. In fact, Parquet’s design already lets you embed custom indexing data *inside* the file (via unused footer metadata and byte regions) without breaking compatibility. In this post, we’ll show how DataFusion can leverage a **compact distinct‑value index** written directly into Parquet files—preserving complete interchangeability with other tools—while enabling ultra‑fast file‑level pruning. | ||
|
|
||
| And besides the custom index, a straightforward rewritten parquet file can have good improvement also. For example, rewriting ClickBench partitioned dataset with better settings* (not resorting) improves |
FYI @JigaoLuo and @XiangpengHao have been discussing this topic as well here:
XiangpengHao/liquid-cache#227 -- we could perhaps direct readers there for more information and insight
Good suggestion @alamb !
| 4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code from | ||
| [`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs). | ||
|
|
||
| > **Prerequisite:** this example requires the new “buffered write” API in |
I think we should phrase this as a release version that people can see (arrow-rs 55.2.0) -- also, I think it should be put closer to the actual code example (it doesn't need to be in the introduction)
Good point @alamb !
|
|
||
| --- | ||
|
|
||
| ## High‑Level Design |
I suggest we keep the post at a higher level here and omit the details of the special index's structure (and refer readers to the example instead).
In my mind the main points we are trying to get across in the article are:
- Parquet is extensible with custom indexes
- You can use DataFusion to write and read them
The idea is that readers will want to put their own special indexes in Parquet rather than using the particular implementation we have in the example. So I think focusing on the things they would have to do and de-emphasizing what is specific to the distinct values index would help get this point across better
Great suggestion @alamb !
| 1. Open the Parquet footer and extract `distinct_index_offset`. | ||
| 2. Seek to that offset in the file. | ||
| 3. Read and validate `IDX1` magic. | ||
| 4. Read the 8‑byte length and then the payload. | ||
| 5. Reconstruct `DistinctIndex` from newline‑delimited strings. |
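For readers following the thread, the five read steps quoted above can be sketched with only the Rust standard library. This is a minimal illustration and not the code from `parquet_embedded_index.rs`: the function names `write_distinct_index` and `read_distinct_index` are hypothetical, the little-endian length encoding is an assumption, and the offset is passed in directly rather than looked up from the `distinct_index_offset` footer entry.

```rust
use std::collections::HashSet;
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Magic bytes named in the draft; they mark the start of the index region.
const INDEX_MAGIC: &[u8; 4] = b"IDX1";

/// Hypothetical helper: append the index at the end of `out` and return the
/// offset where it starts. Assumed layout: magic, 8-byte little-endian
/// length, then newline-delimited values (values must not contain '\n').
fn write_distinct_index<W: Write + Seek>(
    out: &mut W,
    values: &HashSet<String>,
) -> io::Result<u64> {
    let offset = out.seek(SeekFrom::End(0))?;
    // Sort so the bytes written are deterministic for a given set.
    let mut sorted: Vec<&str> = values.iter().map(String::as_str).collect();
    sorted.sort_unstable();
    let payload = sorted.join("\n");
    out.write_all(INDEX_MAGIC)?;
    out.write_all(&(payload.len() as u64).to_le_bytes())?;
    out.write_all(payload.as_bytes())?;
    Ok(offset)
}

/// Hypothetical helper covering steps 2-5: seek, validate the magic, read
/// the length and payload, and rebuild the set of distinct values.
fn read_distinct_index<R: Read + Seek>(
    input: &mut R,
    offset: u64,
) -> io::Result<HashSet<String>> {
    input.seek(SeekFrom::Start(offset))?;

    let mut magic = [0u8; 4];
    input.read_exact(&mut magic)?;
    if &magic != INDEX_MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing IDX1 magic"));
    }

    let mut len_bytes = [0u8; 8];
    input.read_exact(&mut len_bytes)?;
    let len = u64::from_le_bytes(len_bytes);

    // Read exactly `len` bytes; a shorter read means the file is truncated.
    let mut payload = String::new();
    let read = input.by_ref().take(len).read_to_string(&mut payload)?;
    if read as u64 != len {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated index payload"));
    }

    Ok(payload.lines().map(str::to_owned).collect())
}
```

In a real file the returned offset would presumably be recorded in the Parquet footer under `distinct_index_offset`, which is the step this sketch leaves out.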
I suggest cutting out the details of reading the index, so the list becomes:
- Open the Parquet footer and extract `distinct_index_offset`.
- Read the `DistinctIndex` using that index
| * **Full format compatibility** with standard tools. | ||
| * **Minimal operational overhead**—no special catalog or sidecar management. | ||
|
|
||
| This technique illustrates how Parquet’s extensibility can be harnessed for powerful, lightweight indexing, all within the existing format. Give it a try in your next DataFusion project! |
This might be a good place to also mention that Parquet itself has many more optimization opportunities (sort, row group size, etc.) that are currently in the introduction
Thank you @alamb for this good suggestion!
|
Thank you @alamb for the review and great suggestions! I will try to address them today. Feel free to edit this blog and correct me if I am missing anything, thanks! |
Co-authored-by: Andrew Lamb <[email protected]>
…-site into custom_index_blog
| --- | ||
|
|
||
| ## 1. Parquet 101: File Anatomy & Native Pruning Hooks | ||
| TODO add image here? |
@alamb I tried to add the image, but it does not render well in my local preview and I am not sure why, so I added a TODO here...
|
Thank you @alamb! I addressed the first round of comments, but the image is still not added to the content because it does not render well locally. |
|
Thanks @zhuqi-lucas -- I will keep looking at this later today |
|
Hi @zhuqi-lucas, I've gone through the blog twice, and it looks great overall. I just have one very small nitpick above. Regarding the content: One suggestion would be to include a reference to the fact that there has been some criticism and attempts to incorporate HyperLogLog into Parquet, as mentioned here apache/datafusion#16374 (comment). (I also have a few personal questions about the new index itself, but I'll post them on the Issue page instead of here.) |
Thank you @alamb ! |
Thank you @JigaoLuo for the good point. I am working on some urgent bug fixes and will try to add your suggestions soon. Thanks! |
|
Thanks -- I am going to spend an hour or so taking a pass through this blog trying to get the formatting to work out. So exciting! |
|
Regarding my impression while reading: "the Embedded Index is just a hashset to speed up scans, which is an overhead to Parquet," as mentioned in a follow-up here: #79 (comment). If other readers have the same impression, it might unintentionally limit how they perceive the potential of the Embedded Index. To address this, we could consider adding a short Outlook section (either at the beginning or the end of the blog) to explicitly highlight what the Embedded Index is capable of. It’s not just a hashset for pruning; in principle, it could support a wide range of use cases. Use cases are also discussed here: apache/datafusion#16374 (comment). I’d be happy to help draft such an Outlook section, pending confirmation from your side. |
|
I just pushed a commit that reworked the intro a bit and started filling out the background
@JigaoLuo the outlook section you describe sounds great. I envision it right after the Section Perhaps like (this is where figure 2 goes and where we will explain how to embed a custom index). I would also be interested to hear what @zhuqi-lucas thinks |
|
Thanks @zhuqi-lucas @JigaoLuo @alamb |
| </tr> | ||
| </table> | ||
|
|
||
| The distinct value index will contain the values `foo`, `bar`, and `baz`. Using traditional min/max statistics would store the minimum (`bar`) and maximum (`foo`) values, which would not allow quickly skipping this file for a query like `SELECT * FROM t WHERE Category = 'bas'` as `bas` is between `bar` and `foo`. |
Just wanted to confirm you meant to use bas here and not baz.
I did (as that would result in skipping the file), but it is not clear from the text. I will clarify.
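The `bas` versus `baz` point can be checked with a small standard-library sketch. `FileSummary` and its methods are hypothetical names used only for illustration and are not DataFusion or Parquet APIs; the comparison is plain byte-wise string ordering, which is assumed to match how the min/max statistics in the example are ordered.

```rust
use std::collections::HashSet;

/// Hypothetical per-file summary holding both kinds of pruning information.
struct FileSummary {
    min: String,
    max: String,
    distinct: HashSet<String>,
}

impl FileSummary {
    /// Build the summary from the values of one column; `None` if empty.
    fn from_values(values: &[&str]) -> Option<Self> {
        let min = values.iter().copied().min()?.to_string();
        let max = values.iter().copied().max()?.to_string();
        let distinct = values.iter().map(|v| v.to_string()).collect();
        Some(FileSummary { min, max, distinct })
    }

    /// Min/max pruning: the file can be skipped only when the value falls
    /// outside the [min, max] range.
    fn min_max_may_contain(&self, value: &str) -> bool {
        self.min.as_str() <= value && value <= self.max.as_str()
    }

    /// Distinct-value pruning: the file can be skipped whenever the value
    /// is not in the set.
    fn distinct_may_contain(&self, value: &str) -> bool {
        self.distinct.contains(value)
    }
}
```

For a file containing `foo`, `bar`, and `baz`, the min/max check cannot rule out `bas` because it sorts between `bar` and `foo`, while the distinct-value check can, which is why the query in the draft uses `bas` rather than `baz`.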
Co-authored-by: Oleks V <[email protected]>
|
I made some changes based on the latest comments from folks. FYI @alamb, please correct me if I made any wrong changes. Thanks a lot! |
|
I'd appreciate it if anyone could tell me whether it's possible to read the blog draft rendered with formatting. |
|
I can render it locally. Also, #86 should make local dev easier |
LGTM! Added a comment about one wrong hyperlink
Thank you -- it is looking great. I spent some time obsessing over the wording some more (probably unnecessarily) but I am so stoked about this post I can't really help myself |
Thank you @alamb , it looks great! |
|
Thanks again everyone -- now it's time to make some noise on social media. The blog is published here: https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/ Thanks again @JigaoLuo and @zhuqi-lucas -- I think this post will become an important part of the Parquet conversation |
|
@zhuqi-lucas @alamb Thanks. I’ll also try to share it on LinkedIn. Would it be okay if I make a copy of your post and include my affiliation (Systems Group @ TU Darmstadt)? |
Yes of course. Perhaps you could make a PR to update the post itself. To do so you could make a PR to modify https://github.com/apache/datafusion-site/blob/main/content/blog/2025-03-20-parquet-pruning.md and then tag me for a review. We could also add an "about the authors" section to the post itself. For example the "About the Authors" section from https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one/ is from
|
Yes, of course, feel free to do it! |
…arquet Files #79 (#89) * update author info Signed-off-by: Jigao Luo <[email protected]> * update lucas Signed-off-by: Jigao Luo <[email protected]> * Reduce optimizer focus for Andrew, add affiliations to byline * fix footnot * update myself Signed-off-by: Jigao Luo <[email protected]> --------- Signed-off-by: Jigao Luo <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>


This PR adds a blog post about our custom Parquet index example for DataFusion:
Add an example of embedding indexes inside a parquet file datafusion#16395
Closes Blog Post for Accelerating Query Processing with Specialized Indexes datafusion#16372
This is the initial draft; it still needs polishing.