@zhuqi-lucas
Contributor

@zhuqi-lucas zhuqi-lucas commented Jul 4, 2025

This is an attempt to blog about our work on the custom Parquet example for DataFusion:

This is the initial draft version; we still need to polish it.

@zhuqi-lucas
Contributor Author

I am not an expert at writing blog posts, so folks are welcome to help polish it together. Thanks a lot! cc @alamb

@2010YOUY01
Contributor

This post is great, I find the content easy to follow.

I have a suggestion for the first paragraph though: perhaps we should emphasize the motivation more clearly at the beginning. I think @alamb's point in the YouTube video is particularly compelling: we don’t need to invent a new file format to support additional indexing. Instead, we can extend Parquet with custom indexes without compromising the file format’s interchangeability.

@zhuqi-lucas
Contributor Author

> This post is great, I find the content easy to follow.
>
> I have a suggestion for the first paragraph though: perhaps we should emphasize the motivation more clearly at the beginning. I think @alamb's point in the YouTube video is particularly compelling: we don’t need to invent a new file format to support additional indexing. Instead, we can extend Parquet with custom indexes without compromising the file format’s interchangeability.

Thank you @2010YOUY01 for the review. Good point: in the latest version I added the point that we don't need a new format, since Parquet itself is already very good.

@alamb
Contributor

alamb commented Jul 4, 2025

This is amazing -- thank you @zhuqi-lucas and @2010YOUY01 -- I will review this asap, but as today is a holiday in the US I may not have a chance to do so until tomorrow.

@zhuqi-lucas
Contributor Author

Thank you @alamb, I will keep polishing it before you review!

Contributor

@alamb alamb left a comment

Thank you so much @zhuqi-lucas

I left some "big picture" comments. The main one is to suggest we structure this post to follow the "low key technical evangelism" style:

  1. Teach the readers something general (in this case how parquet files are laid out and what the standard index structures are)
  2. Explain how to use DataFusion to do something cool with this tech (in this case make a custom index)

I made some diagrams to go along with the background Parquet section. If you like them, I can push them into this PR and work on the background section.

https://docs.google.com/presentation/d/1aFjTLEDJyDqzFZHgcmRxecCvLKKXV2OvyEpTQFCNZPw/edit?slide=id.g33d7337a5a0_0_85 (Happy to give you edit access too -- just request on the slides and i will do so)

* **Risks synchronization issues:** Removing or renaming one file breaks the index.
* **Reduces portability:** Harder to share or move Parquet data when the index is external.

Meanwhile, critics of Parquet’s extensibility point to the lack of a *standard* way to embed auxiliary data (see Amudai). But in practice, Parquet tolerates unknown content gracefully:
Contributor

Here is a link to the amudai docs that might be good to include: https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md

Contributor Author

Good suggestion @alamb !

@@ -0,0 +1,232 @@
## Accelerating Query Processing in DataFusion with Embedded Parquet Indexes

It’s a common misconception that Parquet can only deliver basic Min/Max pruning and Bloom filters—and that adding anything “smarter” requires inventing a whole new file format. In fact, Parquet’s design already lets you embed custom indexing data *inside* the file (via unused footer metadata and byte regions) without breaking compatibility. In this post, we’ll show how DataFusion can leverage a **compact distinct‑value index** written directly into Parquet files—preserving complete interchangeability with other tools—while enabling ultra‑fast file‑level pruning.
Contributor

This is a great introduction.

I suggest we also add some background on Parquet in general to make this post more self contained (I can do this too)

Something like

"In this post, we’ll briefly review the Apache Parquet file format, explain how arbitrary indexes can be stored in Parquet files, and then show how to use Apache DataFusion to store and use a custom index in Parquet files, all while preserving complete interchangeability with other tools."

Contributor Author

Good suggestion @alamb !


It’s a common misconception that Parquet can only deliver basic Min/Max pruning and Bloom filters—and that adding anything “smarter” requires inventing a whole new file format. In fact, Parquet’s design already lets you embed custom indexing data *inside* the file (via unused footer metadata and byte regions) without breaking compatibility. In this post, we’ll show how DataFusion can leverage a **compact distinct‑value index** written directly into Parquet files—preserving complete interchangeability with other tools—while enabling ultra‑fast file‑level pruning.

And besides the custom index, a straightforward rewritten parquet file can have good improvement also. For example, rewriting ClickBench partitioned dataset with better settings* (not resorting) improves
Contributor

FYI @JigaoLuo and @XiangpengHao have been discussing this topic as well here:

XiangpengHao/liquid-cache#227 -- we could perhaps direct readers there for more information and insight

Contributor Author

Good suggestion @alamb !

4. Demonstrate end‑to‑end examples (including DuckDB compatibility) using code from
[`parquet_embedded_index.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_embedded_index.rs).

> **Prerequisite:** this example requires the new “buffered write” API in
Contributor

I think we should phrase this as a release version that people can see (arrow-rs 55.2.0). Also, I think it should be put closer to the actual code example (it doesn't need to be in the introduction).

Contributor Author

Good point @alamb !


---

## High‑Level Design
Contributor

I suggest we keep the post at a higher level here and omit the details of the special index's structure (and refer readers to the example instead).

In my mind the main points we are trying to get across in the article are:

  1. Parquet is extensible with custom indexes
  2. You can use DataFusion to write and read them

The idea is that readers will want to put their own special indexes in Parquet rather than using the particular implementation we have in the example. So I think focusing on the things they would have to do and de-emphasizing what is specific to the distinct values index would help get this point across better

Contributor Author

Great suggestion @alamb !

Comment on lines 165 to 169
1. Open the Parquet footer and extract `distinct_index_offset`.
2. Seek to that offset in the file.
3. Read and validate `IDX1` magic.
4. Read the 8‑byte length and then the payload.
5. Reconstruct `DistinctIndex` from newline‑delimited strings.
Contributor

I suggest cutting out the details of reading the index, so the steps become:

  1. Open the Parquet footer and extract distinct_index_offset.
  2. Read the DistinctIndex using that index
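
For readers following the thread, the five-step read path from the draft (lines 165 to 169) could be sketched roughly as below. This is a std-only illustration, not the code from `parquet_embedded_index.rs`: the function name `read_distinct_index` is hypothetical, the little-endian length encoding is an assumption, and step 1 (parsing the Parquet footer to obtain `distinct_index_offset`) is left out, so the offset is simply passed in.

```rust
use std::collections::HashSet;
use std::io::{self, Read, Seek, SeekFrom};

/// Magic bytes named in the draft; the rest of the layout is assumed.
const INDEX_MAGIC: &[u8; 4] = b"IDX1";

/// Hypothetical helper: read a distinct-value index stored at `offset`.
/// Assumed layout: magic (4 bytes) | length (u64, little-endian) | payload.
fn read_distinct_index<R: Read + Seek>(
    reader: &mut R,
    offset: u64,
) -> io::Result<HashSet<String>> {
    // Step 2: seek to the offset recorded in the footer metadata.
    reader.seek(SeekFrom::Start(offset))?;

    // Step 3: read and validate the magic.
    let mut magic = [0u8; 4];
    reader.read_exact(&mut magic)?;
    if &magic != INDEX_MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad index magic"));
    }

    // Step 4: read the 8-byte length, then the payload.
    // (A production reader would bound `len` before allocating.)
    let mut len_buf = [0u8; 8];
    reader.read_exact(&mut len_buf)?;
    let len = u64::from_le_bytes(len_buf) as usize;
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;

    // Step 5: rebuild the set from newline-delimited strings.
    let text = String::from_utf8(payload)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    Ok(text.lines().map(str::to_owned).collect())
}
```

Collapsing steps 2 to 5 into a single "read the index" step, as suggested, keeps the post focused on what any custom index needs: an offset in the footer and a reader that understands the bytes at that offset.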

* **Full format compatibility** with standard tools.
* **Minimal operational overhead**—no special catalog or sidecar management.

This technique illustrates how Parquet’s extensibility can be harnessed for powerful, lightweight indexing, all within the existing format. Give it a try in your next DataFusion project!
Contributor

This might be a good place to also mention that Parquet itself has many more optimization opportunities (sort, row group size, etc.), a point that is currently made in the introduction.

Contributor Author

Thank you @alamb for this good suggestion!

@zhuqi-lucas
Contributor Author

https://docs.google.com/presentation/d/1aFjTLEDJyDqzFZHgcmRxecCvLKKXV2OvyEpTQFCNZPw/edit?slide=id.g33d7337a5a0_0_85

Thank you @alamb for the review and great suggestions! I will try to address them today. Feel free to edit this blog and correct me if I am missing anything, thanks!

---

## 1. Parquet 101: File Anatomy & Native Pruning Hooks
TODO add image here?
Contributor Author

@alamb I tried to add the image, but it does not render well in my local preview and I am not sure why, so I added a TODO here for now.

@zhuqi-lucas
Contributor Author

Thank you @alamb! I addressed the first round of comments, but the image is still not added to the content because it does not render well in my local preview.

@alamb
Contributor

alamb commented Jul 5, 2025

Thanks @zhuqi-lucas -- I will keep looking at this later today

@JigaoLuo
Contributor

JigaoLuo commented Jul 5, 2025

Hi @zhuqi-lucas, I've gone through the blog twice, and it looks great overall. I just have one very small nitpick above.

Regarding the content: One suggestion would be to include a reference to the fact that there has been some criticism and attempts to incorporate HyperLogLog into Parquet, as mentioned here apache/datafusion#16374 (comment).

(I also have a few personal questions about the new index itself, but I'll post them on the Issue page instead of here.)

@zhuqi-lucas
Contributor Author

> Thanks @zhuqi-lucas -- I will keep looking at this later today

Thank you @alamb !

@zhuqi-lucas
Contributor Author

> Hi @zhuqi-lucas, I've gone through the blog twice, and it looks great overall. I just have one very small nitpick above.
>
> Regarding the content: One suggestion would be to include a reference to the fact that there has been some criticism and attempts to incorporate HyperLogLog into Parquet, as mentioned here apache/datafusion#16374 (comment).
>
> (I also have a few personal questions about the new index itself, but I'll post them on the Issue page instead of here.)

Thank you @JigaoLuo for the good point. I am working on some urgent bug fixes, but will try to add your suggestions soon. Thanks!

@alamb
Contributor

alamb commented Jul 5, 2025

Thanks -- I am going to spend an hour or so taking a pass through this blog trying to get the formatting to work out

So exciting

@JigaoLuo
Contributor

JigaoLuo commented Jul 5, 2025

Regarding my impression during reading: "the Embedded Index is just a hashset to speed up scans, which is an overhead to Parquet." as mentioned as a follow-up here: #79 (comment)

If other readers also have the same impression, it might unintentionally limit how they perceive the potential of the Embedded Index. To address this, we could consider adding a short Outlook section (either at the beginning or the end of the blog) to explicitly highlight what the Embedded Index is capable of. It’s not just a hashset for pruning; in principle, it could support a wide range of use cases. Use cases are also discussed here: apache/datafusion#16374 (comment)

I’d be happy to help draft such an Outlook section, pending confirmation from your side.

@alamb
Contributor

alamb commented Jul 5, 2025

I just pushed a commit that reworked the intro a bit and started filling out the background

@JigaoLuo the outlook section you describe sounds great. I envision it right after the

## 1. Parquet 101: File Anatomy & Standard Index Structures

Section

Perhaps like

## 2. Extending Parquet with Special Indexes

(this is where figure 2 goes and where we will explain how to embed a custom index).
So it makes a lot of sense to mention the potential use cases here (and that the index can be written after each row group or at the end of the file, and that it can have information for each row group, individual row groups, columns, etc., whatever you want).

I would also be interested to hear what @zhuqi-lucas thinks
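
As a rough illustration of the "written at the end of the file" option mentioned above, the write side might look like the sketch below. It is a std-only simulation under stated assumptions, not the arrow-rs writer API: the `Vec<u8>` stands in for the bytes already written to the Parquet file, the helpers `append_index` and `footer_metadata` are hypothetical, and the length is assumed to be little-endian. Only the `IDX1` magic and the `distinct_index_offset` key come from the draft.

```rust
use std::collections::BTreeMap;

/// Magic bytes and metadata key named in the draft.
const INDEX_MAGIC: &[u8; 4] = b"IDX1";
const OFFSET_KEY: &str = "distinct_index_offset";

/// Hypothetical helper: append the index after the data already in `file`
/// and return the offset at which the index starts.
/// Assumed layout: magic (4 bytes) | length (u64, little-endian) | payload.
fn append_index(file: &mut Vec<u8>, values: &[&str]) -> u64 {
    // The index begins wherever the previously written bytes end.
    let offset = file.len() as u64;
    // Newline-delimited strings, as in the draft's step 5.
    let payload = values.join("\n");
    file.extend_from_slice(INDEX_MAGIC);
    file.extend_from_slice(&(payload.len() as u64).to_le_bytes());
    file.extend_from_slice(payload.as_bytes());
    offset
}

/// Hypothetical helper: the key/value pair a writer would place in the
/// footer metadata so that readers can locate the index later.
fn footer_metadata(offset: u64) -> BTreeMap<String, String> {
    let mut kv = BTreeMap::new();
    kv.insert(OFFSET_KEY.to_string(), offset.to_string());
    kv
}
```

The same two pieces (opaque bytes somewhere before the footer, plus an offset recorded in the footer metadata) would apply whether the index covers the whole file or is written once per row group.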

@comphead
Contributor

comphead commented Jul 7, 2025

Thanks @zhuqi-lucas @JigaoLuo @alamb
Added some possible minor improvements

</tr>
</table>

The distinct value index will contain the values `foo`, `bar`, and `baz`. Using traditional min/max statistics would store the minimum (`bar`) and maximum (`foo`) values, which would not allow quickly skipping this file for a query like `SELECT * FROM t WHERE Category = 'bas'` as `bas` is between `bar` and `foo`.

Just wanted to confirm you meant to use bas here and not baz.

Contributor

I did (as `bas` would result in skipping the file), but that is not clear from the text. I will clarify.
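
To make the `bas` versus `baz` point concrete, here is a small sketch of the two pruning checks for the example values `foo`, `bar`, and `baz`. The `FileStats` type and its methods are hypothetical and exist only for illustration; DataFusion's actual pruning machinery is more general than this.

```rust
use std::collections::HashSet;

/// Hypothetical per-file statistics for a single string column.
struct FileStats {
    min: String,
    max: String,
    distinct: HashSet<String>,
}

impl FileStats {
    /// Build the statistics from the column values (assumed non-empty).
    fn from_values(values: &[&str]) -> Self {
        let min = values.iter().copied().min().expect("non-empty").to_owned();
        let max = values.iter().copied().max().expect("non-empty").to_owned();
        let distinct = values.iter().map(|v| v.to_string()).collect();
        FileStats { min, max, distinct }
    }

    /// Min/max pruning: the file can be skipped only when the value
    /// falls outside the [min, max] range.
    fn can_skip_with_min_max(&self, value: &str) -> bool {
        value < self.min.as_str() || value > self.max.as_str()
    }

    /// Distinct-value pruning: the file can be skipped whenever the
    /// value is not one of the values stored in the file.
    fn can_skip_with_distinct(&self, value: &str) -> bool {
        !self.distinct.contains(value)
    }
}
```

With these values, `bas` sorts between `bar` and `foo`, so the min/max check cannot skip the file, while the distinct-value check can. A query for `baz` cannot be skipped by either check, because `baz` is actually present.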

@zhuqi-lucas
Contributor Author

I made some changes based on the latest comments from folks.

FYI @alamb, please correct me if I made any wrong changes. Thanks a lot!

@comphead
Contributor

comphead commented Jul 8, 2025

I would appreciate it if anyone could tell me whether it is possible to read the blog draft rendered with formatting.

@kevinjqliu
Contributor

I can render it locally. Also, #86 should make local dev easier.

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM! Added a comment about one wrong hyperlink

@alamb
Contributor

alamb commented Jul 9, 2025

> I made some changes based on the latest comments from folks.
>
> FYI @alamb, please correct me if I made any wrong changes. Thanks a lot!

Thank you -- it is looking great. I spent some time obsessing over the wording some more (probably unnecessarily), but I am so stoked about this post I can't really help myself.

@zhuqi-lucas
Contributor Author

> > I made some changes based on the latest comments from folks.
> > FYI @alamb, please correct me if I made any wrong changes. Thanks a lot!
>
> Thank you -- it is looking great. I spent some time obsessing over the wording some more (probably unnecessarily), but I am so stoked about this post I can't really help myself.

Thank you @alamb , it looks great!

@alamb alamb mentioned this pull request Jul 11, 2025
@alamb alamb merged commit 61aa76e into apache:main Jul 14, 2025
1 check passed
@alamb
Contributor

alamb commented Jul 14, 2025

Thanks again everyone -- now time to make some noise on the social medias

The blog is published here: https://datafusion.apache.org/blog/2025/07/14/user-defined-parquet-indexes/

Thanks again @JigaoLuo and @zhuqi-lucas -- I think this post will become an important part of the parquet conversation

@JigaoLuo
Contributor

@zhuqi-lucas @alamb Thanks. I’ll also try to share it on LinkedIn. Would it be okay if I make a copy of your post and include my affiliation (Systems Group @ TU Darmstadt)?

@alamb
Contributor

alamb commented Jul 14, 2025

> @zhuqi-lucas @alamb Thanks. I’ll also try to share it on LinkedIn. Would it be okay if I make a copy of your post and include my affiliation (Systems Group @ TU Darmstadt)?

Yes of course.

Perhaps you could make a PR to update the post itself. To do so, you could modify https://github.com/apache/datafusion-site/blob/main/content/blog/2025-03-20-parquet-pruning.md and then tag me for a review.

We could also add an "about the authors" section to the post itself. For example the "About the Authors" section from https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-one/ is from

@zhuqi-lucas
Contributor Author

> @zhuqi-lucas @alamb Thanks. I’ll also try to share it on LinkedIn. Would it be okay if I make a copy of your post and include my affiliation (Systems Group @ TU Darmstadt)?

Yes, of course, feel free to do it!

alamb added a commit that referenced this pull request Jul 15, 2025
…arquet Files #79  (#89)

* update author info

Signed-off-by: Jigao Luo <[email protected]>

* update lucas

Signed-off-by: Jigao Luo <[email protected]>

* Reduce optimizer focus for Andrew, add affiliations to byline

* fix footnot

* update myself

Signed-off-by: Jigao Luo <[email protected]>

---------

Signed-off-by: Jigao Luo <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

Development

Successfully merging this pull request may close these issues.

Blog Post for Accelerating Query Processing with Specialized Indexes