Closed
Labels
enhancement (New feature or request)
Description
Is your feature request related to a problem or challenge?
When looking at some Samply profiles of ClickBench queries on my laptop, it appears that processing stalls several times while parsing Parquet metadata.
To reproduce, profile using Samply.
First, get the ClickBench dataset:
cd benchmarks
./bench.sh data clickbench_1
Then run
datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'data/hits.parquet' WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
Profile with samply (you must build datafusion-cli with `--profile profiling`):
samply record datafusion-cli -c "SELECT \"WatchID\", \"ClientIP\", COUNT(*) AS c, SUM(\"IsRefresh\"), AVG(\"ResolutionWidth\") FROM 'data/hits.parquet' WHERE \"SearchPhrase\" <> '' GROUP BY \"WatchID\", \"ClientIP\" ORDER BY c DESC LIMIT 10;"
Describe the solution you'd like
I think we should look into caching this metadata
There is a bunch of prior art like
Also in theory this API should allow metadata caching:
But I don't think there is a default implementation and it isn't hooked up
Describe alternatives you've considered
What I would suggest doing first is:
1. Do profiling and confirm you see the same thing
2. Make a quick and dirty global Parquet metadata cache (just put it into some global variable keyed on filename)
If you see significant performance improvements with 2, then we can figure out how to get it in for real
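For the quick and dirty experiment, a process-wide map keyed on file name could look roughly like the sketch below. It uses only the standard library. `CachedMetadata`, `METADATA_CACHE`, and `get_or_load_metadata` are hypothetical names made up for illustration, not DataFusion or parquet crate APIs; a real experiment would store the decoded Parquet footer type in place of the placeholder struct.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for the decoded Parquet footer. A real experiment
/// would store the parquet crate's metadata type here instead.
#[derive(Debug)]
pub struct CachedMetadata {
    pub num_rows: u64,
}

/// Process-wide cache keyed on file name (assumption: files do not change
/// while the process runs, so there is no invalidation).
static METADATA_CACHE: OnceLock<Mutex<HashMap<String, Arc<CachedMetadata>>>> =
    OnceLock::new();

/// Returns the cached metadata for `path`, calling `load` only on a miss.
/// The lock is held while `load` runs, which serializes concurrent misses;
/// that is acceptable for an experiment but not for production use.
pub fn get_or_load_metadata<F>(path: &str, load: F) -> Arc<CachedMetadata>
where
    F: FnOnce() -> CachedMetadata,
{
    let cache = METADATA_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().unwrap();
    if let Some(hit) = guard.get(path) {
        // Cache hit: hand out another reference to the same metadata.
        return Arc::clone(hit);
    }
    // Cache miss: parse once, then remember the result for later callers.
    let loaded = Arc::new(load());
    guard.insert(path.to_string(), Arc::clone(&loaded));
    loaded
}
```

The closure passed as `load` would wrap whatever code currently reads and parses the footer, so repeated opens of the same file skip the parse. Comparing a Samply profile with and without this in place should show whether the stalls go away.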
Additional context
No response
