Enable bloom filters by default on read #10299

@hiltontj

Is your feature request related to a problem or challenge?

When reading from parquet files, bloom filters are not enabled by default. It is not immediately obvious that they are not being used when performing queries, so there may be users out there who are not aware that bloom filters in their parquet files are being ignored.

Part of the issue, however, is that the default behaviour looks to be shared between read and write operations.

Describe the solution you'd like

It would be ideal if bloom filters were enabled by default on read. We should be careful, however, as I do not think they should be enabled by default on write, where, depending on how they are configured, their inclusion can be expensive.
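One way to picture the requested behaviour is a pair of independent flags with different defaults. The sketch below is only an illustration: the struct and field names are hypothetical and are not taken from DataFusion's configuration code.

```rust
/// Hypothetical split of the single `bloom_filter_enabled` flag into
/// separate read and write settings. All names here are illustrative
/// assumptions, not existing DataFusion options.
#[derive(Debug, Clone, PartialEq)]
pub struct ParquetBloomFilterOptions {
    /// Use bloom filters found in Parquet files to prune row groups on read.
    pub bloom_filter_on_read: bool,
    /// Build bloom filters when writing Parquet files.
    pub bloom_filter_on_write: bool,
}

impl Default for ParquetBloomFilterOptions {
    fn default() -> Self {
        Self {
            // Reading an existing filter is cheap, so default to on.
            bloom_filter_on_read: true,
            // Building filters can be expensive, so default to off.
            bloom_filter_on_write: false,
        }
    }
}
```

With defaults like these, `ParquetBloomFilterOptions::default()` would use existing filters on read while leaving writes unchanged unless the user opts in.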

Describe alternatives you've considered

Currently, bloom filters can be enabled, but only explicitly. For example, with datafusion-cli, which uses the default configuration, one must enable the setting via the environment, e.g.,

DATAFUSION_EXECUTION_PARQUET_BLOOM_FILTER_ENABLED=true datafusion-cli

or by setting it explicitly, e.g.,

SET datafusion.execution.parquet.bloom_filter_enabled=true;

This may not work for everyone, however, since the same setting also applies on write, and writing with bloom filters enabled may cause problems.
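The environment variable and the `SET` key shown above name the same option. The two spellings appear to be related by upper-casing the dotted key and replacing each `.` with `_`. A minimal sketch of that mapping follows; the helper name `env_var_for_key` is hypothetical and not part of DataFusion's API.

```rust
/// Derive an environment-variable name from a dotted configuration key
/// by upper-casing it and replacing each `.` with `_`.
/// The function name is a hypothetical helper used for illustration.
pub fn env_var_for_key(key: &str) -> String {
    key.to_uppercase().replace('.', "_")
}
```

For example, passing `datafusion.execution.parquet.bloom_filter_enabled` yields `DATAFUSION_EXECUTION_PARQUET_BLOOM_FILTER_ENABLED`.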

Additional context

Bloom filters are disabled by default here: https://github.com/apache/datafusion/blob/37.1.0/datafusion/common/src/config.rs#L398-L399

This setting is ultimately used to prune row groups on read here: https://github.com/apache/datafusion/blob/37.1.0/datafusion/core/src/datasource/physical_plan/parquet/mod.rs#L531-L545
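To illustrate what pruning row groups with a bloom filter means, here is a toy sketch. It is not DataFusion's implementation and does not use the Parquet split-block bloom filter format; `ToyBloomFilter` and `row_groups_to_scan` are hypothetical names. The idea is that each row group carries a filter, and a row group can be skipped for an equality predicate when its filter reports the value as definitely absent.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter over a fixed-size bit vector (illustrative only).
pub struct ToyBloomFilter {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl ToyBloomFilter {
    /// `num_bits` must be greater than zero.
    pub fn new(num_bits: usize, num_hashes: u64) -> Self {
        Self { bits: vec![false; num_bits], num_hashes }
    }

    /// Bit position for `value` under the `seed`-th hash function.
    fn position<T: Hash>(&self, value: &T, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        value.hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }

    /// Record `value` by setting one bit per hash function.
    pub fn insert<T: Hash>(&mut self, value: &T) {
        for seed in 0..self.num_hashes {
            let pos = self.position(value, seed);
            self.bits[pos] = true;
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    pub fn might_contain<T: Hash>(&self, value: &T) -> bool {
        (0..self.num_hashes).all(|seed| self.bits[self.position(value, seed)])
    }
}

/// Indices of row groups that must still be read for `column = value`,
/// given one filter per row group.
pub fn row_groups_to_scan<T: Hash>(filters: &[ToyBloomFilter], value: &T) -> Vec<usize> {
    filters
        .iter()
        .enumerate()
        .filter(|(_, filter)| filter.might_contain(value))
        .map(|(index, _)| index)
        .collect()
}
```

A filter never produces false negatives, so skipping a row group is always safe. False positives only cost an unnecessary read.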

It looks like this setting is also applied on write here: https://github.com/apache/datafusion/blob/37.1.0/datafusion/common/src/file_options/parquet_writer.rs#L68

There is an existing SLT test that explicitly enables this setting when performing a query here: https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/predicates.slt#L509-L547. However, I do not see any tests that use this setting on write.

Labels: enhancement
