
Commit 48f9b7a

separate contributors guide (#3128)
1 parent b1765f7 commit 48f9b7a

File tree

10 files changed: +327 -309 lines changed

CONTRIBUTING.md

Lines changed: 1 addition & 254 deletions
@@ -17,257 +17,4 @@
under the License.
-->

# Introduction

We welcome and encourage contributions of all kinds, such as:

1. Tickets with issue reports or feature requests
2. Documentation improvements
3. Code (PR or PR Review)

In addition to submitting new PRs, we have a healthy tradition of community members helping review each other's PRs. Doing so is a great way to help the community as well as to get more familiar with Rust and the relevant codebases.

You can find a curated
[good-first-issue](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
list to help you get started.
# Developer's guide

This section describes how you can get started developing DataFusion.

### Windows setup

```shell
wget https://az792536.vo.msecnd.net/vms/VMBuild_20190311/VirtualBox/MSEdge/MSEdge.Win10.VirtualBox.zip
choco install -y git rustup.install visualcpp-build-tools
git-bash.exe
cargo build
```
### Bootstrap environment

DataFusion is written in Rust and uses the standard Rust toolkit:

- `cargo build`
- `cargo fmt` to format the code
- `cargo test` to run tests
- etc.

Testing setup (the combined bootstrap sequence is sketched below):

- `rustup update stable`: DataFusion uses the latest stable release of Rust
- `git submodule init`
- `git submodule update`
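Putting those steps together, bootstrapping a fresh checkout might look like this (a sketch; run from the repository root):

```shell
# Update the toolchain, fetch the test-data submodules,
# then build and run the test suite.
rustup update stable
git submodule init
git submodule update
cargo build
cargo test
```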
Formatting instructions:

- [ci/scripts/rust_fmt.sh](ci/scripts/rust_fmt.sh)
- [ci/scripts/rust_clippy.sh](ci/scripts/rust_clippy.sh)
- [ci/scripts/rust_toml_fmt.sh](ci/scripts/rust_toml_fmt.sh)

or run them all at once:

- [dev/rust_lint.sh](dev/rust_lint.sh)
## Test Organization

DataFusion has several levels of tests in its [Test
Pyramid](https://martinfowler.com/articles/practical-test-pyramid.html)
and tries to follow the [Test Organization](https://doc.rust-lang.org/book/ch11-03-test-organization.html) chapter of The Book.

This section highlights the most important test modules.
### Unit tests

Tests for the code in an individual module are defined in the same source file with a `test` module, following Rust convention. The pattern is sketched below.
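As a refresher, this is what the convention looks like (a generic Rust sketch, not code from the DataFusion tree; `add` is a made-up example function):

```rust
pub fn add(a: i64, b: i64) -> i64 {
    a + b
}

// Unit tests live next to the code they exercise, in a module
// that is only compiled when running `cargo test`.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn adds_two_numbers() {
        assert_eq!(add(2, 3), 5);
    }
}
```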
### Rust Integration Tests

There are several tests of the public interface of the DataFusion library in the [tests](https://github.com/apache/arrow-datafusion/tree/master/datafusion/core/tests) directory.

You can run these tests individually using a command such as

```shell
cargo test -p datafusion --test sql_integration
```

One very important test is the [sql_integration](https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql_integration.rs) test, which validates DataFusion's ability to run a large assortment of SQL queries against an assortment of data setups.
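To narrow things down further, a name filter can be appended so that only matching tests within that target run (the filter below is a placeholder):

```shell
cargo test -p datafusion --test sql_integration my_test_name
```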
### SQL / Postgres Integration Tests

The [integration-tests](https://github.com/apache/arrow-datafusion/blob/master/datafusion/integration-tests) directory contains a harness that runs certain queries against both Postgres and DataFusion and compares the results.
#### Setup environment

```shell
export POSTGRES_DB=postgres
export POSTGRES_USER=postgres
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
```
#### Install dependencies

```shell
# Install dependencies
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r integration-tests/requirements.txt

# Create the test table
psql -d "$POSTGRES_DB" -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -c 'CREATE TABLE IF NOT EXISTS test (
  c1 character varying NOT NULL,
  c2 integer NOT NULL,
  c3 smallint NOT NULL,
  c4 smallint NOT NULL,
  c5 integer NOT NULL,
  c6 bigint NOT NULL,
  c7 smallint NOT NULL,
  c8 integer NOT NULL,
  c9 bigint NOT NULL,
  c10 character varying NOT NULL,
  c11 double precision NOT NULL,
  c12 double precision NOT NULL,
  c13 character varying NOT NULL
);'

# Load the test data
psql -d "$POSTGRES_DB" -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -c "\copy test FROM '$(pwd)/testing/data/csv/aggregate_test_100.csv' WITH (FORMAT csv, HEADER true);"
```
#### Invoke the test runner

```shell
python -m pytest -v integration-tests/test_psql_parity.py
```
## Benchmarks

### Criterion Benchmarks

[Criterion](https://docs.rs/criterion/latest/criterion/index.html) is a statistics-driven micro-benchmarking framework used by DataFusion to evaluate the performance of specific code paths. In particular, the Criterion benchmarks help both to guide optimisation efforts and to prevent performance regressions within DataFusion.

Criterion integrates with Cargo's built-in [benchmark support](https://doc.rust-lang.org/cargo/commands/cargo-bench.html), and a given benchmark can be run with

```shell
cargo bench --bench BENCHMARK_NAME
```

A full list of benchmarks can be found [here](./datafusion/benches).

_[cargo-criterion](https://github.com/bheisler/cargo-criterion) may also be used for more advanced reporting._
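For example, cargo-criterion is installed separately and then invoked in place of `cargo bench` (a sketch, assuming it accepts the same `--bench` target selection):

```shell
cargo install cargo-criterion
cargo criterion --bench BENCHMARK_NAME
```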
#### Parquet SQL Benchmarks

The Parquet SQL benchmarks can be run with

```shell
cargo bench --bench parquet_query_sql
```

These randomly generate a Parquet file, and then benchmark queries sourced from [parquet_query_sql.sql](./datafusion/core/benches/parquet_query_sql.sql) against it. This can therefore be a quick way to add coverage of particular query and/or data paths.

If the environment variable `PARQUET_FILE` is set, the benchmark will run queries against this file instead of a randomly generated one. This can be useful for performing multiple runs, potentially with different code, against the same source data, or for testing against a custom dataset.
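For example (the path below is a placeholder for your own dataset):

```shell
PARQUET_FILE=/path/to/my_data.parquet cargo bench --bench parquet_query_sql
```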
The benchmark will automatically remove any generated Parquet file on exit; however, if it is interrupted (e.g. by CTRL+C) the file will be left behind. This can be useful for analysing the particular file after the fact, or for preserving it to use with `PARQUET_FILE` in subsequent runs.
### Upstream Benchmark Suites

Instructions and tooling for running upstream benchmark suites against DataFusion can be found in [benchmarks](./benchmarks).

These are valuable for comparative evaluation against alternative Arrow implementations and query engines.
## How to add a new scalar function

Below is a checklist of what you need to do to add a new scalar function to DataFusion (a generic sketch of how the pieces fit together follows the list):

- Add the actual implementation of the function:
  - [here](datafusion/physical-expr/src/string_expressions.rs) for string functions
  - [here](datafusion/physical-expr/src/math_expressions.rs) for math functions
  - [here](datafusion/physical-expr/src/datetime_expressions.rs) for datetime functions
  - create a new module [here](datafusion/physical-expr/src) for other functions
- In [core/src/physical_plan](datafusion/core/src/physical_plan/functions.rs), add:
  - a new variant to `BuiltinScalarFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given the incoming argument types
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_physical_expr`/`create_physical_fun` mapping the built-in to the implementation
  - tests for the function
- In [core/tests/sql](datafusion/core/tests/sql), add a new test where the function is called through SQL against well-known data and returns the expected result.
- In [expr/src/expr_fn.rs](datafusion/expr/src/expr_fn.rs), add:
  - a new entry of the `unary_scalar_expr!` macro for the new function
- In [core/src/logical_plan/mod](datafusion/core/src/logical_plan/mod.rs), add:
  - a new entry in the `pub use expr::{}` set
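The checklist boils down to one pattern: an enum variant names the function, `FromStr` maps the SQL name to it, and match arms describe its types and dispatch to the implementation. The following self-contained sketch mirrors that pattern with made-up types (`MyScalarFunction` and the `sqrt` example are illustrative, not DataFusion's actual definitions):

```rust
use std::str::FromStr;

// Illustrative stand-in for `BuiltinScalarFunction`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MyScalarFunction {
    Sqrt,
    // New variants go here.
}

// Maps the name used in SQL to the enum variant.
impl FromStr for MyScalarFunction {
    type Err = String;
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "sqrt" => Ok(MyScalarFunction::Sqrt),
            other => Err(format!("unknown function: {other}")),
        }
    }
}

impl MyScalarFunction {
    // Report the return type, given the argument types.
    fn return_type(&self, _arg_types: &[&str]) -> &'static str {
        match self {
            MyScalarFunction::Sqrt => "Float64",
        }
    }

    // Dispatch the built-in to its concrete implementation.
    fn invoke(&self, arg: f64) -> f64 {
        match self {
            MyScalarFunction::Sqrt => arg.sqrt(),
        }
    }
}

fn main() {
    let f: MyScalarFunction = "sqrt".parse().unwrap();
    assert_eq!(f.return_type(&["Float64"]), "Float64");
    assert_eq!(f.invoke(9.0), 3.0);
}
```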
## How to add a new aggregate function

Below is a checklist of what you need to do to add a new aggregate function to DataFusion (a sketch of the accumulator lifecycle follows the list):

- Add the actual implementation of an `Accumulator` and `AggregateExpr`:
  - [here](datafusion/physical-expr/src/string_expressions.rs) for string functions
  - [here](datafusion/physical-expr/src/math_expressions.rs) for math functions
  - [here](datafusion/physical-expr/src/datetime_expressions.rs) for datetime functions
  - create a new module [here](datafusion/physical-expr/src) for other functions
- In [datafusion/expr/src](datafusion/expr/src/aggregate_function.rs), add:
  - a new variant to `AggregateFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given the incoming argument types
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_aggregate_expr` mapping the built-in to the implementation
  - tests for the function
- In [tests/sql](datafusion/core/tests/sql), add a new test where the function is called through SQL against well-known data and returns the expected result.
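To make the `Accumulator` idea concrete, here is a minimal self-contained sketch of the update/merge/evaluate lifecycle an aggregate implementation follows. It deliberately uses plain Rust types rather than DataFusion's actual trait (which operates on Arrow arrays and `ScalarValue`s); only the shape carries over:

```rust
// A toy "sum" aggregate: fold input batches into a running state,
// merge partial states from other partitions, then produce the result.
#[derive(Default)]
struct SumAccumulator {
    sum: f64,
}

impl SumAccumulator {
    // Fold a batch of input values into the running state.
    fn update_batch(&mut self, values: &[f64]) {
        self.sum += values.iter().sum::<f64>();
    }

    // Combine a partial state produced elsewhere,
    // e.g. by an accumulator for another partition.
    fn merge(&mut self, other: &SumAccumulator) {
        self.sum += other.sum;
    }

    // Produce the final aggregate value.
    fn evaluate(&self) -> f64 {
        self.sum
    }
}

fn main() {
    let mut left = SumAccumulator::default();
    left.update_batch(&[1.0, 2.0]);

    let mut right = SumAccumulator::default();
    right.update_batch(&[3.0]);

    left.merge(&right);
    assert_eq!(left.evaluate(), 6.0);
}
```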
## How to display plans graphically

The query plans represented by `LogicalPlan` nodes can be graphically
rendered using [Graphviz](http://www.graphviz.org/).

To do so, save the output of the `display_graphviz` function to a file:

```rust
use std::fs::File;
use std::io::Write;

// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz())?;
```

Then, use the `dot` command line tool to render it into a file that
can be displayed. For example, the following command creates a
`/tmp/plan.pdf` file:

```bash
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
```
## Specification

We formalize DataFusion semantics and behaviors through specification
documents. These specifications are useful as references to help
resolve ambiguities during development or code reviews.

You are also welcome to propose changes to existing specifications or create
new specifications as you see fit.

Here is the list of currently active specifications:

- [Output field name semantic](https://arrow.apache.org/datafusion/specification/output-field-name-semantic.html)
- [Invariants](https://arrow.apache.org/datafusion/specification/invariants.html)

All specifications are stored in the `docs/source/specification` folder.
## How to format `.md` documents

We use `prettier` to format `.md` files.

You can either use `npm i -g prettier` to install it globally, or use `npx` to run it as a standalone binary. Using `npx` requires a working Node.js environment. Upgrading to the latest prettier is recommended (by adding `--upgrade` to the `npm` command).

```bash
$ prettier --version
2.3.0
```

After you've confirmed your prettier version, you can format all the `.md` files:

```bash
prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md
```
The entire guide above was replaced by a single pointer:

+ See the Contributor Guide: https://arrow.apache.org/datafusion/ or the source under `docs/source/contributor-guide`

README.md

Lines changed: 4 additions & 4 deletions
```diff
@@ -99,7 +99,7 @@ Please see [example usage](https://arrow.apache.org/datafusion/user-guide/exampl

 ## Roadmap

-Please see [Roadmap](docs/source/specification/roadmap.md) for information on where the project is headed.
+Please see [Roadmap](docs/source/contributor-guide/roadmap.md) for information on where the project is headed.

 ## Architecture Overview

@@ -109,10 +109,10 @@ There is no formal document describing DataFusion's architecture yet, but the fo

 - (March 2021): The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
 - (February 2021): How DataFusion is used within the Ballista Project is described in _Ballista: Distributed Compute with Rust and Apache Arrow_: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)

-## User's guide
+## User Guide

 Please see [User Guide](https://arrow.apache.org/datafusion/) for more information about DataFusion.

-## Contribution Guide
+## Contributor Guide

-Please see [Contribution Guide](CONTRIBUTING.md) for information about contributing to DataFusion.
+Please see [Contributor Guide](docs/source/contributor-guide/index.md) for information about contributing to DataFusion.
```

docs/source/developer-guide/community/communication.md renamed to docs/source/contributor-guide/communication.md

Lines changed: 0 additions & 12 deletions
```diff
@@ -69,15 +69,3 @@ The goals of these calls are:

 No decisions are made on the call and anything of substance will be discussed on this mailing list or in GitHub issues / Google docs.

 We will send a summary of all sync-ups to the [email protected] mailing list.
-
-## Contributing
-
-Our source code is hosted on
-[GitHub](https://github.com/apache/arrow-datafusion). More information on contributing is in
-the [Contribution Guide](https://github.com/apache/arrow-datafusion/blob/master/CONTRIBUTING.md),
-and we have curated a [good-first-issue](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
-list to help you get started. You can find DataFusion's major designs in docs/source/specification.
-
-We use GitHub issues for maintaining a queue of development work and as the
-public record. We often use Google docs, GitHub issues and pull requests for
-quick and small design discussions. For major design change proposals, we encourage you to write an RFC.
```
