  under the License.
-->

# Introduction

We welcome and encourage contributions of all kinds, such as:

1. Tickets with issue reports or feature requests
2. Documentation improvements
3. Code (PR or PR Review)

In addition to submitting new PRs, we have a healthy tradition of community members helping review each other's PRs. Doing so is a great way to help the community as well as get more familiar with Rust and the relevant codebases.

You can find a curated
[good-first-issue](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
list to help you get started.
# Developer's guide

This section describes how you can get started developing DataFusion.

### Windows setup

```shell
# Download a Windows 10 VM image (only needed if you are not already on Windows)
wget https://az792536.vo.msecnd.net/vms/VMBuild_20190311/VirtualBox/MSEdge/MSEdge.Win10.VirtualBox.zip

# Install the toolchain via Chocolatey, then build from a Git Bash shell
choco install -y git rustup.install visualcpp-build-tools
git-bash.exe
cargo build
```

### Bootstrap environment

DataFusion is written in Rust and uses the standard Rust toolchain:

- `cargo build`
- `cargo fmt` to format the code
- `cargo test` to test
- etc.

Testing setup:

- `rustup update stable`: DataFusion uses the latest stable release of Rust
- `git submodule init`
- `git submodule update`
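
Putting these together, a first build from a fresh checkout might look like the following sketch (assuming the project's standard GitHub remote):

```shell
# Clone the repository and fetch the test data submodules
git clone https://github.com/apache/arrow-datafusion.git
cd arrow-datafusion
git submodule update --init

# Build and run the test suite
cargo build
cargo test
```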

Formatting instructions:

- [ci/scripts/rust_fmt.sh](ci/scripts/rust_fmt.sh)
- [ci/scripts/rust_clippy.sh](ci/scripts/rust_clippy.sh)
- [ci/scripts/rust_toml_fmt.sh](ci/scripts/rust_toml_fmt.sh)

or run them all at once:

- [dev/rust_lint.sh](dev/rust_lint.sh)

## Test Organization

DataFusion has several levels of tests in its [Test
Pyramid](https://martinfowler.com/articles/practical-test-pyramid.html)
and tries to follow the [Testing Organization](https://doc.rust-lang.org/book/ch11-03-test-organization.html) described in The Book.

This section highlights the most important test modules that exist.

### Unit tests

Tests for the code in an individual module are defined in the same source file with a `test` module, following the Rust convention.
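
For example, a module's tests typically live in a `#[cfg(test)]` block at the bottom of the same file. A minimal sketch (the `add` function is hypothetical, not a DataFusion API):

```rust
// Hypothetical module code under test.
pub fn add(a: i64, b: i64) -> i64 {
    a + b
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_add() {
        // Compiled and run only by `cargo test`.
        assert_eq!(add(2, 2), 4);
    }
}
```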

### Rust Integration Tests

There are several tests of the public interface of the DataFusion library in the [tests](https://github.com/apache/arrow-datafusion/tree/master/datafusion/core/tests) directory.

You can run these tests individually using a command such as

```shell
cargo test -p datafusion --test sql_integration
```

One very important test is the [sql_integration](https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/sql_integration.rs) test, which validates DataFusion's ability to run a large assortment of SQL queries against an assortment of data setups.

### SQL / Postgres Integration Tests

The [integration-tests](https://github.com/apache/arrow-datafusion/blob/master/datafusion/integration-tests) directory contains a harness that runs certain queries against both Postgres and DataFusion and compares the results.

#### Setup environment

```shell
export POSTGRES_DB=postgres
export POSTGRES_USER=postgres
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
```

#### Install dependencies and load test data

```shell
# Install dependencies
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r integration-tests/requirements.txt

# Create the test table
psql -d "$POSTGRES_DB" -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -c 'CREATE TABLE IF NOT EXISTS test (
    c1 character varying NOT NULL,
    c2 integer NOT NULL,
    c3 smallint NOT NULL,
    c4 smallint NOT NULL,
    c5 integer NOT NULL,
    c6 bigint NOT NULL,
    c7 smallint NOT NULL,
    c8 integer NOT NULL,
    c9 bigint NOT NULL,
    c10 character varying NOT NULL,
    c11 double precision NOT NULL,
    c12 double precision NOT NULL,
    c13 character varying NOT NULL
);'

# Load the test data
psql -d "$POSTGRES_DB" -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -c "\copy test FROM '$(pwd)/testing/data/csv/aggregate_test_100.csv' WITH (FORMAT csv, HEADER true);"
```

#### Invoke the test runner

```shell
python -m pytest -v integration-tests/test_psql_parity.py
```

The Postgres connection settings can also be passed inline instead of being exported:

```shell
POSTGRES_DB=postgres POSTGRES_USER=postgres POSTGRES_HOST=localhost POSTGRES_PORT=5432 python -m pytest -v integration-tests/test_psql_parity.py
```

## Benchmarks

### Criterion Benchmarks

[Criterion](https://docs.rs/criterion/latest/criterion/index.html) is a statistics-driven micro-benchmarking framework used by DataFusion for evaluating the performance of specific code paths. In particular, the Criterion benchmarks help both to guide optimisation efforts and to prevent performance regressions within DataFusion.

Criterion integrates with Cargo's built-in [benchmark support](https://doc.rust-lang.org/cargo/commands/cargo-bench.html), and a given benchmark can be run with

```shell
cargo bench --bench BENCHMARK_NAME
```

A full list of benchmarks can be found [here](./datafusion/benches).

_[cargo-criterion](https://github.com/bheisler/cargo-criterion) may also be used for more advanced reporting._
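
As an illustration, a minimal Criterion benchmark looks roughly like the following sketch; the `parse_sql` function, the benchmark name, and the file layout are hypothetical, not an existing DataFusion benchmark:

```rust
// e.g. benches/my_benchmark.rs, registered via a [[bench]] entry in
// Cargo.toml with `harness = false`.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical function under test.
fn parse_sql(query: &str) -> usize {
    query.split_whitespace().count()
}

fn criterion_benchmark(c: &mut Criterion) {
    // `black_box` prevents the compiler from optimising the input away.
    c.bench_function("parse_sql", |b| {
        b.iter(|| parse_sql(black_box("SELECT c1, c2 FROM test WHERE c3 > 5")))
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```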

#### Parquet SQL Benchmarks

The parquet SQL benchmarks can be run with

```shell
cargo bench --bench parquet_query_sql
```

These randomly generate a parquet file, and then benchmark queries sourced from [parquet_query_sql.sql](./datafusion/core/benches/parquet_query_sql.sql) against it. This can therefore be a quick way to add coverage of particular query and/or data paths.

If the environment variable `PARQUET_FILE` is set, the benchmark will run queries against this file instead of a randomly generated one. This can be useful for performing multiple runs, potentially with different code, against the same source data, or for testing against a custom dataset.

The benchmark will automatically remove any generated parquet file on exit; however, if interrupted (e.g. by CTRL+C) it will not. This can be useful for analysing the particular file after the fact, or preserving it to use with `PARQUET_FILE` in subsequent runs.
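
For example, pointing the benchmark at a preserved or custom file looks like this (the path is illustrative):

```shell
PARQUET_FILE=/tmp/my_data.parquet cargo bench --bench parquet_query_sql
```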

### Upstream Benchmark Suites

Instructions and tooling for running upstream benchmark suites against DataFusion can be found in [benchmarks](./benchmarks).

These are valuable for comparative evaluation against alternative Arrow implementations and query engines.

## How to add a new scalar function

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

- Add the actual implementation of the function (a rough sketch of such a kernel follows this list):
  - [here](datafusion/physical-expr/src/string_expressions.rs) for string functions
  - [here](datafusion/physical-expr/src/math_expressions.rs) for math functions
  - [here](datafusion/physical-expr/src/datetime_expressions.rs) for datetime functions
  - create a new module [here](datafusion/physical-expr/src) for other functions
- In [core/src/physical_plan](datafusion/core/src/physical_plan/functions.rs), add:
  - a new variant to `BuiltinScalarFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given an incoming type
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_physical_expr`/`create_physical_fun` mapping the built-in to the implementation
  - tests for the function
- In [core/tests/sql](datafusion/core/tests/sql), add a new test where the function is called through SQL against well-known data and returns the expected result.
- In [expr/src/expr_fn.rs](datafusion/expr/src/expr_fn.rs), add:
  - a new invocation of the `unary_scalar_expr!` macro for the new function
- In [core/src/logical_plan/mod](datafusion/core/src/logical_plan/mod.rs), add:
  - a new entry in the `pub use expr::{}` set
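
For orientation, the scalar kernels in the modules above operate on Arrow arrays. A minimal sketch of that shape, assuming Arrow's `ArrayRef` (the `add_one` function and its string-based error handling are hypothetical, not DataFusion's exact signatures):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};

/// Hypothetical scalar kernel: adds one to each element of an Int64 column.
fn add_one(args: &[ArrayRef]) -> Result<ArrayRef, String> {
    let input = args[0]
        .as_any()
        .downcast_ref::<Int64Array>()
        .ok_or_else(|| "add_one expects an Int64 array".to_string())?;
    // Apply the operation element-wise, preserving nulls.
    let result: Int64Array = input.iter().map(|v| v.map(|x| x + 1)).collect();
    Ok(Arc::new(result))
}
```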

## How to add a new aggregate function

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

- Add the actual implementation of an `Accumulator` and `AggregateExpr` in [datafusion/physical-expr/src](datafusion/physical-expr/src); the existing aggregate expressions there are good models (a conceptual sketch of an accumulator follows this list)
- In [datafusion/expr/src](datafusion/expr/src/aggregate_function.rs), add:
  - a new variant to `AggregateFunction`
  - a new entry to `FromStr` with the name of the function as called by SQL
  - a new line in `return_type` with the expected return type of the function, given an incoming type
  - a new line in `signature` with the signature of the function (number and types of its arguments)
  - a new line in `create_aggregate_expr` mapping the built-in to the implementation
  - tests for the function
- In [tests/sql](datafusion/core/tests/sql), add a new test where the function is called through SQL against well-known data and returns the expected result.
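
Conceptually, an accumulator folds input batches into partial state, merges partial states from other partitions, and then produces a final value. The following plain-Rust sketch illustrates only that shape; DataFusion's real `Accumulator` trait works in terms of Arrow arrays and scalar values:

```rust
/// Illustrative only: a running count, not DataFusion's `Accumulator` trait.
#[derive(Default)]
struct CountAccumulator {
    count: u64,
}

impl CountAccumulator {
    /// Fold a batch of (nullable) input values into the partial state.
    fn update_batch(&mut self, values: &[Option<i64>]) {
        self.count += values.iter().filter(|v| v.is_some()).count() as u64;
    }

    /// Merge the partial state produced by another partition.
    fn merge(&mut self, other: &CountAccumulator) {
        self.count += other.count;
    }

    /// Produce the final aggregate value.
    fn evaluate(&self) -> u64 {
        self.count
    }
}
```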

## How to display plans graphically

The query plans represented by `LogicalPlan` nodes can be graphically
rendered using [Graphviz](http://www.graphviz.org/).

To do so, save the output of the `display_graphviz` function to a file:

```rust
use std::fs::File;
use std::io::Write;

// Create the plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz())?;
```

Then, use the `dot` command line tool to render it into a file that
can be displayed. For example, the following command creates a
`/tmp/plan.pdf` file:

```bash
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
```

## Specification

We formalize DataFusion semantics and behaviors through specification
documents. These specifications serve as references to help resolve
ambiguities during development or code reviews.

You are also welcome to propose changes to existing specifications or create
new specifications as you see fit.

Here is the list of currently active specifications:

- [Output field name semantic](https://arrow.apache.org/datafusion/specification/output-field-name-semantic.html)
- [Invariants](https://arrow.apache.org/datafusion/specification/invariants.html)

All specifications are stored in the `docs/source/specification` folder.

## How to format `.md` documents

We use `prettier` to format `.md` files.

You can either use `npm i -g prettier` to install it globally, or use `npx` to run it as a standalone binary. Using `npx` requires a working Node.js environment. Installing the latest prettier is recommended (e.g. `npm i -g prettier@latest`).

```bash
$ prettier --version
2.3.0
```

After you've confirmed your prettier version, you can format all the `.md` files:

```bash
prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md
```