## Data Quality core module setup and usage

The DQ main application is written in Scala, and the build is managed with SBT.

> **Before starting:** Install a JDK, Scala, sbt and Git.

First of all, clone this repository:
```
git clone https://github.com/agile-lab-dev/DataQuality.git
```
Then you have two options:
- Run DQ locally
- Create an archive with the setup to run in your distributed environment

#### Local run

Simply run the `DQMasterBatch` class from your IDE or with the usual Java tools, passing the following arguments:

- __-a__: Path to the application configuration file.
> **Example:** ./Agile.DataQuality/dq-core/src/main/resources/conf/dev.conf

- __-c__: Path to the run configuration file.
> **Example:** ./Agile.DataQuality/docs/examples/conf/full-prostprocess-example.conf

- __-d__: Run date.
> **Example:** 2019-01-01

- __-l__: _Optional._ Flag to run in local mode.

- __-r__: _Optional._ Flag to repartition sources after reading.
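> **Note:** As an illustration, the flags above combine as in the sketch below. The main-class package, the assembly jar name and the use of `spark-submit` in local mode are assumptions, so check the dq-core sources and your own build output for the exact values.

```
# Hypothetical local run: <package> and the jar path are placeholders,
# substitute the real main class and the assembly produced by your build.
spark-submit \
  --master "local[*]" \
  --class <package>.DQMasterBatch \
  dq-core-assembly.jar \
  -a ./Agile.DataQuality/dq-core/src/main/resources/conf/dev.conf \
  -c ./Agile.DataQuality/docs/examples/conf/full-prostprocess-example.conf \
  -d 2019-01-01 \
  -l
```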
#### Distributed environment

##### Deployment
First, you will need to deploy your application to the cluster. You can assemble the jar on your own using sbt, or you can use one of our predefined utilities.
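> **Note:** If you assemble the jar yourself, a typical sbt invocation is sketched below. The module name and the use of the sbt-assembly plugin are assumptions; adjust them to whatever `build.sbt` actually defines.

```
# Hypothetical: build the fat jar for the core module from the repository root
# (assumes the sbt-assembly plugin is configured for it in build.sbt).
sbt "dq-core/assembly"
```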
To use our `deploy.sh` script, follow these steps:
- Set REMOTE_HOST and REMOTE_USERNAME in `deploy.sh`.
- Create an `application.conf` for your environment.
- Create a directory with the internal directories `bin` and `conf`, and put your run scripts and configuration files in the corresponding subdirectories (a layout sketch follows this list).
  > **Tip:** You can use `run-default.sh` as a base for your run script.
- Link the `application.conf` file and the directory with run scripts and confs to the corresponding parameter values in `build.sbt`.
- Run `deploy.sh` with your parameters.
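> **Note:** As a sketch, the directory you link in `build.sbt` could be prepared as below; the directory and file names are purely illustrative.

```
# Illustrative only: create the bin/conf layout expected by deploy.sh and
# copy your run script and configuration files into it.
mkdir -p my-dq-env/bin my-dq-env/conf
cp run-default.sh my-dq-env/bin/run-myproject.sh   # adapt the base run script
cp application.conf my-dq-env/conf/                # environment-specific settings
```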
##### Submitting
In a distributed environment the Data Quality application is treated as a standard Spark job, submitted by the `submit.sh` script.

You can also submit the job manually or wrap the call in your own run script; this is completely up to you.
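> **Note:** If you submit manually instead of using `submit.sh`, the call has the general shape sketched below. The master, main class, jar and configuration paths are placeholders; mirror whatever your deployed run script passes.

```
# Hypothetical manual submission; replace <package>, the jar and the paths
# with the values used by your deployment.
spark-submit \
  --master yarn \
  --class <package>.DQMasterBatch \
  dq-core-assembly.jar \
  -a conf/application.conf \
  -c conf/my-run.conf \
  -d 2019-01-01
```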