Add ability to sample dataset #12
Conversation
pkg/workloads/spark_job/spark_job.py (Outdated)
if max_rows == full_dataset_size:
    return ingest_df
if subset_config["shuffle"]:
    ingest_df = ingest_df.sample(fraction=fraction, seed=subset_config["seed"])
As discussed, research whether using fraction=1.0 reduces efficiency. If it does hurt, we should either bump the fraction a little to avoid undersampling, or add the word "roughly" to the docs. If it works fine, then the logger.info below doesn't need "at most" :)
Added an extra buffer to sampling, but I think we should still keep "at most" to cover the randomness.
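The buffered-sampling idea discussed above can be sketched in plain Python (Spark-free so it runs standalone). Note that sample_rows, the 10% buffer value, and the final hard cap are illustrative assumptions for this sketch, not the PR's actual implementation; the point is that a per-row Bernoulli sample (like Spark's DataFrame.sample) only hits the target size approximately, so a small over-sampling buffer plus a hard cap gives an "at most max_rows" guarantee:

```python
import random

def sample_rows(rows, max_rows, seed, buffer=0.1):
    """Return at most max_rows rows, sampled pseudo-randomly.

    Bernoulli sampling (keep each row with probability `fraction`) yields
    only roughly fraction * len(rows) rows, so we bump the fraction by a
    small buffer and then hard-cap the result to guarantee the bound.
    """
    if max_rows >= len(rows):
        return rows
    # Over-sample slightly to reduce the chance of undersampling.
    fraction = min(1.0, max_rows / len(rows) * (1 + buffer))
    rng = random.Random(seed)
    sampled = [r for r in rows if rng.random() < fraction]
    # Hard cap: guarantees "at most" max_rows regardless of randomness.
    return sampled[:max_rows]
```

With the cap in place the "at most" wording in the log message is exact rather than approximate, while the buffer makes hitting max_rows itself the common case.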
Force-pushed from 0b57993 to 8514887
LGTM
Force-pushed from e7c31f5 to 643dddd
Checklist:
- cx refresh (in an example folder)
- ./build/test.sh
- cx init