Add ability to sample dataset #12
Conversation
pkg/workloads/spark_job/spark_job.py (Outdated)
if max_rows == full_dataset_size:
    return ingest_df
if subset_config["shuffle"]:
    ingest_df = ingest_df.sample(fraction=fraction, seed=subset_config["seed"])
As discussed, research whether using fraction=1.0 reduces efficiency. If it does hurt, we should either bump the fraction a little to avoid undersampling, or add the word "roughly" to the docs. If it works fine, then the logger.info below doesn't need "at most" :)
Added an extra buffer to sampling, but I think we should still keep "at most" to cover the randomness.
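The buffered-sampling idea discussed above can be sketched in plain Python (Spark-free so it runs standalone). Note that sample_rows, the 10% buffer value, and the final hard cap are illustrative assumptions for this sketch, not the PR's actual implementation; the point is that a per-row Bernoulli sample (like Spark's DataFrame.sample) only hits the target size approximately, so a small over-sampling buffer plus a hard cap gives an "at most max_rows" guarantee:

```python
import random

def sample_rows(rows, max_rows, seed, buffer=0.1):
    """Return at most max_rows rows, sampled pseudo-randomly.

    Bernoulli sampling (keep each row with probability `fraction`) yields
    only roughly fraction * len(rows) rows, so we bump the fraction by a
    small buffer and then hard-cap the result to guarantee the bound.
    """
    if max_rows >= len(rows):
        return rows
    # Over-sample slightly to reduce the chance of undersampling.
    fraction = min(1.0, max_rows / len(rows) * (1 + buffer))
    rng = random.Random(seed)
    sampled = [r for r in rows if rng.random() < fraction]
    # Hard cap: guarantees "at most" max_rows regardless of randomness.
    return sampled[:max_rows]
```

With the cap in place the "at most" wording in the log message is exact rather than approximate, while the buffer makes hitting max_rows itself the common case.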
Force-pushed from 0b57993 to 8514887
LGTM
Force-pushed from e7c31f5 to 643dddd
Checklist:
- cx refresh (in an example folder)
- ./build/test.sh
- cx init