Productionize the dataset we are using for BackendBench #57


Closed · wants to merge 35 commits

Conversation

Contributor

@PaliC PaliC commented Jul 31, 2025

This looks like a much bigger PR than it actually is. This is most of the work we need to do on the repo end for #44

This PR

  1. Adds a dataloaders folder to support loading data from parquet files, Hugging Face URLs, and trace files (BackendBench/data_loaders.py); see the loading sketch after this list
  2. Creates a script that lets you convert back and forth between parquet and trace files (BackendBench/scripts/parquet_trace_converter.py)
  3. Defines a schema for what the final dataset ought to look like
  4. Adds a few filters to weed out bad inputs (in this case, ops we likely don't want to benchmark because they only fill or view memory). The design should scale to further filters, e.g. outputs that are close to zero or runtimes that are too short.
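A rough sketch of the loading path from item 1 (function and column names here are illustrative assumptions, not the exact API of BackendBench/data_loaders.py):

    # Illustrative sketch only; names are assumptions, not the real API.
    import pyarrow.parquet as pq

    def load_ops(parquet_path, filter=None):
        """Stream (op_name, args) pairs from a parquet trace, keeping a row
        when no filter is given or its op name contains a filter substring."""
        for batch in pq.ParquetFile(parquet_path).iter_batches():
            df = batch.to_pandas()
            for _, row in df.iterrows():
                op_name = row["op_name"]
                if filter is None or any(f in op_name for f in filter):
                    yield op_name, row["args"]  # "args" column name is assumed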

I think 3 and 4 definitely require the most review.
The schema is described in the comment at the top of BackendBench/scripts/parquet_trace_converter.py

I'd also take a close look at the filters in BackendBench/scripts/dataset_filters.py, as the file contains a bunch of ops that don't seem useful in a benchmark; I'd like a second look on those.
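For a sense of shape only, a filter in that style might look like the following (the substring list and function name are placeholders, not the file's real contents):

    # Hypothetical sketch; the real op lists live in dataset_filters.py.
    FILL_OR_VIEW_SUBSTRINGS = ["fill", "view"]  # assumed examples

    def is_benchmarkable(op_name: str) -> bool:
        """Drop ops that only fill or reinterpret memory, since they say
        little about kernel performance."""
        return not any(s in op_name for s in FILL_OR_VIEW_SUBSTRINGS)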

BackendBench/scripts/parquet_trace_converter.py offers a trace-to-parquet mode and a parquet-to-trace mode. parquet-to-trace is self-explanatory. trace-to-parquet actually creates two parquet files: a "dev" parquet that carries a bunch of extra metadata on the inputs, and a final parquet (which I refer to as "prod") that is the result of all the filtering and is the one that should be used in benchmarks.
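A condensed sketch of that two-stage flow (column names, the metadata step, and the filter list are assumptions; the script's real logic differs in detail):

    import pandas as pd

    # Sketch of trace-to-parquet: write a metadata-rich "dev" parquet,
    # then a filtered "prod" parquet for use in benchmarks.
    def trace_to_parquet(df: pd.DataFrame, dev_out: str, prod_out: str,
                         bad_substrings=("fill", "view")) -> None:
        dev = df.copy()
        dev["arg_count"] = dev["args"].map(len)  # example of extra metadata
        dev.to_parquet(dev_out)                  # "dev": keeps everything
        keep = ~dev["op_name"].map(lambda n: any(s in n for s in bad_substrings))
        prod = dev.loc[keep, ["op_name", "args"]]  # "prod": filtered, core columns
        prod.to_parquet(prod_out)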

You can find explanations of the trace files (this section can be removed, as it is not meant to be permanent) and the argument schema at https://huggingface.co/datasets/GPUMODE/huggingface_op_trace (I will add the parquet schema once it is finalized).

The result of creating and uploading a parquet to Hugging Face: https://huggingface.co/datasets/GPUMODE/huggingface_op_trace

To validate that this works, here is the roundtrip conversion of the tritonbench data (trace -> parquet (dev) -> trace): https://www.diffchecker.com/YYiJ43cq/. The differences come from the fact that we rename the op "aten.sum.SymInt" to "aten.sum.dim_IntList".
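The check is easy to reproduce locally with something like this (file paths are illustrative):

    import difflib

    # Sketch of the roundtrip diff; paths are illustrative.
    with open("tritonbench_original.txt") as f:
        original = f.read().splitlines()
    with open("tritonbench_roundtrip.txt") as f:
        roundtrip = f.read().splitlines()
    # The only expected diff lines involve the aten.sum.SymInt rename.
    for line in difflib.unified_diff(original, roundtrip, lineterm=""):
        print(line)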

@meta-cla meta-cla bot added the CLA Signed label Jul 31, 2025
@PaliC PaliC requested review from msaroufim and bertmaher August 1, 2025 23:26
df = batch.to_pandas()
for _, row in df.iterrows():
    op_name = row["op_name"]
    # keep the row when no filter is set or any filter substring matches
    if filter is None or any(f in op_name for f in filter):
Member

Is the intent to have exact matching, prefix matching, or something else?

Contributor Author

Right now (for torchbench) we just search for the filter string anywhere in the op name. This should be fine for now, but if we want to be more rigorous we could do exact matching in another PR.
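For example, under the current substring semantics (op names below are illustrative):

    # Substring matching: a filter entry matches any op name containing it.
    ops = ["aten.sum.SymInt", "aten.sum.dim_IntList", "aten.cumsum.default"]
    filters = ["sum"]
    matched = [op for op in ops if any(f in op for f in filters)]
    # matched includes aten.cumsum.default too, since "cumsum" contains
    # "sum"; exact matching would avoid that kind of accidental hit.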

@msaroufim msaroufim self-requested a review August 13, 2025 17:49
@PaliC PaliC marked this pull request as draft August 13, 2025 23:37
@PaliC PaliC requested a review from msaroufim August 15, 2025 02:11
@PaliC PaliC requested a review from msaroufim August 15, 2025 20:42
@PaliC PaliC marked this pull request as ready for review August 15, 2025 20:55
Member

@msaroufim msaroufim left a comment

Can you keep the non-augmented dataset as the default for now? Considering we're sprinting, I don't want us to change the underlying eval infra too much unless it's to fix bugs.
