
Add _clone_dim_order portable kernel #12974


Merged: 16 commits into pytorch:main on Aug 11, 2025

Conversation

@keyprocedure (Contributor) commented on Jul 29, 2025:

Summary

This is PR 1 of 3 implementing a dim-order-aware clone op.

Currently, clone ops are removed during export as no-ops, so memory layout (dim order) changes are lost. This can cause backend failures, incorrect outputs when ops expect specific layouts, and performance degradation. This set of PRs introduces a dim-order-aware clone op, _clone_dim_order, which preserves memory layout changes by explicitly storing dim order information. This is implemented by replacing standard clone ops with this variant during export and by updating the clone removal transform to preserve clones that change layout.
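
For intuition: dim order lists a tensor's dimensions from outermost to innermost physical placement, so a contiguous NCHW tensor has dim order (0, 1, 2, 3) and a channels_last one has (0, 2, 3, 1). The sketch below is a self-contained illustration of how a dim order maps to strides; it is not ExecuTorch code, and the helper name is made up.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative helper (not part of this PR): dim_order lists dimensions from
// outermost to innermost physical placement, so strides are computed by
// walking it from the innermost dimension outward.
std::vector<int64_t> strides_from_dim_order(
    const std::vector<int64_t>& sizes,
    const std::vector<int64_t>& dim_order) {
  std::vector<int64_t> strides(sizes.size());
  int64_t acc = 1;
  for (size_t i = dim_order.size(); i-- > 0;) {
    strides[dim_order[i]] = acc;
    acc *= sizes[dim_order[i]];
  }
  return strides;
}

int main() {
  // A (1, 2, 2, 2) NCHW tensor with channels_last dim order (0, 2, 3, 1).
  auto s = strides_from_dim_order({1, 2, 2, 2}, {0, 2, 3, 1});
  // Prints "8 1 4 2": the NCHW strides of a channels_last tensor.
  std::printf(
      "%lld %lld %lld %lld\n",
      (long long)s[0], (long long)s[1], (long long)s[2], (long long)s[3]);
  return 0;
}
```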

This PR adds the portable CPU kernel for the _clone_dim_order op, implementing a clone variant that preserves dim order at runtime. The kernel validates dtype and layout compatibility, resizes the output tensor if needed, and performs an element-wise clone of the input tensor.
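
As a rough sketch of the semantics such a kernel implements: copy every logical element of the input into an output that stores the same values under a different dim order. This is a framework-free illustration under assumed names, not the PR's code; the actual kernel operates on ExecuTorch Tensors under executorch/kernels/portable and adds the dtype, layout, and non_blocking checks described above.

```cpp
#include <cstdint>
#include <vector>

// Illustrative reference semantics only: copy every logical element of src
// (stored with dim order src_order) into dst (stored with dim order
// dst_order), preserving logical values while changing physical layout.
void clone_with_dim_order(
    const std::vector<float>& src,
    const std::vector<int64_t>& src_order,
    std::vector<float>& dst,
    const std::vector<int64_t>& dst_order,
    const std::vector<int64_t>& sizes) {
  // Strides for a given dim order (outermost-to-innermost placement).
  auto strides_for = [&](const std::vector<int64_t>& order) {
    std::vector<int64_t> strides(sizes.size());
    int64_t acc = 1;
    for (size_t i = order.size(); i-- > 0;) {
      strides[order[i]] = acc;
      acc *= sizes[order[i]];
    }
    return strides;
  };
  const auto src_strides = strides_for(src_order);
  const auto dst_strides = strides_for(dst_order);

  int64_t numel = 1;
  for (int64_t s : sizes) {
    numel *= s;
  }
  dst.assign(static_cast<size_t>(numel), 0.0f);

  // Walk all logical indices; translate each into a physical offset under
  // both layouts, then copy element-wise.
  std::vector<int64_t> idx(sizes.size(), 0);
  for (int64_t n = 0; n < numel; ++n) {
    int64_t src_off = 0;
    int64_t dst_off = 0;
    for (size_t d = 0; d < sizes.size(); ++d) {
      src_off += idx[d] * src_strides[d];
      dst_off += idx[d] * dst_strides[d];
    }
    dst[static_cast<size_t>(dst_off)] = src[static_cast<size_t>(src_off)];
    for (size_t d = sizes.size(); d-- > 0;) {  // advance the logical index
      if (++idx[d] < sizes[d]) {
        break;
      }
      idx[d] = 0;
    }
  }
}
```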

Note: A future PR will add the ATen kernel for _clone_dim_order.

Related PRs:

  • PR 2: #12971 - Register _clone_dim_order op and map aten.clone
  • PR 3: #12976 - Update RemoveCloneOpsTransform to be dim_order aware

Fixes #12645

Test plan

Added kernel runtime tests to verify:

  • Tensors of all real dtypes are cloned correctly.
  • Failure when input and output tensor shapes mismatch.
  • Failure with unsupported memory formats.
  • Failure when non_blocking=true since the portable kernel only supports blocking data transfer.
  • Dynamic shape outputs are cloned with correct values.
  • Layout conversions are handled correctly: contiguous to channels_last, channels_last to contiguous, and channels_last preserved (no layout change).

All runtime tests pass via:
build-ninja/kernels/test/portable_kernels_test
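
For a flavor of these tests, here is a hedged sketch of a layout-conversion case in the style of op__to_dim_order_copy_test.cpp (referenced later in this thread). The op_clone_dim_order_out wrapper, the dim_order argument form, and the fixture plumbing are assumptions, not the PR's exact code:

```cpp
// Sketch only; assumes an OpDimOrderCloneTest fixture and an
// op_clone_dim_order_out helper wrapping the generated kernel entry point.
TEST_F(OpDimOrderCloneTest, ChannelsLastToContiguous) {
  TensorFactory<ScalarType::Float> tf;

  // Input stored in channels_last dim order (0, 2, 3, 1); data is given in
  // physical (NHWC) order.
  Tensor x = tf.make_with_dimorder(
      {1, 2, 2, 2},
      {1.0, 5.0, 2.0, 6.0, 3.0, 7.0, 4.0, 8.0},
      /*dim_order=*/{0, 2, 3, 1});

  // Output uses the default contiguous dim order (0, 1, 2, 3).
  Tensor out = tf.zeros({1, 2, 2, 2});

  // Expected values in contiguous (NCHW) physical order: the logical
  // contents are unchanged, only the layout differs.
  Tensor expected = tf.make_with_dimorder(
      {1, 2, 2, 2},
      {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0},
      /*dim_order=*/{0, 1, 2, 3});

  op_clone_dim_order_out(
      x, /*non_blocking=*/false, /*dim_order=*/{0, 1, 2, 3}, out);

  EXPECT_TENSOR_EQ(out, expected);
}
```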

@pytorch-bot (bot) commented on Jul 29, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12974

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit e28b9b9 with merge base 2d4533a, two jobs had new failures.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label on Jul 29, 2025.
@keyprocedure (Contributor, Author) commented:

@pytorchbot label "release notes: none"

keyprocedure added a commit to keyprocedure/executorch that referenced this pull request Aug 3, 2025
@keyprocedure keyprocedure changed the title Add portable and ATen kernels for clone_dim_order op Add _clone_dim_order portable kernel Aug 3, 2025
@Gasoonjia (Contributor) left a comment:


@keyprocedure

Good start! Thanks for your great work!

Beyond what we have now, it would be great to have a runtime test for the new runtime operator. https://github.com/pytorch/executorch/blob/main/kernels/test/op__to_dim_order_copy_test.cpp is the test for _to_dim_order_copy; you can use it as an example.

Also, please mark this PR as ready for review (rather than Draft) once it's ready. Thanks!

@@ -28,6 +28,14 @@
"_empty_dim_order.out(int[] size, *, int[]? dim_order=None, Tensor(a!) out) -> Tensor(a!)"
)

lib.define(
@Gasoonjia (Contributor) commented:


It's OK to leave this here since we're going to need it in the future, but when we talk about adding portable kernels we mainly mean the kernels in the runtime, specifically under executorch/kernels/portable.

@keyprocedure (Contributor, Author) replied:


That makes sense, but I needed to register the operator here; otherwise the tests I added fail, since there is no Python-side reference to _clone_dim_order.

@keyprocedure (Contributor, Author) replied:


Oh, should all the tests for this PR have been on the kernel side and not in test_memory_format_ops_pass.py?

@Gasoonjia (Contributor) replied:


This PR should focus only on runtime changes, so nothing on the Python side will refer to the new operator.

> should all the tests for this PR have been on the kernel side and not in test_memory_format_ops_pass.py

Yes, absolutely correct! This PR is only for the portable kernel and its tests. Sorry for any confusion!

@keyprocedure (Contributor, Author) commented on Aug 5, 2025:


I’ve added the runtime test and all tests passed locally. I couldn’t run the DynamicShapeUnbound test since it depends on SupportedFeatures and supported_features.h doesn’t seem to be generated in OSS builds.

@keyprocedure (Contributor, Author) commented:


Please disregard my previous comment about the missing SupportedFeatures dependency; the issue was with my local build setup. All tests pass now.

@keyprocedure keyprocedure marked this pull request as ready for review August 5, 2025 23:46
@Gasoonjia (Contributor) left a comment:


LGTM! Some minor feedback, but the majority looks good.

}
}

/* %python
@Gasoonjia (Contributor) commented on Aug 6, 2025:


Please remove comments unrelated to the tests.

void test_dynamic_shape(
const std::vector<int32_t>& out_shape,
enum torch::executor::TensorShapeDynamism dynamism) {
/* %python
@Gasoonjia (Contributor) commented:


Same here

TEST_F(OpDimOrderCloneTest, ContiguousToChannelsLast) {
TensorFactory<ScalarType::Float> tf;

Tensor x = tf.make_with_dimorder(
@Gasoonjia (Contributor) commented:


x is now using the default contiguous dim order (0, 1, 2, 3). Please add a comment here to clarify.

0.3597, 0.0911, 0.7719, 0.8151, 0.4296, 0.5552},
/*dim_order=*/{0, 2, 3, 1});

Tensor expected = tf.make_with_dimorder(
@Gasoonjia (Contributor) commented:


Same here; please add a clarifying comment.

@keyprocedure (Contributor, Author) commented on Aug 6, 2025:

@Gasoonjia Everything runs fine locally, but CI failed because of a missing dependency in copy_ops_util.h:
fatal error: 'executorch/kernels/portable/cpu/util/broadcast_util.h' file not found.

This happened after I refactored a function into copy_ops_util.h and added the include. I added the dependency to targets.bzl, but CI still fails. Do you have any insight into what I could be missing?

@keyprocedure (Contributor, Author) commented:

> @Gasoonjia Everything runs fine locally, but CI failed because of a missing dependency in copy_ops_util.h: fatal error: 'executorch/kernels/portable/cpu/util/broadcast_util.h' file not found.
>
> This happened after I refactored a function into copy_ops_util.h and added the include. I added the dependency to targets.bzl, but CI still fails. Do you have any insight into what I could be missing?

I think I got it: broadcast_util wasn't being exported, so I added it to exported_deps for copy_ops_util.
Since I was building with CMake earlier, the missing Buck dependency never surfaced, but after the fix I was able to build all portable kernels locally with buck2.

Can we try CI again?

@Gasoonjia (Contributor) commented:

@keyprocedure So glad you've fixed the issue! Sorry for the late review. Restarting CI.

@keyprocedure (Contributor, Author) commented:

> So glad you've fixed the issue! Sorry for the late review. Restarting CI.

No worries, I appreciate all the support :)

Progress with CI:
The 10 failing tests were all due to a link error:
Action failed: root//examples/portable/executor_runner:executor_runner (cxx_link_executable)
I've registered _clone_dim_order in op_registration_util.bzl and I can successfully build executor_runner locally.

Do you have any recommendations on which targets I should build locally to make sure everything that relies on new ops builds successfully? Or is there a Docker image available for running CI locally? I've tried building entire directories such as //examples/portable/..., but I run into dependency failures unrelated to this PR.

@@ -1329,6 +1329,13 @@ ATEN_OPS = (
"//executorch/kernels/portable/cpu/util:copy_ops_util",
@Gasoonjia (Contributor) commented:


We can remove broadcast_util here, right? Since copy_ops_util will depend on it.

@keyprocedure (Contributor, Author) replied:


Good catch. I'll remove it here and in op__to_dim_order_copy.cpp, then push once CI finishes.

@keyprocedure (Contributor, Author) replied:


Should I still push the change removing the unused broadcast_util dep, or save it for a follow-up?

@Gasoonjia (Contributor) commented:

Thanks for your great work, @keyprocedure! I don't think there's a single target you can build to run CI locally. Thanks for bringing it up! It's a good suggestion we can work on in the future for better local CI coverage.

@Gasoonjia (Contributor) commented:

BTW, have you tried main/CONTRIBUTING.md#testing? @keyprocedure
@keyprocedure (Contributor, Author) replied:

> BTW, have you tried main/CONTRIBUTING.md#testing? @keyprocedure

Thanks for sharing this!

After stopping some warnings from causing the build to fail, I ran the test script, and everything passes except the PyTree EmptySpec test, which seems unrelated.

I'll use this script to validate future changes.

@Gasoonjia (Contributor) commented:

CI looks good. Stamped!

@Gasoonjia merged commit 3a02146 into pytorch:main on Aug 11, 2025. 100 of 102 checks passed.
Labels: CLA Signed, release notes: none