TorchScript execution failures (hangs, access violation, "Fatal error. Internal CLR fatal error. (0x80131506)") #1047

Repeated execution of TorchScript modules consistently fails in one of these ways within about 30 seconds of runtime, apparently due to memory corruption.

This has now been narrowed down to a fairly minimal repro. It hangs or produces the CLR fatal error within about 20 seconds when repeatedly running a medium-sized (26M parameter) neural network, interleaved with allocations that exercise the .NET garbage collector.

This is running the latest version of TorchSharp (0.100.3) on Windows 11 with CUDA 12.2. The TorchScript file is too large to upload easily (56MB), and this would possibly happen with any TorchScript module.

```
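// Usings assumed by this snippet (not shown in the original report).
using System;
using TorchSharp;
using static TorchSharp.torch;

public static void ReplicateFailure()
{
  // Load the TorchScript module onto CUDA device 0 and convert it to half precision.
  var module = torch.jit.load<Tensor, (Tensor, Tensor, Tensor, Tensor)>(@"c:\temp\fail.ts", DeviceType.CUDA, 0).to(ScalarType.Float16);

  // Sweep over a range of batch sizes.
  for (int bs = 1; bs < 133; bs += 3)
  {
    // All-zero half-precision input of shape [bs, 64, 135] on the GPU.
    Tensor input = torch.tensor(new float[bs * 64 * 135], ScalarType.Float16, new Device("cuda:0"), false)
                        .reshape(new long[] { bs, 64, 135 });
    for (int j = 0; j < 100; j++)
    {
      // Run the module; the outputs are disposed when the scope ends.
      using (var dx = torch.NewDisposeScope())
      {
        module.call(input);
      }
      // Interleave managed allocations between module calls.
      ExerciseGC(j);
    }
    Console.WriteLine(bs + " done loop this batch size ");
  }
}

// Allocates byte arrays of increasing size to put pressure on the .NET garbage collector.
public static void ExerciseGC(int index)
{
  Console.WriteLine("in " + index);
  object[] objs = new object[20_000];
  for (int i = 0; i < 20_000; i += 999)
  {
    objs[i] = new byte[i * 10];
  }
  Console.WriteLine("out");
}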
```
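If the failure really is tied to garbage collection (an assumption, not something confirmed here), a variant of `ExerciseGC` that forces full blocking collections might make the repro more deterministic. The sketch below uses only standard .NET `GC` calls; `GcStress` and `ForceFullGC` are hypothetical names that are not part of the original repro, and it has not been verified that this changes how quickly the failure appears.

```csharp
using System;

public static class GcStress
{
    // Hypothetical helper: allocates the same garbage as ExerciseGC, then forces
    // full blocking collections and drains the finalizer queue.
    // Returns the number of gen-2 collections that happened during the call.
    public static int ForceFullGC(int index)
    {
        Console.WriteLine("in " + index);
        int before = GC.CollectionCount(2);

        object[] objs = new object[20_000];
        for (int i = 0; i < 20_000; i += 999)
        {
            objs[i] = new byte[i * 10];
        }
        GC.KeepAlive(objs); // keep the arrays reachable until this point

        GC.Collect();                  // full blocking collection of all generations
        GC.WaitForPendingFinalizers(); // let any pending finalizers run
        GC.Collect();                  // collect objects released by those finalizers

        Console.WriteLine("out");
        return GC.CollectionCount(2) - before;
    }
}
```

Calling `GcStress.ForceFullGC(j)` in place of `ExerciseGC(j)` in the inner loop would show whether the crash tracks collections rather than allocation volume.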
