
TorchScript execution failures (hangs, access violation, "Fatal error. Internal CLR fatal error. (0x80131506)") #1047

@dje-dev

Repeated execution of TorchScript modules consistently fails in one of these ways (a hang, an access violation, or a fatal CLR error), apparently due to memory corruption. The failure appears within about 30 seconds of runtime.

This has now been narrowed down to a fairly minimal repro, which hangs or raises the CLR fatal error within about 20 seconds. It repeatedly runs a medium-sized (26M-parameter) neural network, interleaved with .NET allocations that exercise the garbage collector.

This is running the latest version of TorchSharp (0.100.3) on Windows 11 with CUDA 12.2. The TorchScript file is too large to upload easily (56 MB), and possibly this would happen with any TorchScript module.

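    // Requires: using System; using TorchSharp; using static TorchSharp.torch;
    public static void ReplicateFailure()
    {
      // Load the TorchScript module onto GPU 0 and convert it to half precision.
      var module = torch.jit.load<Tensor, (Tensor, Tensor, Tensor, Tensor)>(@"c:\temp\fail.ts", DeviceType.CUDA, 0).to(ScalarType.Float16);

      // Sweep over batch sizes 1, 4, 7, ... up to 130.
      for (int bs = 1; bs < 133; bs += 3)
      {
        // All-zero Float16 input of shape [bs, 64, 135] on the GPU.
        Tensor input = torch.tensor(new float[bs * 64 * 135], ScalarType.Float16, new Device("cuda:0"), false)
                            .reshape(new long[] { bs, 64, 135 });
        for (int j = 0; j < 100; j++)
        {
          // The dispose scope releases the output tensors of each call.
          using (var dx = torch.NewDisposeScope())
          {
            module.call(input);
          }
          // Interleave managed allocations between inference calls.
          ExerciseGC(j);
        }
        Console.WriteLine(bs + " done loop this batch size ");
      }
    }

    // Allocates short-lived byte arrays of increasing size (up to about 200 KB)
    // to put pressure on the .NET garbage collector.
    public static void ExerciseGC(int index)
    {
      Console.WriteLine("in " + index);
      object[] objs = new object[20_000];
      for (int i = 0; i < 20_000; i += 999)
      {
        objs[i] = new byte[i * 10];
      }
      Console.WriteLine("out");
    }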
