Closed
Description
Repeated execution of TorchScript modules consistently fails within about 30 seconds of runtime, either hanging or raising a CLR fatal error, apparently due to memory corruption.
This has now been narrowed down to a fairly minimal repro, which hangs or gives a CLR fatal error within about 20 seconds. It repeatedly runs a medium-sized (26-million-parameter) neural network, interleaved with code that exercises .NET managed allocation and garbage collection.
This is running the latest version of TorchSharp (0.100.3) on Windows 11 with CUDA 12.2. The TorchScript file is too large to upload easily (56 MB), and this may well happen with any TorchScript module.
public static void ReplicateFailure()
{
    var module = torch.jit.load<Tensor, (Tensor, Tensor, Tensor, Tensor)>(@"c:\temp\fail.ts", DeviceType.CUDA, 0).to(ScalarType.Float16);
    for (int bs = 1; bs < 133; bs += 3)
    {
        Tensor input = torch.tensor(new float[bs * 64 * 135], ScalarType.Float16, new Device("cuda:0"), false)
                            .reshape(new long[] { bs, 64, 135 });
        for (int j = 0; j < 100; j++)
        {
            using (var dx = torch.NewDisposeScope())
            {
                module.call(input);
            }
            ExerciseGC(j);
        }
        Console.WriteLine(bs + " done loop this batch size ");
    }
}

public static void ExerciseGC(int index)
{
    Console.WriteLine("in " + index);
    object[] objs = new object[20_000];
    for (int i = 0; i < 20_000; i += 999)
    {
        objs[i] = new byte[i * 10];
    }
    Console.WriteLine("out");
}
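One way to check whether the failure tracks garbage collection itself, rather than allocation volume, might be to call a helper that forces a full collection in place of ExerciseGC. The following is only a sketch under that assumption: the GcProbe class and ForceFullGC method are hypothetical names that are not part of the original repro, and only standard System.GC calls are used.

```csharp
using System;

public static class GcProbe
{
    // Hypothetical helper (not in the original repro): forces a full blocking
    // collection, runs pending finalizers, then collects again so that objects
    // released by those finalizers are reclaimed too.
    // Returns the number of collections of the oldest generation so far.
    public static int ForceFullGC()
    {
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true);
        GC.WaitForPendingFinalizers();
        GC.Collect();
        return GC.CollectionCount(GC.MaxGeneration);
    }
}
```

If the hang or CLR fatal error appears sooner with GcProbe.ForceFullGC() in the inner loop, that would suggest the corruption is tied to collection or finalization rather than to the allocation pattern; if nothing changes, the GC interaction may be incidental.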