Failed SCRIPTRUN with redis cluster #437

@Spartee

Description

Hello, we are investigating running RedisAI on our clusters to perform inference at scale. We have working examples of both the Bash and Python clients against a single instance of Redis with RedisAI.

We noticed that the RedisModule_CreateCommand call for AI.SCRIPTRUN does not include no-cluster in its flags string, so we assumed that this command would work in cluster mode. Is this assumption correct?

We tried to run AI.SCRIPTRUN on a three-node cluster and received the error below. AI.TENSORGET, AI.TENSORSET, AI.MODELSET, AI.MODELGET, AI.SCRIPTSET, and AI.SCRIPTGET all work with redis-cli in cluster mode (e.g. with -c).

Client Commands Run

./redis-cli -c -h 10.128.0.133 -x AI.MODELSET mnist TORCH GPU BLOB < ../../code/mnist_cnn.pt
./redis-cli -c -h 10.128.0.133 -x AI.SCRIPTSET script GPU SOURCE < ../../code/data_processing_script.txt
./redis-cli -c -h 10.128.0.133 -x AI.TENSORSET image FLOAT 1 1 28 28 BLOB < ../../data/one.raw
./redis-cli -c -h 10.128.0.133 AI.SCRIPTRUN script pre_process_3ch INPUTS image OUPUTS temp
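
One thing that may be worth checking in cluster mode is which hash slot each key in the commands above maps to, since Redis Cluster generally expects all keys of a single command to live in the same slot. The sketch below reimplements the documented slot rule (CRC16/XMODEM of the key, modulo 16384, with {hash tag} support) in plain Python. The function names and the {mnist} tag are illustrative assumptions, not part of RedisAI or redis-cli, and whether slot placement has anything to do with this crash is not established.

```python
def crc16_xmodem(data):
    """CRC16 with polynomial 0x1021 and initial value 0 (the XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key):
    """Return the Redis Cluster hash slot (0..16383) for a key name."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty {tag} restricts hashing to the tag contents.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return crc16_xmodem(data) % 16384


# Key names taken from the commands above.
for name in ("script", "image", "temp"):
    print(name, key_slot(name))

# Hypothetical renamed keys sharing a hash tag, so they land in one slot.
for name in ("{mnist}script", "{mnist}image", "{mnist}temp"):
    print(name, key_slot(name))
```

If the three plain key names report different slots, the script, its input, and its output are stored on different nodes, and wrapping a common prefix in braces is the usual way to co-locate them.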

Hardware

  • Redis Cluster (per node, 3 total)
    • Ivybridge CPU with Tesla K40 GPUs

Software

  • Cudatoolkit - 10.2
  • CUDNN - 7.6.5
  • Redis - 6.0.6
  • Torch - 1.5.0
  • RedisAI - master 3019629

Error dump


8198:C 29 Jul 2020 12:53:28.804 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
8198:C 29 Jul 2020 12:53:28.817 # Redis version=6.0.6, bits=64, commit=12b9de98, modified=1, pid=8198, just started
8198:C 29 Jul 2020 12:53:28.817 # Configuration loaded
8198:M 29 Jul 2020 12:53:28.818 * Increased maximum number of open files to 10032 (it was originally set to 4096).
8198:M 29 Jul 2020 12:53:28.820 # Not listening to IPv6: unsupported
8198:M 29 Jul 2020 12:53:28.831 * Node configuration loaded, I'm 55eb07013d06363c451fb895921af8c4d538818c
8198:M 29 Jul 2020 12:53:28.833 # Not listening to IPv6: unsupported
8198:M 29 Jul 2020 12:53:28.833 * Running mode=cluster, port=6379.
8198:M 29 Jul 2020 12:53:28.834 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
8198:M 29 Jul 2020 12:53:28.834 # Server initialized
8198:M 29 Jul 2020 12:53:28.834 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
8198:M 29 Jul 2020 12:53:28.835 * <ai> Redis version found by RedisAI: 6.0.6 - oss
8198:M 29 Jul 2020 12:53:28.835 * <ai> RedisAI version 999999, git_sha=30196294c6f240e80ff4b64aa00bc54098ec81fb
8198:M 29 Jul 2020 12:53:28.836 * Module 'ai' loaded from /lus/snx11108/spartee/RedisAI/install-gpu/redisai.so
8198:M 29 Jul 2020 12:53:28.836 * Ready to accept connections
8198:M 29 Jul 2020 12:53:33.828 # configEpoch set to 1 via CLUSTER SET-CONFIG-EPOCH
8198:M 29 Jul 2020 12:53:33.840 # IP address for this node updated to 10.128.0.133
8198:M 29 Jul 2020 12:53:38.760 # Cluster state changed: ok
8198:M 29 Jul 2020 13:23:49.047 * Marking node b2df1b600e856bcbc727088271d02a0c77b043bb as failing (quorum reached).
8198:M 29 Jul 2020 13:23:49.064 # Cluster state changed: fail
8198:M 29 Jul 2020 13:24:19.123 * Clear FAIL state for node b2df1b600e856bcbc727088271d02a0c77b043bb: is reachable again and nobody is serving its slots after some time.
8198:M 29 Jul 2020 13:24:19.123 # Cluster state changed: ok


=== REDIS BUG REPORT START: Cut & paste starting from here ===
8198:M 29 Jul 2020 13:46:28.934 # Redis 6.0.6 crashed by signal: 11
8198:M 29 Jul 2020 13:46:28.935 # Crashed running the instruction at: 0x30085b
8198:M 29 Jul 2020 13:46:28.936 # Accessing address: 0x18
8198:M 29 Jul 2020 13:46:28.936 # Failed assertion: <no assertion failed> (<no file>:0)

------ STACK TRACE ------
EIP:
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](RM_OpenKey+0x1b)[0x30085b]

Backtrace:
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](logStackTrace+0x4d)[0x2d154d]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](sigsegvHandler+0xde)[0x2d1b3e]
/lib64/libpthread.so.0(+0x132d0)[0x7ffff48572d0]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](RM_OpenKey+0x1b)[0x30085b]
/lus/snx11108/spartee/RedisAI/install-gpu/redisai.so(RAI_GetScriptFromKeyspace+0x1a)[0x7ffff7fded2a]
/lus/snx11108/spartee/RedisAI/install-gpu/redisai.so(RedisAI_ScriptRun_RedisCommand+0x9e)[0x7ffff7fd5a7e]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](moduleGetCommandKeysViaAPI+0x6f)[0x2fe64f]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](getNodeByQuery+0xfc)[0x2df5ec]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](processCommand+0x354)[0x289704]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](processInputBuffer+0x227)[0x29c387]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster][0x31febc]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](aeProcessEvents+0x243)[0x280a63]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](aeMain+0x1d)[0x280dfd]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](main+0x70c)[0x28cdec]
/lib64/libc.so.6(__libc_start_main+0xea)[0x7ffff44ad34a]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](_start+0x2a)[0x27ce4a]

------ INFO OUTPUT ------
# Server
redis_version:6.0.6
redis_git_sha1:12b9de98
redis_git_dirty:1
redis_build_id:1ed41d4f80e18501
redis_mode:cluster
os:Linux 4.12.14-150.17_5.0.91-cray_ari_c x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:4.2.1
process_id:8198
run_id:589d480aaa9287b823c1be290e4c25290569dba2
tcp_port:6379
uptime_in_seconds:3180
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:2212868
executable:/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server
config_file:

# Clients
connected_clients:1
client_recent_max_input_buffer:4
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0

# Memory
used_memory:1492776
used_memory_human:1.42M
used_memory_rss:13660160
used_memory_rss_human:13.03M
used_memory_peak:6259880
used_memory_peak_human:5.97M
used_memory_peak_perc:23.85%
used_memory_overhead:1408416
used_memory_startup:1408344
used_memory_dataset:84360
used_memory_dataset_perc:99.91%
allocator_allocated:1440304
allocator_active:13622272
allocator_resident:13622272
total_system_memory:33699737600
total_system_memory_human:31.39G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:9.46
allocator_frag_bytes:12181968
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.00
rss_overhead_bytes:37888
mem_fragmentation_ratio:9.48
mem_fragmentation_bytes:12219856
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:0
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:1
rdb_bgsave_in_progress:0
rdb_last_save_time:1596045208
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# Stats
total_connections_received:12
total_commands_processed:22
instantaneous_ops_per_sec:0
total_net_input_bytes:9679038
total_net_output_bytes:111224
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:21
evicted_keys:0
keyspace_hits:1
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0

# Replication
role:master
connected_slaves:0
master_replid:899c837325930f9b4359c09df1883ae5cc0d0ed2
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:0.949095
used_cpu_user:0.712894
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000

# Modules
module:name=ai,ver=999999,api=1,filters=0,usedby=[],using=[],options=[]

# Commandstats
cmdstat_cluster:calls=9,usec=1962,usec_per_call=218.00
cmdstat_command:calls=5,usec=2562,usec_per_call=512.40
cmdstat_ai.tensorget:calls=1,usec=23,usec_per_call=23.00
cmdstat_ai.tensorset:calls=1,usec=23,usec_per_call=23.00
cmdstat_info:calls=4,usec=159,usec_per_call=39.75
cmdstat_keys:calls=2,usec=7,usec_per_call=3.50

# Cluster
cluster_enabled:1

# Keyspace
db0:keys=1,expires=0,avg_ttl=0

------ CLIENT LIST OUTPUT ------
id=17 addr=10.128.0.6:33572 fd=11 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=102 qbuf-free=32666 obl=0 oll=0 omem=0 events=r cmd=ai.scriptrun user=default

------ CURRENT CLIENT INFO ------
id=17 addr=10.128.0.6:33572 fd=11 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=102 qbuf-free=32666 obl=0 oll=0 omem=0 events=r cmd=ai.scriptrun user=default
argv[0]: 'AI.SCRIPTRUN'
argv[1]: 'script'
argv[2]: 'pre_process_3ch'
argv[3]: 'INPUTS'
argv[4]: 'image'
argv[5]: 'OUPUTS'
argv[6]: 'temp'

------ REGISTERS ------
8198:M 29 Jul 2020 13:46:28.967 # 
RAX:0000000000000000 RBX:00007fffffff5330
RCX:00007fffffff5328 RDX:0000000000000000
RDI:00007fffffff5368 RSI:00000000005609f0
RBP:00000000005609f0 RSP:00007fffffff52a0
R8 :0000000000000001 R9 :0000000000000002
R10:000000000053ebeb R11:0000000000000008
R12:0000000000000000 R13:000000000029a501
R14:00007fffffff5368 R15:0000000000000001
RIP:000000000030085b EFL:0000000000010246
CSGSFS:002b000000000033
8198:M 29 Jul 2020 13:46:28.967 # (00007fffffff52af) -> 0000000000000000
8198:M 29 Jul 2020 13:46:28.967 # (00007fffffff52ae) -> 0000000000000000
8198:M 29 Jul 2020 13:46:28.967 # (00007fffffff52ad) -> 00007fffffff5348
8198:M 29 Jul 2020 13:46:28.967 # (00007fffffff52ac) -> 0000000000000000
8198:M 29 Jul 2020 13:46:28.968 # (00007fffffff52ab) -> 0000000b00560a80
8198:M 29 Jul 2020 13:46:28.968 # (00007fffffff52aa) -> 0000000000000000
8198:M 29 Jul 2020 13:46:28.968 # (00007fffffff52a9) -> 00007ffff7fd5a7e
8198:M 29 Jul 2020 13:46:28.969 # (00007fffffff52a8) -> 00007fffffff5368
8198:M 29 Jul 2020 13:46:28.969 # (00007fffffff52a7) -> 0000000000000001
8198:M 29 Jul 2020 13:46:28.969 # (00007fffffff52a6) -> 0000000000555980
8198:M 29 Jul 2020 13:46:28.969 # (00007fffffff52a5) -> 00007ffff7fded2a
8198:M 29 Jul 2020 13:46:28.969 # (00007fffffff52a4) -> 0000000000000007
8198:M 29 Jul 2020 13:46:28.970 # (00007fffffff52a3) -> 00007fffffff5368
8198:M 29 Jul 2020 13:46:28.970 # (00007fffffff52a2) -> 00007fffffff5328
8198:M 29 Jul 2020 13:46:28.970 # (00007fffffff52a1) -> 0000000000000000
8198:M 29 Jul 2020 13:46:28.970 # (00007fffffff52a0) -> 00007fffffff5330

------ MODULES INFO OUTPUT ------
# ai_git
ai_git_sha:30196294c6f240e80ff4b64aa00bc54098ec81fb

# ai_load_time_configs
ai_threads_per_queue:1
ai_inter_op_parallelism:0
ai_intra_op_parallelism:0

------ FAST MEMORY TEST ------
8198:M 29 Jul 2020 13:46:28.971 # Bio thread for job type #0 terminated
8198:M 29 Jul 2020 13:46:28.972 # Bio thread for job type #1 terminated
8198:M 29 Jul 2020 13:46:28.972 # Bio thread for job type #2 terminated
*** Preparing to test memory region 35a000 (14815232 bytes)
*** Preparing to test memory region 7ffff19e5000 (8388608 bytes)
*** Preparing to test memory region 7ffff21e6000 (8388608 bytes)
*** Preparing to test memory region 7ffff29e7000 (8388608 bytes)
*** Preparing to test memory region 7ffff31e8000 (8388608 bytes)
*** Preparing to test memory region 7ffff3db8000 (12288 bytes)
*** Preparing to test memory region 7ffff4840000 (16384 bytes)
*** Preparing to test memory region 7ffff4a5f000 (16384 bytes)
*** Preparing to test memory region 7ffff4f7b000 (8192 bytes)
*** Preparing to test memory region 7ffff520b000 (20480 bytes)
*** Preparing to test memory region 7ffff64a8000 (12288 bytes)
*** Preparing to test memory region 7ffff767e000 (86016 bytes)
*** Preparing to test memory region 7ffff7f97000 (40960 bytes)
*** Preparing to test memory region 7ffff7fee000 (4096 bytes)
*** Preparing to test memory region 7ffff7ff6000 (8192 bytes)
*** Preparing to test memory region 7ffff7ffe000 (4096 bytes)
*** Preparing to test memory region 7fffffff6000 (36864 bytes)
.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O.O
Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.

------ DUMPING CODE AROUND EIP ------
Symbol: RM_OpenKey (base: 0x300840)
Module: /lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster] (base 0x200000)
$ xxd -r -p /tmp/dump.hex /tmp/dump.bin
$ objdump --adjust-vma=0x300840 -D -b binary -m i386:x86-64 /tmp/dump.bin
------
8198:M 29 Jul 2020 13:46:29.100 # dump of function (hexdump of 155 bytes):
55415741564154534189d74889f54989fec1ea1083e201488b4710488b781841f6c7027514e83621faff4989c44885c0750f31dbe9b3000000e8b222faff4989c4bf78000000e815f6f8ff4889c34c8930498b4610488b40184889430848896b104889efe827f1f9ff4c89631848c743200000000044897b28c7432c0000000048c7436800000000c743700100000041f64630027456418b4e2c41

=== REDIS BUG REPORT END. Make sure to include from START to END. ===

       Please report the crash by opening an issue on github:

           http://github.com/redis/redis/issues

  Suspect RAM error? Use redis-server --test-memory to verify it.
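
The frames of interest in the stack trace above can be pulled out with a few lines of Python. This is only a reading aid: the helper name frame_symbols and the regular expression are illustrative, and the sample frames are copied from the report.

```python
import re

# Matches "(symbol+0xOFFSET)[0xADDRESS]" at the end of a backtrace frame.
FRAME_RE = re.compile(r"\((\w+)\+0x[0-9a-f]+\)\[0x[0-9a-f]+\]")


def frame_symbols(backtrace):
    """Return the symbol names found in a Redis crash backtrace, innermost first."""
    symbols = []
    for line in backtrace.splitlines():
        match = FRAME_RE.search(line)
        if match:  # frames without a symbol name are skipped
            symbols.append(match.group(1))
    return symbols


trace = """\
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](RM_OpenKey+0x1b)[0x30085b]
/lus/snx11108/spartee/RedisAI/install-gpu/redisai.so(RAI_GetScriptFromKeyspace+0x1a)[0x7ffff7fded2a]
/lus/snx11108/spartee/RedisAI/install-gpu/redisai.so(RedisAI_ScriptRun_RedisCommand+0x9e)[0x7ffff7fd5a7e]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](moduleGetCommandKeysViaAPI+0x6f)[0x2fe64f]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](getNodeByQuery+0xfc)[0x2df5ec]
/lus/scratch/spartee/code/../redis-6.0.6/src/redis-server *:6379 [cluster](processCommand+0x354)[0x289704]
"""

chain = frame_symbols(trace)
print(" <- ".join(chain))
```

Read innermost first, the chain shows the AI.SCRIPTRUN handler being entered from moduleGetCommandKeysViaAPI, which sits under getNodeByQuery. That suggests the crash happens while Redis is asking the module for the command's key names for cluster routing, rather than during normal execution of the script, though this is an interpretation of the trace and not a confirmed diagnosis.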

Here is the script we are setting to preprocess MNIST images (adapted from the imagenet example in the redisai-examples repo):

def pre_process_3ch(image):
    mean = torch.zeros(1).float().to(image.device)
    std = torch.zeros(1).float().to(image.device)
    mean[0] = 0.1307
    std[0] = 0.3081
    mean = mean.unsqueeze(1).unsqueeze(1)
    std = std.unsqueeze(1).unsqueeze(1)
    temp = image.float().div(28).permute(1, 0)
    return temp.sub(mean).div(std).unsqueeze(0)

Here is the model (simple torch MNIST) that we are using:

from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            if args.dry_run:
                break


def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=14, metavar='N',
                        help='number of epochs to train (default: 14)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--dry-run', action='store_true', default=False,
                        help='quickly check a single pass')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=True,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()
    print("Using cuda: ", use_cuda)

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    kwargs = {'batch_size': args.batch_size}
    if use_cuda:
        kwargs.update({'num_workers': 1,
                       'pin_memory': True,
                       'shuffle': True},
                     )

    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
        ])
    dataset1 = datasets.MNIST('../data', train=True, download=True,
                       transform=transform)
    dataset2 = datasets.MNIST('../data', train=False,
                       transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1,**kwargs)
    test_loader = torch.utils.data.DataLoader(dataset2, **kwargs)

    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(model, device, test_loader)
        scheduler.step()

    if args.save_model:
        model.eval()
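        # Create the example input on the same device as the model; Tensor.to()
        # returns a new tensor, so its result must be assigned.
        batch = torch.randn((1, 1, 28, 28)).to(device)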
        traced_model = torch.jit.trace(model, batch)
        torch.jit.save(traced_model, 'mnist_cnn.pt')


if __name__ == '__main__':
    main()

Any help would be greatly appreciated. Thank you!
