
@lantiga lantiga commented Jun 28, 2020

This PR addresses #370. It brings:

  • the ability to define DAGs involving multiple devices for SCRIPTRUN and MODELRUN
  • parallel execution of commands within a DAG if they are assigned to different devices

The way DAGs are handled has been refactored. This opens the door for #328 (enabling batching for individual DAG commands).

This is a WIP; a few items are still TODO:

  • the implementation so far doesn't handle reusing (overwriting) key names, which currently need to be unique; we need to add local key mangling to solve this transparently
  • analyze performance and make sure threads are spin-locking correctly
  • make sure there are no crashes
  • increase testing
  • analyze potential leaks
  • documentation

The way we allow DAG operations to run on different devices in parallel (when possible) is the following: instead of running the whole DAG in one sweep as in the previous implementation, the DAG run info is created on one queue/device and shallow-copied (appropriately) across the other queues/devices indicated by the DAG specification. A DAG mutex is shared across all copies.
The DAG run info is placed on the queue for each device and evicted for execution. Execution happens one DAG op at a time: once an individual op has executed, it is marked as such and the DAG run info is placed back on the queue.
The current unexecuted op is checked for its inputs: if all inputs are found in the tensor context, the DAG op can be executed; if not, execution quits and control is given back to the worker. If there are other items in the queue, the op is placed after the next item.
When all ops for a device have been executed, the DAG is not placed back on that device's queue. When all ops in the DAG have been executed or an error occurs, the client is unblocked.

@lantiga lantiga marked this pull request as draft June 28, 2020 22:12
@lantiga lantiga force-pushed the multidevice_dag branch from 1a060a0 to ba3f396 Compare July 7, 2020 21:20
codecov bot commented Jul 15, 2020

Codecov Report

Merging #424 into master will increase coverage by 1.40%.
The diff coverage is 90.84%.


@@            Coverage Diff             @@
##           master     #424      +/-   ##
==========================================
+ Coverage   73.26%   74.67%   +1.40%     
==========================================
  Files          21       21              
  Lines        4579     4849     +270     
==========================================
+ Hits         3355     3621     +266     
- Misses       1224     1228       +4     
Impacted Files Coverage Δ
src/config.c 28.57% <0.00%> (ø)
src/backends/util.c 38.46% <33.33%> (-3.21%) ⬇️
src/script.c 63.52% <66.66%> (-0.45%) ⬇️
src/model.c 70.80% <71.42%> (+2.04%) ⬆️
src/redisai.c 79.32% <85.00%> (+1.35%) ⬆️
src/background_workers.c 84.44% <88.63%> (+0.93%) ⬆️
src/dag.c 89.15% <95.13%> (+0.84%) ⬆️
src/run_info.c 88.37% <95.58%> (+2.03%) ⬆️
src/model_script_run_session.c 91.42% <100.00%> (-0.24%) ⬇️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3019629...a51b1f5.

@lantiga lantiga force-pushed the multidevice_dag branch 3 times, most recently from c343883 to d62fb4b Compare July 15, 2020 14:24
@lantiga lantiga changed the title [WIP] Multi-device parallel dag Multi-device parallel dag Jul 16, 2020
@lantiga lantiga marked this pull request as ready for review July 16, 2020 09:05
@lantiga lantiga requested review from DvirDukhan and filipecosta90 and removed request for DvirDukhan July 16, 2020 21:39
exception = e
env.assertEqual(type(exception), redis.exceptions.ResponseError)
env.assertEqual("tensor key is empty", exception.__str__())
env.assertEqual("Number of names given as OUTPUTS during MODELSET and keys given as INPUTS here do not match", exception.__str__())


Suggested change
env.assertEqual("Number of names given as OUTPUTS during MODELSET and keys given as INPUTS here do not match", exception.__str__())
env.assertEqual("Number of names given as OUTPUTS during MODELSET and keys given as OUTPUTS here do not match", exception.__str__())


ret = con.execute_command(
'AI.DAGRUN_RO', '|>',
'AI.DAGRUN_RO', # '|>',


#?

exception = e
env.assertEqual(type(exception), redis.exceptions.ResponseError)
env.assertEqual("ERR unsupported command within DAG",exception.__str__())
env.assertEqual("INPUTS not specified",exception.__str__())


why INPUTS not specified?

Comment on lines 58 to 60
for (long long i=0; i<strlen(output); i++) {
output[i] = toupper(output[i]);


get the string length outside the for loop, or change it into a while loop:

long long i = 0;
while (output[i]) {
    output[i] = toupper(output[i]);
    i++;
}

Comment on lines 227 to 233
if (use_local_context == 1) {
pthread_mutex_lock(evicted_rinfo->dagMutex);
dagError = *evicted_rinfo->dagError;
pthread_mutex_unlock(evicted_rinfo->dagMutex);
}
if (use_local_context == 1 && dag_complete == 0 && !dagError) {


all you care about here is RunInfo with use_local_context == 1 right?

  1. can you collect them from the evicted items without locking the run_queue_mutex
  2. you can avoid the second if statement since the condition above is not valid

I guess reintroducing items to the queue needs to be synched, right?

Contributor Author

The thing is that dagError is a pointer that is shared between worker threads (there are copies of rinfo in every pertinent queue that share some of the items, like dagMutex, dagError and the dagOps - see https://github.com/RedisAI/RedisAI/blob/multidevice_dag/src/run_info.c#L147), so we indeed need to acquire the lock. The rest of the condition needs to happen only if no errors were generated in the step.
Would you simplify it further?

src/dag.c Outdated
RedisModule_UnblockClient(rinfo->client, rinfo);
}
pthread_mutex_unlock(rinfo->dagMutex);
return NULL;


change function signature to return void

src/dag.c Outdated
int dag_error = 0;
char *detail_oneline;

for (size_t i = 0; i < array_len(rinfo->dagOps); i++) {


use array_len(rinfo->dagOps) as variable

src/dag.c Outdated
array_free(rinfo_copies);

return REDISMODULE_OK;
}


add a line

Contributor Author

Sorry I'm not sure I get this one

Contributor Author

Oh I get it now :-)

"ERR PERSIST key cannot be found in DAG");
}
int *instance = AI_dictGetVal(mangled_entry);
RedisModuleString *mangled_key = RedisModule_CreateStringPrintf(ctx, "%s%04d", key, *instance);


possible leak?

Contributor Author

Nice catch: now that we mangle, we need to free the inkeys and outkeys strings when freeing the ops.

src/dag.c Outdated

const char* master_device = rinfo->dagOps[array_len(rinfo->dagOps)-1]->devicestr;

RedisAI_RunInfo **rinfo_copies = array_new(RedisAI_RunInfo*, 10);


allocate with capacity as the number of devices

Collaborator
@filipecosta90 filipecosta90 left a comment

@lantiga let's push together on increased testing for the uncovered parts that were added, and also include documentation for the added methods (you already describe the changes in the PR, so we just need to add the relevant descriptions to the code as well).

src/run_info.h Outdated
int dagReplyLength;
int dagNumberCommands;
int *dagError;
pthread_mutex_t* dagMutex;
Collaborator

explain the necessity of this addition

src/dag.c Outdated
currentOp->result = REDISMODULE_ERR;
break;
}
void *RedisAI_DagRunSession_TensorSet_Step(RedisAI_RunInfo *rinfo, RAI_DagOp *currentOp, int *progress) {
Collaborator

document me

src/dag.c Outdated
return NULL;
}

void *RedisAI_DagRunSession_TensorGet_Step(RedisAI_RunInfo *rinfo, RAI_DagOp *currentOp, int *progress) {
Collaborator

document me

src/dag.c Outdated
return NULL;
}

void *RedisAI_DagRunSession_ModelRun_Step(RedisAI_RunInfo *rinfo, RAI_DagOp *currentOp, int *progress) {
Collaborator

document me

src/dag.c Outdated
return NULL;
}

void *RedisAI_DagRunSession_ScriptRun_Step(RedisAI_RunInfo *rinfo, RAI_DagOp *currentOp, int *progress) {
Collaborator

document me

RAI_Tensor *inputTensor;
const int get_result = RAI_getTensorFromLocalContext(
NULL, rinfo->dagTensorsContext, RedisModule_StringPtrLen(currentOp->inkeys[i], NULL), &inputTensor, currentOp->err);
if (get_result == REDISMODULE_ERR) {
Collaborator

Add testcase for it. coverage report pointing this line as not covered

src/dag.c Outdated
}

for (int i=0; i<array_len(currentOp->outkeys); i++) {
if (!RAI_ScriptRunCtxAddOutput(currentOp->sctx)) {
Collaborator

Add testcase for it. coverage report pointing this line as not covered

src/dag.c Outdated

if (*rinfo->dagError == 1 && rinfo->client != NULL) {
RedisModule_UnblockClient(rinfo->client, rinfo);
pthread_mutex_unlock(rinfo->dagMutex);
Collaborator

Add testcase for it. coverage report pointing this line as not covered

struct RedisAI_RunStats *rstats = NULL;
const char *runkey = RedisModule_StringPtrLen(currentOp->runkey, NULL);
RAI_GetRunStats(runkey,&rstats);
if (currentOp->result == REDISMODULE_ERR) {
Collaborator

Add testcase for it. coverage report pointing this line as not covered

queueUnpop(run_queue_info->run_queue, evicted_rinfo);
}
else {
if (queueLength(run_queue_info->run_queue) > 0) {
Collaborator

testcase for it. codecov report showcasing we're not covering this

@lantiga
Contributor Author

lantiga commented Jul 28, 2020

During coverage increase I unearthed a race condition that can lead to a crash.

There's a portion of a test in test/tests_dag.py, function test_dagrun_modelrun_multidevice_resnet_ensemble_alias, that has been commented out.

What happens is the following:

model1 -> script (with error)
model2 ---> 

The script errors out and unblocks the client, but model2 is still running. We should indeed only unblock when all threads are done processing. This already happens under normal circumstances, but when errors are generated in between, with certain DAG topologies this is not explicitly checked for. We need to take care of this prior to merging the PR.

/cc @DvirDukhan @filipecosta90

@lantiga
Contributor Author

lantiga commented Jul 31, 2020

The edge case has now been fixed. Unblocking is now controlled through ref counting.

Collaborator
@filipecosta90 filipecosta90 left a comment

LGTM. Let's merge it and address the pending issue with this already in place.
