
@lantiga lantiga commented Jun 28, 2020

This PR addresses #370. It brings:

  • the ability to define DAGs involving multiple devices for SCRIPTRUN and MODELRUN
  • parallel execution of commands within a DAG if they are assigned to different devices

The way DAGs are handled has been refactored. This opens the door for #328 (enabling batching for individual DAG commands).

This is a WIP; a few items are still TODO:

  • the implementation so far doesn't handle reusing (overwriting) key names, which currently need to be unique; we need to add local key mangling to solve this transparently
  • analyze performance and make sure threads are spin-locking correctly
  • make sure there are no crashes
  • increase testing
  • analyze potential leaks
  • documentation

The way we allow DAG operations to run on different devices in parallel (when possible) is the following: instead of running the whole DAG in one sweep as in the previous implementation, the DAG run info is created on one queue/device and shallow-copied (appropriately) across the other queues/devices indicated by the DAG specification. A DAG mutex is shared across all copies.
The DAG run info is placed on the queue for each device and evicted for execution. Execution happens one DAG op at a time: once an individual op has executed, it is marked as such and the DAG run info is placed back on the queue.
The current unexecuted op is checked for its inputs: if all inputs are found in the tensor context, the DAG op can be executed; if not, execution quits and control is given back to the worker. If there are other items in the queue, the op is placed after the next item.
When all ops for a device have been executed, the DAG is not placed back on that device's queue. When all ops in the DAG have been executed or an error occurs, the client is unblocked.

@lantiga lantiga marked this pull request as draft June 28, 2020 22:12
@lantiga lantiga force-pushed the multidevice_dag branch from 1a060a0 to ba3f396 Compare July 7, 2020 21:20
codecov bot commented Jul 15, 2020

Codecov Report

Merging #424 into master will increase coverage by 1.40%.
The diff coverage is 90.84%.


@@            Coverage Diff             @@
##           master     #424      +/-   ##
==========================================
+ Coverage   73.26%   74.67%   +1.40%     
==========================================
  Files          21       21              
  Lines        4579     4849     +270     
==========================================
+ Hits         3355     3621     +266     
- Misses       1224     1228       +4     
Impacted Files Coverage Δ
src/config.c 28.57% <0.00%> (ø)
src/backends/util.c 38.46% <33.33%> (-3.21%) ⬇️
src/script.c 63.52% <66.66%> (-0.45%) ⬇️
src/model.c 70.80% <71.42%> (+2.04%) ⬆️
src/redisai.c 79.32% <85.00%> (+1.35%) ⬆️
src/background_workers.c 84.44% <88.63%> (+0.93%) ⬆️
src/dag.c 89.15% <95.13%> (+0.84%) ⬆️
src/run_info.c 88.37% <95.58%> (+2.03%) ⬆️
src/model_script_run_session.c 91.42% <100.00%> (-0.24%) ⬇️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3019629...a51b1f5.

@lantiga lantiga force-pushed the multidevice_dag branch 3 times, most recently from c343883 to d62fb4b Compare July 15, 2020 14:24
@lantiga lantiga changed the title [WIP] Multi-device parallel dag Multi-device parallel dag Jul 16, 2020
@lantiga lantiga marked this pull request as ready for review July 16, 2020 09:05
@lantiga lantiga requested review from DvirDukhan and filipecosta90 and removed request for DvirDukhan July 16, 2020 21:39
exception = e
env.assertEqual(type(exception), redis.exceptions.ResponseError)
env.assertEqual("tensor key is empty", exception.__str__())
env.assertEqual("Number of names given as OUTPUTS during MODELSET and keys given as INPUTS here do not match", exception.__str__())


Suggested change
env.assertEqual("Number of names given as OUTPUTS during MODELSET and keys given as INPUTS here do not match", exception.__str__())
env.assertEqual("Number of names given as OUTPUTS during MODELSET and keys given as OUTPUTS here do not match", exception.__str__())


ret = con.execute_command(
'AI.DAGRUN_RO', '|>',
'AI.DAGRUN_RO', # '|>',


#?

exception = e
env.assertEqual(type(exception), redis.exceptions.ResponseError)
env.assertEqual("ERR unsupported command within DAG",exception.__str__())
env.assertEqual("INPUTS not specified",exception.__str__())


why INPUTS not specified?

Comment on lines 58 to 60
for (long long i=0; i<strlen(output); i++) {
output[i] = toupper(output[i]);


get the string length outside the for loop, or change it into a while loop:

long long i = 0;
while (output[i]) {
    output[i] = toupper(output[i]);
    i++;
}

Comment on lines 227 to 233
if (use_local_context == 1) {
pthread_mutex_lock(evicted_rinfo->dagMutex);
dagError = *evicted_rinfo->dagError;
pthread_mutex_unlock(evicted_rinfo->dagMutex);
}
if (use_local_context == 1 && dag_complete == 0 && !dagError) {


all you care about here is RunInfo with use_local_context == 1 right?

  1. can you collect them from the evicted items without locking the run_queue_mutex
  2. you can avoid the second if statement since the condition above is not valid

I guess reintroducing items to the queue needs to be synched, right?

Contributor Author

The thing is that dagError is a pointer that is shared between worker threads (there are copies of rinfo in every pertinent queue that share some of the items, like dagMutex, dagError and the dagOps - see https://github.com/RedisAI/RedisAI/blob/multidevice_dag/src/run_info.c#L147), so we indeed need to acquire the lock. The rest of the condition needs to happen only if no errors were generated in the step.
Would you simplify it further?

src/dag.c Outdated
RedisModule_UnblockClient(rinfo->client, rinfo);
}
pthread_mutex_unlock(rinfo->dagMutex);
return NULL;


change function signature to return void

src/dag.c Outdated
int dag_error = 0;
char *detail_oneline;

for (size_t i = 0; i < array_len(rinfo->dagOps); i++) {


use array_len(rinfo->dagOps) as variable

src/dag.c Outdated
array_free(rinfo_copies);

return REDISMODULE_OK;
}


add a line

Contributor Author

Sorry I'm not sure I get this one

Contributor Author

Oh I get it now :-)

"ERR PERSIST key cannot be found in DAG");
}
int *instance = AI_dictGetVal(mangled_entry);
RedisModuleString *mangled_key = RedisModule_CreateStringPrintf(ctx, "%s%04d", key, *instance);


possible leak?

Contributor Author

Nice catch: now that we mangle, we need to free the inkeys and outkeys strings when freeing the ops.

src/dag.c Outdated

const char* master_device = rinfo->dagOps[array_len(rinfo->dagOps)-1]->devicestr;

RedisAI_RunInfo **rinfo_copies = array_new(RedisAI_RunInfo*, 10);


allocate with capacity as the number of devices

Collaborator
@filipecosta90 filipecosta90 left a comment

@lantiga let's push together on increased testing for the uncovered parts that were added, and also include documentation for the added methods (you already describe the changes in the PR, so we just need to add the relevant descriptions to the code as well).

src/run_info.h Outdated
int dagReplyLength;
int dagNumberCommands;
int *dagError;
pthread_mutex_t* dagMutex;
Collaborator

explain the necessity of this addition

src/dag.c Outdated
currentOp->result = REDISMODULE_ERR;
break;
}
void *RedisAI_DagRunSession_TensorSet_Step(RedisAI_RunInfo *rinfo, RAI_DagOp *currentOp, int *progress) {
Collaborator

document me

src/dag.c Outdated
return NULL;
}

void *RedisAI_DagRunSession_TensorGet_Step(RedisAI_RunInfo *rinfo, RAI_DagOp *currentOp, int *progress) {
Collaborator

document me

src/dag.c Outdated
return NULL;
}

void *RedisAI_DagRunSession_ModelRun_Step(RedisAI_RunInfo *rinfo, RAI_DagOp *currentOp, int *progress) {
Collaborator

document me

src/dag.c Outdated
return NULL;
}

void *RedisAI_DagRunSession_ScriptRun_Step(RedisAI_RunInfo *rinfo, RAI_DagOp *currentOp, int *progress) {
Collaborator

document me

RAI_Tensor *inputTensor;
const int get_result = RAI_getTensorFromLocalContext(
NULL, rinfo->dagTensorsContext, RedisModule_StringPtrLen(currentOp->inkeys[i], NULL), &inputTensor, currentOp->err);
if (get_result == REDISMODULE_ERR) {
Collaborator

Add testcase for it. coverage report pointing this line as not covered

src/dag.c Outdated
}

for (int i=0; i<array_len(currentOp->outkeys); i++) {
if (!RAI_ScriptRunCtxAddOutput(currentOp->sctx)) {
Collaborator

Add testcase for it. coverage report pointing this line as not covered

src/dag.c Outdated

if (*rinfo->dagError == 1 && rinfo->client != NULL) {
RedisModule_UnblockClient(rinfo->client, rinfo);
pthread_mutex_unlock(rinfo->dagMutex);
Collaborator

Add testcase for it. coverage report pointing this line as not covered

struct RedisAI_RunStats *rstats = NULL;
const char *runkey = RedisModule_StringPtrLen(currentOp->runkey, NULL);
RAI_GetRunStats(runkey,&rstats);
if (currentOp->result == REDISMODULE_ERR) {
Collaborator

Add testcase for it. coverage report pointing this line as not covered

queueUnpop(run_queue_info->run_queue, evicted_rinfo);
}
else {
if (queueLength(run_queue_info->run_queue) > 0) {
Collaborator

testcase for it. codecov report showcasing we're not covering this

@lantiga
Contributor Author

lantiga commented Jul 28, 2020

During coverage increase I unearthed a race condition that can lead to a crash.

There's a portion of a test in test/tests_dag.py, function test_dagrun_modelrun_multidevice_resnet_ensemble_alias, that has been commented out.

What happens is the following:

model1 -> script (with error)
model2 ---> 

The script errors out and unblocks the client, but model2 is still running. We should indeed only unblock when all threads are done processing. This already happens under normal circumstances, but when errors are generated in between, with certain DAG topologies this is not explicitly checked for. We need to take care of this prior to merging the PR.

/cc @DvirDukhan @filipecosta90

@lantiga
Contributor Author

lantiga commented Jul 31, 2020

The edge case has now been fixed. Unblocking is now controlled through ref counting.

Collaborator
@filipecosta90 filipecosta90 left a comment

LGTM. Let's merge it and address the pending issue with this already in place.
