
Conversation

@mikepurvis
Contributor

As discussed in #169.

It's a nuisance having to put the wrapper/helper class up at module level, but multiprocessing can only pickle top-level functions and objects, so I don't see an easy way around this. The class could potentially be avoided by defining separate helpers that always capture / never capture warnings, but that's a matter of taste.
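A minimal sketch of the constraint being described, using the `_PackageParser` name from the diff below (its internals here are guessed for illustration): `Pool.map()` pickles its callable by reference, and that only works for objects importable at module scope, so a closure or local class inside `find_packages` would fail.

```python
import multiprocessing

from catkin_pkg.package import parse_package

class _PackageParser(object):
    # Must live at module level: Pool.map() pickles this callable by
    # reference, which fails for closures and locally defined classes.
    def __init__(self, capture_warnings):
        self.capture_warnings = capture_warnings

    def __call__(self, path):
        warnings = [] if self.capture_warnings else None
        package = parse_package(path, warnings=warnings)
        return path, package, warnings
```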

@mikepurvis mikepurvis force-pushed the parallel-pkg-parse branch from 33c4151 to d2488fb Compare May 10, 2017 18:36
```python
for path in package_paths:
    packages[path] = parse_package(os.path.join(basepath, path), warnings=warnings)
return packages

parser = _PackageParser(capture_warnings=isinstance(warnings, list))
```
Member

Should this check be changed to `capture_warnings=warnings is not None` in order to allow "duck typing"?

Contributor Author

Yeah, probably— will change.

```python
parser = _PackageParser(capture_warnings=isinstance(warnings, list))
results = multiprocessing.Pool(4).map(parser, package_paths)
if parser.capture_warnings:
    map(warnings.extend, [package_warnings for _, _, package_warnings in results])
```
Member

Maybe use results[2] to avoid the underscores?

Contributor Author

Yeah, can do. I find the tuple unpacking a bit more readable than explicit indexing, but it could go either way.

Contributor Author

Okay, I've dropped the comprehensions in favour of zipping; I think that's maybe a bit clearer.
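For illustration only, the zipped version might look something like this (the names match the diff above; the exact code landed in the updated commit, which isn't shown here):

```python
# Transpose the list of (path, package, warnings) tuples from the pool
# instead of picking fields out with per-field comprehensions.
paths, parsed_packages, warning_lists = zip(*results)
packages = dict(zip(paths, parsed_packages))
if parser.capture_warnings:
    for package_warnings in warning_lists:
        warnings.extend(package_warnings)
```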

```python
    packages[path] = parse_package(os.path.join(basepath, path), warnings=warnings)
return packages

parser = _PackageParser(capture_warnings=isinstance(warnings, list))
results = multiprocessing.Pool(4).map(parser, package_paths)
```
Member

Should we use something more "dynamic" like `multiprocessing.cpu_count()` as the default (and fall back to a hard-coded number like 4)?

Contributor Author

Looks like you can just leave off the number and it will fall back on cpu_count as the default. Will change (I found empirically that 4 worked well for me).
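Indeed, `multiprocessing.Pool()` with no argument (i.e. `processes=None`) sizes the pool from the machine's CPU count; `parser` and `package_paths` below are the names from the diff above:

```python
import multiprocessing

# processes defaults to None, meaning one worker per CPU (cpu_count()).
pool = multiprocessing.Pool()
results = pool.map(parser, package_paths)
pool.close()
pool.join()
```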

@mikepurvis mikepurvis force-pushed the parallel-pkg-parse branch from d2488fb to a65b88f Compare May 10, 2017 18:55
@mikepurvis mikepurvis force-pushed the parallel-pkg-parse branch from 03d2852 to 9a0641a Compare May 10, 2017 19:20
@dirk-thomas
Member

This looks good.

It might be a bit out of scope for this patch, but it would be great if the API allowed the same improvement for this use case (https://github.com/ros/rosdistro/blob/master/test/test_build_caches.py#L40-L41), which passes the XML rather than the paths.

@dirk-thomas
Member

I tried to apply the same idea to the parse_packages_by_strings case, but since there I already have all the XML in memory and only the parsing happens in parallel, it doesn't improve performance (it's actually about 10% slower).

I tried calling the find_packages_allowing_duplicates API locally, pointing it at different folders. For smaller numbers of packages (e.g. 100) the patched version was slower; for larger ones (e.g. 1000) it was certainly faster. Should we switch between the old and new behavior based on the number of packages, or just optimize for the large scale? @ros-infrastructure/ros_team Any opinions?
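Purely as a hypothetical sketch of the conditional variant (the threshold and the `_find_packages_serial` / `_find_packages_parallel` helper names are made up here; only `find_package_paths` is a real catkin_pkg function):

```python
PARALLEL_THRESHOLD = 100  # made-up cutoff; would need benchmarking to pick

def find_packages(basepath, warnings=None):
    package_paths = find_package_paths(basepath)
    if len(package_paths) < PARALLEL_THRESHOLD:
        # hypothetical helper wrapping the pre-patch serial loop
        return _find_packages_serial(basepath, package_paths, warnings)
    # hypothetical helper wrapping the Pool-based implementation
    return _find_packages_parallel(basepath, package_paths, warnings)
```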

@mikepurvis
Contributor Author

There's certainly a cost to summoning the other processes. I just wasn't sure if the cost was enough to justify switching between implementations, but happy to go either way on it.

@mikepurvis
Contributor Author

I don't really have the time/energy for it at the moment, but it would be interesting to study whether the time in this kind of situation is predominantly spent in syscalls, XML parsing, or Python itself. If it's either of the first two, then perhaps there are cases where the GIL isn't being properly released? If that were the case and it could be fixed, you'd definitely expect a threaded implementation to be the fastest, particularly since its setup cost is far smaller.
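Testing that hypothesis would be nearly a drop-in change, since `multiprocessing.pool.ThreadPool` exposes the same `map()` interface as `Pool` (a sketch, not code from this PR; `parser` and `package_paths` as in the diff above):

```python
from multiprocessing.pool import ThreadPool

# Threads are cheap to start and share memory, but they only beat
# processes here if the C code doing the XML work releases the GIL.
pool = ThreadPool()
results = pool.map(parser, package_paths)
pool.close()
pool.join()
```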

@mikepurvis
Contributor Author

Maybe there's a cleaner implementation here where multiprocessing is only used to get them into memory in parallel, and then they're parsed serially in a parse_package_strings function. That neatly sidesteps the matter of handling warnings properly, since that portion of things goes back to being serial.
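That split might look roughly like this (a sketch under assumptions: the `_read_package_xml` helper and the `parse_package_strings` signature are invented here, and `parse_package_string` is assumed to take a `warnings` argument like `parse_package` does):

```python
import multiprocessing
import os

from catkin_pkg.package import PACKAGE_MANIFEST_FILENAME, parse_package_string

def _read_package_xml(path):
    # Worker: pure file IO, no parsing, nothing tricky to pickle back.
    with open(os.path.join(path, PACKAGE_MANIFEST_FILENAME)) as f:
        return f.read()

def parse_package_strings(basepath, package_paths, warnings=None):
    pool = multiprocessing.Pool()
    xmls = pool.map(_read_package_xml,
                    [os.path.join(basepath, p) for p in package_paths])
    pool.close()
    # Serial parse: the shared warnings list never crosses a process boundary.
    packages = {}
    for path, xml in zip(package_paths, xmls):  # map() preserves order
        packages[path] = parse_package_string(xml, warnings=warnings)
    return packages
```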

@dirk-thomas
Member

I did some more benchmarking with a path containing ~1000 packages:

  • find_package_paths takes about 1/3 s
  • get_package_xml for all packages takes only 10-15 ms (independent of using / not using multiprocessing)
  • parse_package_string takes about 3/4 s without multiprocessing and about 1/4 s with it

Therefore I have created an updated PR which uses multiprocessing only for the xml parsing part: see #171.

With this implementation the performance with multiprocessing is only worse for fewer than 10 packages, and the overhead in that case is only around 10 ms, so I didn't implement any conditional logic to skip multiprocessing for small package counts.
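In outline, that approach would be something like the following (a hedged sketch, not the actual #171 code; it again assumes `parse_package_string` accepts a `warnings` list, and the `parse_packages` wrapper name is invented):

```python
import multiprocessing

from catkin_pkg.package import parse_package_string

def _parse_package_xml(path_and_xml):
    # Worker: parse one already-loaded manifest, collecting warnings locally.
    path, xml = path_and_xml
    warnings = []
    return path, parse_package_string(xml, warnings=warnings), warnings

def parse_packages(xmls, warnings=None):
    # xmls: list of (path, manifest_string) pairs, read serially since the
    # benchmark above shows IO is only 10-15 ms for ~1000 packages.
    pool = multiprocessing.Pool()
    results = pool.map(_parse_package_xml, xmls)
    pool.close()
    packages = {}
    for path, package, package_warnings in results:
        packages[path] = package
        if warnings is not None:
            warnings.extend(package_warnings)
    return packages
```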

@mikepurvis
Contributor Author

Interesting! I admittedly didn't dig in, but it definitely looked to me like the loading rather than the parsing was the more expensive piece. Thanks for verifying that it isn't.

@mikepurvis mikepurvis closed this May 17, 2017
@dirk-thomas
Member

Maybe it also depends on what kind of system you are using? Does the new PR show the same improvements for your use case as this one did?
