Hongyang Lin, 2020533180
The benchmarks were run on the following machine:
- Supermicro AS-4124GS-TNR
- AMD EPYC 7742 64-Core Processor @ 3.2GHz (64 Cores / 64 Threads) * 2
- 1TB ECC memory
- Ubuntu 20.04.3 5.15.0-52-generic
- built with gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
- cmake version 3.16.3
- -O3 optimization.
Run the following commands to compile and run the program:

```bash
sudo apt install -y build-essential git libomp-dev cmake
cd $PROJECT_DIRECTORY
mkdir build
cd build
cmake ../
make
./main <input_file> <output_file> <repeat_times>
```

First we need to settle on a data structure to store the graph. I initially used a chained adjacency table, but found that jumping all over memory causes a poor cache hit rate. After looking into several alternatives, including the sparse matrix formats discussed in the lecture, I found that the Compressed Sparse Row (CSR) format solves this problem with high memory locality. The code can be found in src/graph.cpp.
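For illustration, a minimal CSR layout could look like the following sketch (the struct and field names here are hypothetical; the actual definitions are in src/graph.cpp):

```cpp
#include <cstdint>
#include <vector>

// Minimal CSR graph sketch: the neighbours of vertex v are
// col_idx[row_ptr[v] .. row_ptr[v + 1]), stored contiguously,
// so scanning an edge list touches consecutive memory.
struct CSRGraph {
    std::vector<int64_t> row_ptr;  // size: num_vertices + 1
    std::vector<int64_t> col_idx;  // size: num_edges

    int64_t num_vertices() const { return (int64_t)row_ptr.size() - 1; }
    int64_t degree(int64_t v) const { return row_ptr[v + 1] - row_ptr[v]; }
};
```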
First I read several papers on ways to parallelize BFS. https://crd.lbl.gov/assets/pubs_presos/mtaapbottomup2D.pdf presents a parallel 2D top-down BFS algorithm and a parallel 2D bottom-up BFS algorithm, both of which I found efficient to parallelize with OpenMP.
The top-down approach is very similar to the traditional one: we maintain a frontier queue, place each newly discovered node in a buffer, and append the buffer to the queue once the current node has finished expanding. The bottom-up approach instead scans every unvisited vertex, checks whether any of its neighbours is in the current frontier, and if so assigns it a parent and adds it to the next level before moving on to the next vertex.
These two methods address each other's shortcomings, but each still has its own advantages and disadvantages. I referenced the implementation of the GAP benchmark suite in https://github.com/sbeamer/gapbs/blob/master/src/bfs.cc, which mixes the top-down and bottom-up approaches into a hybrid (direction-optimizing) BFS to get better performance.
The code can be found in src/bfs.cpp.
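As a rough illustration of the two kernels (not the project's actual code: the function names, the CSRGraph sketch from above, a parent array initialised to -1, and a simple byte vector for frontier membership are all assumptions here; the real implementation uses the atomic bitmap and sliding queue described below):

```cpp
#include <cstdint>
#include <vector>

// Top-down step: expand every frontier vertex; newly discovered vertices
// are collected in a per-thread buffer and merged into the next frontier.
std::vector<int64_t> top_down_step(const CSRGraph &g,
                                   const std::vector<int64_t> &frontier,
                                   std::vector<int64_t> &parent) {
    std::vector<int64_t> next;
    #pragma omp parallel
    {
        std::vector<int64_t> local;              // thread-private buffer
        #pragma omp for nowait
        for (size_t i = 0; i < frontier.size(); i++) {
            int64_t u = frontier[i];
            for (int64_t e = g.row_ptr[u]; e < g.row_ptr[u + 1]; e++) {
                int64_t v = g.col_idx[e];
                int64_t expect = parent[v];
                // claim v atomically so exactly one thread becomes its parent
                if (expect < 0 &&
                    __sync_bool_compare_and_swap(&parent[v], expect, u))
                    local.push_back(v);
            }
        }
        #pragma omp critical
        next.insert(next.end(), local.begin(), local.end());
    }
    return next;
}

// Bottom-up step: every unvisited vertex scans its neighbours and adopts
// the first one found in the current frontier as its parent.
int64_t bottom_up_step(const CSRGraph &g,
                       const std::vector<uint8_t> &in_frontier,
                       std::vector<uint8_t> &next_frontier,
                       std::vector<int64_t> &parent) {
    int64_t awake = 0;
    #pragma omp parallel for reduction(+ : awake)
    for (int64_t v = 0; v < g.num_vertices(); v++) {
        if (parent[v] >= 0) continue;            // already visited
        for (int64_t e = g.row_ptr[v]; e < g.row_ptr[v + 1]; e++) {
            int64_t u = g.col_idx[e];
            if (in_frontier[u]) {                // a neighbour is in the frontier
                parent[v] = u;                   // no CAS: only this thread writes v
                next_frontier[v] = 1;
                awake++;
                break;
            }
        }
    }
    return awake;
}
```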
For data that needs to be accessed by multiple threads simultaneously, atomic operations avoid the performance overhead of locking. I therefore use atomic built-ins such as __sync_bool_compare_and_swap and __sync_fetch_and_add extensively to reduce that overhead. Note that the compare-and-swap wrapper used in the code takes a reference rather than a pointer, while the underlying built-ins operate on pointers; they are documented at https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html.
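For example, a thin reference-taking wrapper over the built-ins might look like this (the wrapper names follow gapbs; this is only a sketch of the idea, not the project's exact code):

```cpp
// Thin wrappers over the GCC __sync built-ins. The built-ins take a pointer
// to the target variable; these helpers take a reference and pass its
// address, so call sites can simply write compare_and_swap(x, old, new).
template <typename T>
bool compare_and_swap(T &x, T old_val, T new_val) {
    return __sync_bool_compare_and_swap(&x, old_val, new_val);
}

template <typename T>
T fetch_and_add(T &x, T inc) {
    return __sync_fetch_and_add(&x, inc);
}
```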
To store the visited[] set in bottom-up BFS, we need a structure that holds it efficiently. Following the implementation in https://github.com/sbeamer/gapbs, I developed an atomic bitmap to check whether a node has been visited or not. The core code looks like this:
```cpp
// Atomically set bit `pos`: retry the CAS until our OR lands, so concurrent
// writers to the same 64-bit word never lose each other's updates.
void set_bit_atomic(size_t pos) {
    uint64_t old_val, new_val;
    do {
        old_val = start_[word_offset(pos)];
        new_val = old_val | ((uint64_t) 1l << bit_offset(pos));
    } while (!compare_and_swap(start_[word_offset(pos)], old_val, new_val));
}
```

The code can be found in src/bfs.cpp.
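The snippet relies on a few helpers that are not shown above. A minimal sketch of what the surrounding bitmap might look like (member names and the use of std::vector are assumptions; the project follows the gapbs layout):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class Bitmap {
public:
    explicit Bitmap(size_t num_bits) : start_((num_bits + 63) / 64, 0) {}

    // Which 64-bit word a bit lives in, and its position inside that word.
    static size_t word_offset(size_t pos) { return pos / 64; }
    static size_t bit_offset(size_t pos)  { return pos & 63; }

    // Non-atomic read of the containing word.
    bool get_bit(size_t pos) const {
        return (start_[word_offset(pos)] >> bit_offset(pos)) & 1l;
    }

    void set_bit_atomic(size_t pos);  // as shown above

private:
    std::vector<uint64_t> start_;
};
```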
Since each BFS level adds and reads a chunk of data in the queue, we use a sliding queue to store the frontier for better parallelism. Unlike a usual queue, it keeps two tails: one pointing to the end of the current level and one to the end of the whole queue.
To get better performance in the top-down approach, we can directly access the whole level as raw data; with OpenMP data parallelism we split it into chunks and process them in parallel. The code can be found in src/queue.cpp.
Because multiple threads writing to the queue at the same time hurts performance, I introduced a per-thread buffer that temporarily holds the data each thread produces. The buffer is flushed into the main queue when the scan finishes or the buffer is full; each flush copies the whole buffer at once, which significantly reduces contention on the queue push operation. A sketch of this design follows below; the code can be found in src/queue.cpp.
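A minimal sketch of this design, following the SlidingQueue / QueueBuffer idea in gapbs (the class names, the 16384 buffer size, and the reserve interface are illustrative, not the exact code in src/queue.cpp):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One contiguous array shared by all BFS levels. [out_start_, out_end_) is
// the current level; appends for the next level land at and after in_.
class SlidingQueue {
public:
    explicit SlidingQueue(size_t capacity)
        : data_(capacity), in_(0), out_start_(0), out_end_(0) {}

    const int64_t *begin() const { return data_.data() + out_start_; }
    const int64_t *end()   const { return data_.data() + out_end_; }
    bool empty() const { return out_start_ == out_end_; }

    // Slide the window: everything appended since the last slide becomes
    // the new current level.
    void slide_window() {
        out_start_ = out_end_;
        out_end_ = in_;
    }

    // Atomically reserve room for n items; returns the start index.
    size_t reserve(size_t n) { return __sync_fetch_and_add(&in_, n); }

    int64_t *raw(size_t idx) { return data_.data() + idx; }

private:
    std::vector<int64_t> data_;
    size_t in_, out_start_, out_end_;
};

// Per-thread staging buffer: push_back() is uncontended; flush() grabs a
// slot range from the shared queue once and copies the whole buffer into it.
class QueueBuffer {
public:
    explicit QueueBuffer(SlidingQueue &q, size_t max_size = 16384)
        : q_(q), max_size_(max_size) { local_.reserve(max_size); }

    void push_back(int64_t v) {
        if (local_.size() == max_size_) flush();   // buffer full: drain it
        local_.push_back(v);
    }

    void flush() {
        size_t start = q_.reserve(local_.size());
        std::copy(local_.begin(), local_.end(), q_.raw(start));
        local_.clear();
    }

private:
    SlidingQueue &q_;
    size_t max_size_;
    std::vector<int64_t> local_;
};
```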
A heuristic can be used to switch between bottom-up and top-down BFS to get the best of both worlds. We follow the authors' suggestions for parameter tuning. Note that some of the results might be inaccurate: the graphs may be directed, and the hybrid method can report different parents than a pure top-down traversal would.
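A hedged sketch of the switching condition, following the direction-optimizing BFS in gapbs (alpha = 15 and beta = 18 are the gapbs defaults; the counter names here are illustrative):

```cpp
#include <cstdint>

// Direction-switching thresholds, as suggested by gapbs.
const int64_t kAlpha = 15;
const int64_t kBeta = 18;

// scout_count:    edges hanging off the current frontier
// edges_to_check: edges still touching unvisited vertices
// Switch to bottom-up once the frontier covers a large share of the
// remaining edges; scanning all unvisited vertices is then cheaper
// than expanding every frontier edge.
bool switch_to_bottom_up(int64_t scout_count, int64_t edges_to_check) {
    return scout_count > edges_to_check / kAlpha;
}

// frontier_size: vertices in the current frontier
// num_vertices:  total vertices in the graph
// Switch back to top-down once the frontier has shrunk again.
bool switch_to_top_down(int64_t frontier_size, int64_t num_vertices) {
    return frontier_size < num_vertices / kBeta;
}
```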
I also tested several other parallel BFS implementations and benefited a lot from the open-source repo https://github.com/onesuper/bfs_in_parallel. One observation is that reducing the number of locks concentrates contention on the ones that remain; since our platform has two sockets and 8 NUMA nodes in total, heavy contention causes heavy traffic on the cross-socket interconnect, and the resulting latency can reduce performance.
I also referenced https://github.com/orancanoren/GPU-CPU-Parallel-Graph-Algorithms.
We found that the current algorithm still has considerable overhead on small-scale inputs; as shown below, web-Stanford has the worst speedup.
I tested the algorithm with {1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128} threads. Each configuration is run multiple times to reduce measurement error. The raw results are placed in the ./output directory.
| TIME | 1 (baseline) | 2 | 4 | 8 | 16 | 24 | 32 | 48 | 64 | 96 | 128 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| soc-LiveJournal1 | 2413 | 2548 | 1494 | 920 | 995 | 751 | 597 | 302 | 234 | 177 | 160 |
| roadNet-CA | 199 | 634 | 419 | 296 | 252 | 181 | 259 | 206 | 117 | 106 | 134 |
| rmat1 | 1E+05 | 64580 | 37995 | 17920 | 13249 | 14690 | 11048 | 9015 | 8178 | 8008 | 4926 |
| web-Stanford | 34 | 52 | 41 | 38 | 28 | 21 | 17 | 14 | 13 | 14 | 24 |
| com-orkut.ungraph | 1796 | 1449 | 838 | 734 | 782 | 534 | 606 | 379 | 331 | 256 | 249 |

| SPEEDUP RATE | 1 (baseline) | 2 | 4 | 8 | 16 | 24 | 32 | 48 | 64 | 96 | 128 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| soc-LiveJournal1 | 1 | 0.947 | 1.615 | 2.623 | 2.425 | 3.213 | 4.042 | 7.99 | 10.31 | 13.63 | 15.08 |
| roadNet-CA | 1 | 0.314 | 0.475 | 0.672 | 0.79 | 1.099 | 0.768 | 0.966 | 1.701 | 1.877 | 1.485 |
| rmat1 | 1 | 1.749 | 2.973 | 6.304 | 8.527 | 7.69 | 10.23 | 12.53 | 13.81 | 14.11 | 22.93 |
| web-Stanford | 1 | 0.654 | 0.829 | 0.895 | 1.214 | 1.619 | 2 | 2.429 | 2.615 | 2.429 | 1.417 |
| com-orkut.ungraph | 1 | 1.239 | 2.143 | 2.447 | 2.297 | 3.363 | 2.964 | 4.739 | 5.426 | 7.016 | 7.213 |
