-
Notifications
You must be signed in to change notification settings - Fork 986
Description
Beginning with #14000, it was noticed that the ParquetWriterTest.ManyFragments unit test would fail at random times. Rerunning the failing test would clear the error. But after that PR was merged, the error began to appear in other PRs as that change was merged in. Running the test through racecheck yielded several reported problems.
========= Error: Race reported between Write access at 0x1d10 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/build/_deps/thrust-src/dependencies/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:146:unsigned int cub::CUB_101702_860_NS::BlockReduceWarpReductions<unsigned int, (int)512, (int)1, (int)1, (int)860>::ApplyWarpAggregates<(bool)1, cub::CUB_101702_860_NS::Sum>(T2, unsigned int, int)
========= and Read access at 0x1b90 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/build/_deps/thrust-src/dependencies/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:117:unsigned int cub::CUB_101702_860_NS::BlockReduceWarpReductions<unsigned int, (int)512, (int)1, (int)1, (int)860>::ApplyWarpAggregates<(bool)1, cub::CUB_101702_860_NS::Sum, (int)1>(T2, unsigned int, int, cub::CUB_101702_860_NS::Int2Type<T3>) [240 hazards]
========= and Read access at 0x1ba0 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/build/_deps/thrust-src/dependencies/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:117:unsigned int cub::CUB_101702_860_NS::BlockReduceWarpReductions<unsigned int, (int)512, (int)1, (int)1, (int)860>::ApplyWarpAggregates<(bool)1, cub::CUB_101702_860_NS::Sum, (int)2>(T2, unsigned int, int, cub::CUB_101702_860_NS::Int2Type<T3>) [320 hazards]
========= and Read access at 0x1bb0 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/build/_deps/thrust-src/dependencies/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:117:unsigned int cub::CUB_101702_860_NS::BlockReduceWarpReductions<unsigned int, (int)512, (int)1, (int)1, (int)860>::ApplyWarpAggregates<(bool)1, cub::CUB_101702_860_NS::Sum, (int)5>(T2, unsigned int, int, cub::CUB_101702_860_NS::Int2Type<T3>) [320 hazards]
========= and Read access at 0x1bc0 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/build/_deps/thrust-src/dependencies/cub/cub/block/specializations/block_reduce_warp_reductions.cuh:117:unsigned int cub::CUB_101702_860_NS::BlockReduceWarpReductions<unsigned int, (int)512, (int)1, (int)1, (int)860>::ApplyWarpAggregates<(bool)1, cub::CUB_101702_860_NS::Sum, (int)9>(T2, unsigned int, int, cub::CUB_101702_860_NS::Int2Type<T3>) [320 hazards]
and
========= Error: Race reported between Read access at 0x68b0 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/src/io/parquet/page_data.cu:513:void cudf::io::parquet::detail::<unnamed>::gpuDecodePageData<(int)256, unsigned char>(cudf::io::parquet::detail::PageInfo *, cudf::device_span<const cudf::io::parquet::detail::ColumnChunkDesc, (unsigned long)18446744073709551615>, unsigned long, unsigned long, unsigned int *)
========= and Write access at 0x17440 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/src/io/parquet/page_decode.cuh:766:void cudf::io::parquet::detail::gpuUpdateValidityOffsetsAndRowIndices<unsigned char, cudf::io::parquet::detail::page_state_buffers_s<(int)256, (int)256, (int)256>, (int)256>(int, cudf::io::parquet::detail::page_state_s *, T2 *, const T1 *, const T1 *, int) [11266560 hazards]
=========
========= Warning: Race reported between Read access at 0x156c0 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/src/io/parquet/page_decode.cuh:476:void cudf::io::parquet::detail::gpuDecodeStream<unsigned char, (int)256>(T1 *, cudf::io::parquet::detail::page_state_s *, int, int, cudf::io::parquet::detail::level_type)
========= and Write access at 0x16d10 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/src/io/parquet/page_decode.cuh:555:void cudf::io::parquet::detail::gpuDecodeStream<unsigned char, (int)256>(T1 *, cudf::io::parquet::detail::page_state_s *, int, int, cudf::io::parquet::detail::level_type) [5120 hazards]
=========
========= Warning: Race reported between Read access at 0x15680 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/src/io/parquet/page_decode.cuh:482:void cudf::io::parquet::detail::gpuDecodeStream<unsigned char, (int)256>(T1 *, cudf::io::parquet::detail::page_state_s *, int, int, cudf::io::parquet::detail::level_type)
========= and Write access at 0x16d00 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/src/io/parquet/page_decode.cuh:558:void cudf::io::parquet::detail::gpuDecodeStream<unsigned char, (int)256>(T1 *, cudf::io::parquet::detail::page_state_s *, int, int, cudf::io::parquet::detail::level_type) [2560 hazards]
=========
========= Warning: Race reported between Read access at 0x156a0 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/src/io/parquet/page_decode.cuh:478:void cudf::io::parquet::detail::gpuDecodeStream<unsigned char, (int)256>(T1 *, cudf::io::parquet::detail::page_state_s *, int, int, cudf::io::parquet::detail::level_type)
========= and Write access at 0x16cc0 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/src/io/parquet/page_decode.cuh:556:void cudf::io::parquet::detail::gpuDecodeStream<unsigned char, (int)256>(T1 *, cudf::io::parquet::detail::page_state_s *, int, int, cudf::io::parquet::detail::level_type) [2560 hazards]
=========
========= Warning: Race reported between Read access at 0x156b0 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/src/io/parquet/page_decode.cuh:479:void cudf::io::parquet::detail::gpuDecodeStream<unsigned char, (int)256>(T1 *, cudf::io::parquet::detail::page_state_s *, int, int, cudf::io::parquet::detail::level_type)
========= and Write access at 0x16ce0 in /home/seidl2/src/cudf_branches/cudf_strings/cpp/src/io/parquet/page_decode.cuh:557:void cudf::io::parquet::detail::gpuDecodeStream<unsigned char, (int)256>(T1 *, cudf::io::parquet::detail::page_state_s *, int, int, cudf::io::parquet::detail::level_type) [2560 hazards]
The former error implicates a change made in #14000 to the page fragment size calculation. Adding a __syncthreads() call between two invocations of cub::BlockReduce.Sum fixes it. The latter error (page_data.cuh:513) is related to the page_state_buffers_s.nz_idx array. This is likely benign as one warp is writing to one part of the array while another warp is reading a different region, and this code has not been changed in quite some time. The remaining warnings in page_decode.cuh are due to a race in gpuDecodeStream in which thread 0 might modify shared values before all threads have read them.
One additional datapoint, however, is that @vuule was able to reproduce the error locally and save the parquet file that was the product of the failed test. Inspection of the file found no errors, implicating the read path as the source of the error. However, the fragment size bug above is on the write path, so there may be more to the story.