@@ -71,117 +71,15 @@ Note for instructors: use `run*.sh` scripts to generate all the output files.
7171 export NVCOMPILER_ACC_NOTIFY=$((0x1 | 0x2))
7272 srun -p gputest --nodes=1 --ntasks-per-node=1 --cpus-per-task=4 --gres=gpu:a100:1 -t 0:10:00 ./axpy.x
7373
74- Output from C version:
75-
76- upload CUDA data file=.../axpy.c function=main line=27 device=0 threadid=1 variable=y[:] bytes=819200
77- upload CUDA data file=.../axpy.c function=main line=27 device=0 threadid=1 variable=x[:] bytes=819200
78- launch CUDA kernel file=.../axpy.c function=main line=27 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_main_F1L27_2 grid=<<<800,1,1>>> block=<<<128,1,1>>> shmem=0b
79- download CUDA data file=.../axpy.c function=main line=33 device=0 threadid=1 variable=x[:] bytes=819200
80- download CUDA data file=.../axpy.c function=main line=33 device=0 threadid=1 variable=y[:] bytes=819200
81- Using N = 102400
82- Input:
83- a = 3.0000
84- x = 0.0000 0.0000 0.0000 0.0000 ... 1.0000 1.0000 1.0000 1.0000
85- y = 0.0000 0.0010 0.0020 0.0029 ... 99.9971 99.9980 99.9990 100.0000
86- Output:
87- y = 0.0000 0.0010 0.0020 0.0030 ... 102.9970 102.9980 102.9990 103.0000
88-
89- Output from Fortran version:
90-
91- upload CUDA data file=.../axpy.F90 function=axpy line=31 device=0 threadid=1 variable=descriptor bytes=128
92- upload CUDA data file=.../axpy.F90 function=axpy line=31 device=0 threadid=1 variable=y(:) bytes=819200
93- upload CUDA data file=.../axpy.F90 function=axpy line=31 device=0 threadid=1 variable=descriptor bytes=128
94- upload CUDA data file=.../axpy.F90 function=axpy line=31 device=0 threadid=1 variable=x(:) bytes=819200
95- launch CUDA kernel file=.../axpy.F90 function=axpy line=31 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L31_2_ grid=<<<800,1,1>>> block=<<<128,1,1>>> shmem=0b
96- download CUDA data file=.../axpy.F90 function=axpy line=35 device=0 threadid=1 variable=x(:) bytes=819200
97- download CUDA data file=.../axpy.F90 function=axpy line=35 device=0 threadid=1 variable=y(:) bytes=819200
98- Using N = 102400
99- Input:
100- a = 3.0000
101- x = 0.0000 0.0000 0.0000 0.0000 ... 1.0000 1.0000 1.0000 1.0000
102- y = 0.0000 0.0010 0.0020 0.0029 ... 99.9971 99.9980 99.9990 100.0000
103- Output:
104- y = 0.0000 0.0010 0.0020 0.0030 ... 102.9970 102.9980 102.9990 103.0000
74+ Outputs are in `axpy_{c,f}_nv_debug.out` for C and Fortran, respectively.
10575
10676 We can read the memory transfers and kernel execution with 800 blocks and 128 threads per block from this output.
10777
108- Now `NVCOMPILER_ACC_NOTIFY=$((0x1F))`.
109- Output from C version:
110-
111- Enter enter data construct file=.../axpy.c function=main line=27 device=0 threadid=1
112- create CUDA data bytes=819200 file=.../axpy.c function=main line=27 device=0 threadid=1
113- alloc CUDA data devaddr=0x7fff0b2fa000 bytes=819200 file=.../axpy.c function=main line=27 device=0 threadid=1
114- upload CUDA data file=.../axpy.c function=main line=27 device=0 threadid=1 variable=y[:] bytes=819200
115- create CUDA data bytes=819200 file=.../axpy.c function=main line=27 device=0 threadid=1
116- alloc CUDA data devaddr=0x7fff0b800000 bytes=819200 file=.../axpy.c function=main line=27 device=0 threadid=1
117- upload CUDA data file=.../axpy.c function=main line=27 device=0 threadid=1 variable=x[:] bytes=819200
118- Leave enter data .../axpy.c main:27 device=0 threadid=1
119- launch CUDA kernel file=.../axpy.c function=main line=27 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_main_F1L27_2 grid=<<<800,1,1>>> block=<<<128,1,1>>> shmem=0b
120- Enter exit data construct file=.../axpy.c function=main line=27 device=0 threadid=1
121- download CUDA data file=.../axpy.c function=main line=33 device=0 threadid=1 variable=x[:] bytes=819200
122- delete CUDA data devaddr=0x7fff0b800000 bytes=819200 file=.../axpy.c function=main line=33 device=0 threadid=1
123- download CUDA data file=.../axpy.c function=main line=33 device=0 threadid=1 variable=y[:] bytes=819200
124- delete CUDA data devaddr=0x7fff0b2fa000 bytes=819200 file=.../axpy.c function=main line=33 device=0 threadid=1
125- Implicit wait file=.../axpy.c function=main line=33 device=0 threadid=1 queue=acc_async_sync
126- Leave exit data .../axpy.c main:33 device=0 threadid=1
127- Using N = 102400
128- Input:
129- a = 3.0000
130- x = 0.0000 0.0000 0.0000 0.0000 ... 1.0000 1.0000 1.0000 1.0000
131- y = 0.0000 0.0010 0.0020 0.0029 ... 99.9971 99.9980 99.9990 100.0000
132- Output:
133- y = 0.0000 0.0010 0.0020 0.0030 ... 102.9970 102.9980 102.9990 103.0000
134-
135- Output from Fortran version:
136-
137- Enter enter data construct file=.../axpy.F90 function=axpy line=31 device=0 threadid=1
138- create CUDA data bytes=819200 file=.../axpy.F90 function=axpy line=31 device=0 threadid=1
139- alloc CUDA data devaddr=0x7fff0b2fa000 bytes=819200 file=.../axpy.F90 function=axpy line=31 device=0 threadid=1
140- create CUDA data bytes=128 file=.../axpy.F90 function=axpy line=31 device=0 threadid=1
141- alloc CUDA data devaddr=0x7fff0b3c2000 bytes=512 file=.../axpy.F90 function=axpy line=31 device=0 threadid=1
142- upload CUDA data file=.../axpy.F90 function=axpy line=31 device=0 threadid=1 variable=descriptor bytes=128
143- upload CUDA data file=.../axpy.F90 function=axpy line=31 device=0 threadid=1 variable=y(:) bytes=819200
144- create CUDA data bytes=819200 file=.../axpy.F90 function=axpy line=31 device=0 threadid=1
145- alloc CUDA data devaddr=0x7fff0b800000 bytes=819200 file=.../axpy.F90 function=axpy line=31 device=0 threadid=1
146- create CUDA data bytes=128 file=.../axpy.F90 function=axpy line=31 device=0 threadid=1
147- alloc CUDA data devaddr=0x7fff0b3c2200 bytes=512 file=.../axpy.F90 function=axpy line=31 device=0 threadid=1
148- upload CUDA data file=.../axpy.F90 function=axpy line=31 device=0 threadid=1 variable=descriptor bytes=128
149- upload CUDA data file=.../axpy.F90 function=axpy line=31 device=0 threadid=1 variable=x(:) bytes=819200
150- Leave enter data .../axpy.F90 axpy:31 device=0 threadid=1
151- launch CUDA kernel file=.../axpy.F90 function=axpy line=31 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L31_2_ grid=<<<800,1,1>>> block=<<<128,1,1>>> shmem=0b
152- Enter exit data construct file=.../axpy.F90 function=axpy line=31 device=0 threadid=1
153- download CUDA data file=.../axpy.F90 function=axpy line=35 device=0 threadid=1 variable=x(:) bytes=819200
154- delete CUDA data devaddr=0x7fff0b800000 bytes=819200 file=.../axpy.F90 function=axpy line=35 device=0 threadid=1
155- delete CUDA data devaddr=0x7fff0b3c2200 bytes=512 file=.../axpy.F90 function=axpy line=35 device=0 threadid=1
156- download CUDA data file=.../axpy.F90 function=axpy line=35 device=0 threadid=1 variable=y(:) bytes=819200
157- delete CUDA data devaddr=0x7fff0b2fa000 bytes=819200 file=.../axpy.F90 function=axpy line=35 device=0 threadid=1
158- delete CUDA data devaddr=0x7fff0b3c2000 bytes=512 file=.../axpy.F90 function=axpy line=35 device=0 threadid=1
159- Implicit wait file=.../axpy.F90 function=axpy line=35 device=0 threadid=1 queue=acc_async_sync
160- Leave exit data .../axpy.F90 axpy:35 device=0 threadid=1
161- Using N = 102400
162- Input:
163- a = 3.0000
164- x = 0.0000 0.0000 0.0000 0.0000 ... 1.0000 1.0000 1.0000 1.0000
165- y = 0.0000 0.0010 0.0020 0.0029 ... 99.9971 99.9980 99.9990 100.0000
166- Output:
167- y = 0.0000 0.0010 0.0020 0.0030 ... 102.9970 102.9980 102.9990 103.0000
78+ For `NVCOMPILER_ACC_NOTIFY=$((0x1F))`, outputs are in `axpy_{c,f}_nv_debug_max.out`.
16879
16980 Compiling with diagnostics:
17081
17182 nvc -O3 -mp=gpu -gpu=cc80 -Minfo=mp axpy.c -o axpy.x
17283 nvfortran -O3 -mp=gpu -gpu=cc80 -Minfo=mp helper_functions.F90 axpy.F90 -o axpy.x
17384
174- Output for C:
175-
176- main:
177- 27, #omp target teams distribute parallel for
178- 27, Generating "nvkernel_main_F1L27_2" GPU kernel
179- 31, Loop parallelized across teams and threads(128), schedule(static)
180- 27, Generating implicit map(tofrom:y[:],x[:])
181-
182- Output for Fortran:
183-
184- axpy:
185- 31, !$omp target teams distribute parallel do
186- 31, Generating "nvkernel_MAIN__F1L31_2" GPU kernel
187- 31, Generating implicit map(tofrom:y(:),x(:))
85+ Outputs are in `axpy_{c,f}_nv_info.out`.