Vectorise phase space sampling: port x_to_f_arg to cudacpp with SIMD and GPU support. This is just a placeholder for now. Phase space sampling appears to be a bottleneck in DY+3jets, see #943.
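
As a rough illustration of what "vectorised" sampling could look like, here is a minimal sketch that maps a whole batch of random numbers to a kinematic variable in one tight loop over contiguous arrays, which a compiler can auto-vectorise. It does not reproduce the actual x_to_f_arg logic: the function name `sampleInverseBatch`, the 1/s mapping and the array layout are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical illustration only: this is NOT the actual x_to_f_arg logic.
// Maps uniform random numbers x to s in [smin, smax] with density ~ 1/s,
// i.e. s = smin * (smax/smin)^x, and returns the Jacobian ds/dx per event.
void sampleInverseBatch( const std::vector<double>& x,
                         double smin,
                         double smax,
                         std::vector<double>& s,
                         std::vector<double>& jac )
{
  assert( smin > 0 && smax > smin );
  const std::size_t nevt = x.size();
  s.resize( nevt );
  jac.resize( nevt );
  const double logRatio = std::log( smax / smin ); // event-independent, hoisted out of the loop
  // One iteration per event, no branches and no cross-event dependencies:
  // the same loop body could become a SIMD lane or a GPU thread.
  for( std::size_t ievt = 0; ievt < nevt; ++ievt )
  {
    s[ievt] = smin * std::exp( x[ievt] * logRatio );
    jac[ievt] = s[ievt] * logRatio; // ds/dx
  }
}
```

The point of the sketch is only the structure: event-independent quantities are computed once, and the per-event work is a branch-free loop over arrays indexed by event.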