Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support - starting with sample_get_x?)

Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support). This is just a placeholder for now.

This seems to be a clear bottleneck in DY+3jets, see #943.
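
For illustration only, here is a minimal sketch of what "vectorised" sampling could mean in practice: random numbers are processed for a whole batch of events at once, with one contiguous array per quantity. This is not the actual x_to_f_arg logic. The function name `sampleFlatBatch`, its parameters and the flat mapping are all hypothetical, chosen only to show the data layout and loop shape.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical batched mapping (NOT the real x_to_f_arg algorithm):
// maps uniform random numbers x[ievt] in [0,1) to a variable s[ievt] in
// [smin,smax) and stores the Jacobian of this flat mapping for each event.
void sampleFlatBatch( const std::vector<double>& x,
                      double smin,
                      double smax,
                      std::vector<double>& s,
                      std::vector<double>& jac )
{
  assert( smin < smax );
  const std::size_t nevt = x.size();
  s.resize( nevt );
  jac.resize( nevt );
  const double range = smax - smin;
  // One contiguous array per quantity and no branches in the loop body:
  // a compiler can auto-vectorise this loop (SIMD), and the same body
  // could be executed by one GPU thread per event.
  for( std::size_t ievt = 0; ievt < nevt; ++ievt )
  {
    s[ievt] = smin + range * x[ievt];
    jac[ievt] = range;
  }
}
```

The point of the sketch is only the per-event loop over contiguous arrays. The real port would presumably need the same structure for each step of the phase space mapping.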