[User] CUDA is broken #1756

@howard0su

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Current Behavior

It crashes when I enable two GPU cards, and produces garbage output when I enable one GPU card.

Two Cards:

PS C:\gpt\llama.cpp> .\build\bin\RelWithDebInfo\main.exe -m ..\en-models\7B\ggml-alpaca-7b-q4.bin -p "what is cuda?" -ngl 40
main: build = 635 (5c64a09)
main: seed  = 1686202333
ggml_init_cublas: found 2 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB
  Device 1: NVIDIA GeForce GTX 1070
llama.cpp: loading model from ..\en-models\7B\ggml-alpaca-7b-q4.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Tesla P100-PCIE-16GB) as main device
llama_model_load_internal: mem required  = 1932.71 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
llama_model_load_internal: total VRAM used: 3987 MB
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 what is cuda?CUDA error 9 at C:\GPT\llama.cpp\ggml-cuda.cu:1574: invalid configuration argument
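For reference (my own note, not part of the log): CUDA error 9 is cudaErrorInvalidConfiguration, which the runtime returns when a kernel is launched with an out-of-range grid or block configuration, for example a zero-sized grid dimension or a block larger than the device allows. The standalone sketch below is hypothetical and is not code from ggml-cuda.cu; it only shows how that error is produced and how it surfaces through cudaGetLastError(), which is presumably how the check at ggml-cuda.cu:1574 reports it.

```cuda
// Standalone sketch: producing and detecting cudaErrorInvalidConfiguration
// (error 9, "invalid configuration argument"). Not llama.cpp code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop_kernel() {}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // A launch whose block size exceeds the device limit (or whose grid
    // dimension is zero) is rejected by the runtime with error 9.
    dim3 block(prop.maxThreadsPerBlock + 1);  // deliberately one thread too many
    noop_kernel<<<1, block>>>();

    // Launch-configuration errors only surface through cudaGetLastError();
    // an error-check macro after the launch would see error 9 here.
    cudaError_t err = cudaGetLastError();
    printf("launch status: %d (%s)\n", (int)err, cudaGetErrorString(err));
    return 0;
}
```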
