Skip to content

MIG devices not found after applying configuration #97

@jungsdao

Description

@jungsdao

I'm using nvidia-mig-parted version 0.8.0 and I have two nvidia A100 80GB PCIe GPUs in my node.
This is my config.yaml file which I have applied.

version: v1
mig-configs:
  all-disabled:
    - devices: all
      mig-enabled: false

  all-enabled:
    - devices: all
      mig-enabled: true
      mig-devices: {}

  all-1g.10gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7

  all-2g.20gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "2g.20gb": 3

  all-3g.40gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.40gb": 2

  all-4g.40gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "4g.40gb": 1

  all-7g.80gb:
    - devices: all
      mig-enabled: true
      mig-devices:
        "7g.80gb": 1

  custom-config:
    - devices: [0,1,2,3]
      mig-enabled: false
    - devices: [4]
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7
    - devices: [5]
      mig-enabled: true
      mig-devices:
        "2g.20gb": 3
    - devices: [6]
      mig-enabled: true
      mig-devices:
        "3g.40gb": 2
    - devices: [7]
      mig-enabled: true
      mig-devices:
        "1g.10gb": 2
        "2g.20gb": 1
        "3g.40gb": 1

Runnfing this command gives following output probably meaning all-1g.10gb profile has been selected: sudo nvidia-mig-parted export

2024/07/18 13:41:59 WARNING: unable to get device name: [failed to find device with id '20b5']
2024/07/18 13:41:59 WARNING: unable to get device name: [failed to find device with id '20b5']
version: v1
mig-configs:
  current:
  - devices: all
    mig-enabled: true
    mig-devices:
      1g.10gb: 7

but when I run nvidia-smi, I'm having following output with no MIG devices found.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.223.02   Driver Version: 470.223.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:00:08.0 Off |                   On |
| N/A   40C    P0    71W / 300W |     45MiB / 80994MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:00:0B.0 Off |                   On |
| N/A   41C    P0    65W / 300W |     45MiB / 80994MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Also running sudo nvidia-smi mig -lgip gives following

+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.10gb       19     0/7        9.50       No     14     0     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.10gb+me    20     0/1        9.50       No     14     1     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.20gb       14     0/3        19.50      No     28     1     0   |
|                                                             2     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 3g.40gb        9     0/2        39.25      No     42     2     0   |
|                                                             3     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.40gb        5     0/1        39.25      No     56     2     0   |
|                                                             4     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 7g.79gb        0     0/1        78.75      No     98     5     0   |
|                                                             7     1     1   |
+-----------------------------------------------------------------------------+
|   1  MIG 1g.10gb       19     0/7        9.50       No     14     0     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   1  MIG 1g.10gb+me    20     0/1        9.50       No     14     1     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   1  MIG 2g.20gb       14     0/3        19.50      No     28     1     0   |
|                                                             2     0     0   |
+-----------------------------------------------------------------------------+
|   1  MIG 3g.40gb        9     0/2        39.25      No     42     2     0   |
|                                                             3     0     0   |
+-----------------------------------------------------------------------------+
|   1  MIG 4g.40gb        5     0/1        39.25      No     56     2     0   |
|                                                             4     0     0   |
+-----------------------------------------------------------------------------+
|   1  MIG 7g.79gb        0     0/1        78.75      No     98     5     0   |
|                                                             7     1     1   |
+-----------------------------------------------------------------------------+

I wonder why MIG devices I expected were not created.
I'm getting following error when I try to create GPU instances.

sudo nvidia-smi mig -cgi 19,19,19,19,19,19,19

Unable to create a GPU instance on GPU  0 using profile 19: Insufficient Resources
Failed to create GPU instances: Insufficient Resources

Any help would be appreciated!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions