OSError suddenly occurs partway through training on 4.88 million images #1385

@2575044704

Description

When I was training a model on a 2x A100 80GB machine, the following error occurred a while after starting:

steps:   0%|                                                                                         | 373/381280 [1:58:19<2014:00:30, 19.03s/it, avr_loss=0.0848]
steps:   0%|                                                                                         | 374/381280 [1:58:25<2010:03:41, 19.00s/it, avr_loss=0.0848]
steps:   0%|                                                                                         | 374/381280 [1:58:25<2010:03:41, 19.00s/it, avr_loss=0.0848]
steps:   0%|                                                                                         | 374/381280 [1:58:30<2011:31:38, 19.01s/it, avr_loss=0.0848]
steps:   0%|                                                                                         | 374/381280 [1:58:35<2012:59:34, 19.03s/it, avr_loss=0.0848][rank1]: Traceback (most recent call last):
[rank1]:   File "/sd-scripts/sdxl_train_network.py", line 185, in <module>
[rank1]:     trainer.train(args)
[rank1]:   File "/sd-scripts/train_network.py", line 806, in train
[rank1]:     for step, batch in enumerate(train_dataloader):
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/data_loader.py", line 458, in __iter__
[rank1]:     next_batch = next(dataloader_iter)
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
[rank1]:     data = self._next_data()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
[rank1]:     return self._process_data(data)
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
[rank1]:     data.reraise()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
[rank1]:     raise exception
[rank1]: OSError: Caught OSError in DataLoader worker process 4.
[rank1]: Original Traceback (most recent call last):
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
[rank1]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
[rank1]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
[rank1]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 348, in __getitem__
[rank1]:     return self.datasets[dataset_idx][sample_idx]
[rank1]:   File "/sd-scripts/library/train_util.py", line 1207, in __getitem__
[rank1]:     img, face_cx, face_cy, face_w, face_h = self.load_image_with_face_info(subset, image_info.absolute_path)
[rank1]:   File "/sd-scripts/library/train_util.py", line 1092, in load_image_with_face_info
[rank1]:     img = load_image(image_path)
[rank1]:   File "/sd-scripts/library/train_util.py", line 2352, in load_image
[rank1]:     img = np.array(image, np.uint8)
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/Image.py", line 696, in __array_interface__
[rank1]:     new["data"] = self.tobytes()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/Image.py", line 755, in tobytes
[rank1]:     self.load()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/WebPImagePlugin.py", line 160, in load
[rank1]:     data, timestamp, duration = self._get_next()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/WebPImagePlugin.py", line 127, in _get_next
[rank1]:     ret = self._decoder.get_next()
[rank1]: OSError: failed to read next frame


steps:   0%|                                                                                         | 374/381280 [1:58:40<2014:26:18, 19.04s/it, avr_loss=0.0848]W0624 22:22:13.858000 140247365268672 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 75699 closing signal SIGTERM
E0624 22:22:14.275000 140247365268672 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 75700) of binary: /root/.conda/envs/lora/bin/python3
Traceback (most recent call last):
  File "/root/.conda/envs/lora/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.conda/envs/lora/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1027, in <module>
    main()
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in main
    launch_command(args)
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdxl_train_network.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-24_22:22:13
  host      : intern-studio-40021203
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 75700)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I hope the author can find the cause of this problem. Thanks!
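For what it's worth, the worker traceback shows the crash comes from PIL's WebP decoder (`WebPImagePlugin._get_next` raising `OSError: failed to read next frame`) while `train_util.load_image` converts an image to a NumPy array, so the dataset very likely contains a corrupt or truncated (possibly animated) `.webp` file. Below is a minimal pre-scan sketch, not part of sd-scripts, that forces a full decode of every image the same way `load_image` does, so offending files can be found and removed before a multi-hour run. The dataset path is an assumption; replace it with your training folder.

```python
# Hypothetical pre-scan script: walk the dataset and fully decode each image,
# mirroring what train_util.load_image does (PIL open -> np.array), so files
# that would crash a DataLoader worker are flagged up front.
import os

import numpy as np
from PIL import Image

DATASET_DIR = "/path/to/dataset"  # assumption: point this at your image folder
EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}

bad_files = []
for root, _dirs, files in os.walk(DATASET_DIR):
    for name in files:
        if os.path.splitext(name)[1].lower() not in EXTS:
            continue
        path = os.path.join(root, name)
        try:
            with Image.open(path) as img:
                img = img.convert("RGB")  # forces a full decode of the first frame
                np.array(img, np.uint8)   # same conversion the trainer performs
        except (OSError, Image.DecompressionBombError) as e:
            bad_files.append(path)
            print(f"BAD: {path}: {e}")

print(f"done, {len(bad_files)} unreadable file(s)")
```

Deleting or re-encoding whatever this flags should let training get past the step where it currently dies, though it doesn't rule out other corrupt files further into the 4.88M-image set.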
