Common Issues and their Solutions

RuntimeError: Expected scalar type half but found float

This can happen when running inference (or training) with mixed precision enabled on older GPU hardware. It points
to some operation not being implemented in half precision for the type of GPU you are using. There are flags to enforce
the use of fp32 for both nnUNet_predict and nnUNet_train. If you run into this error, using these flags will probably
solve it. See nnUNet_predict -h and nnUNet_train -h for what the flags are.
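
For example, on recent nnU-Net v1 releases the training-side flag is --fp32; the task name and fold below are placeholders, and you should confirm the exact flag names for your installation with nnUNet_train -h and nnUNet_predict -h:

  # hypothetical task; --fp32 disables mixed precision training
  nnUNet_train 3d_fullres nnUNetTrainerV2 TaskXXX_MYTASK 0 --fp32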

nnU-Net gets ‘stuck’ during preprocessing, training or inference

nnU-Net uses Python multiprocessing to leverage multiple CPU cores: for preprocessing, for the background workers that
do data augmentation during training, for preprocessing cases during inference, and for resampling and exporting the
final predictions during validation and inference. Unfortunately, Python (or maybe it is just me as a programmer) is not
very good at communicating errors that happen in background workers, which can cause the main process to wait
indefinitely for them to return.

Whenever nnU-Net appears to be stuck, this is what you should do:

  1. There is almost always an error message which will give you an indication of what the problem is. This error message
    is often not at the bottom of the text output, but further up. If you run nnU-Net on a GPU cluster (like we do) the
    error message may be WAYYYY off in the log file, sometimes at the very start of the training/inference. Locate the
    error message (if necessary, copy the stdout to a text editor and search for ‘error’).

  2. If there is no error message, this could mean that your OS silently killed a background worker because it was about
    to go out of memory. In this case, please rerun whatever command you have been running and closely monitor your system
    RAM (not GPU memory!) usage. If your RAM is full or close to full, you need to take action:

    • reduce the number of background workers: use -tl and -tf in nnUNet_plan_and_preprocess (you may have to
      go as low as 1!). Reduce the number of workers used by nnUNet_predict by lowering --num_threads_preprocessing and
      --num_threads_nifti_save (see the example commands after this list).
    • If even -tf 1 during preprocessing is not low enough, consider adding a swap partition located on an SSD.
    • upgrade your RAM! (32 GB should get the job done)
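
As a rough sketch, the worker-related options look like this (the task id and folder names are placeholders; adjust the numbers to your system):

  # preprocessing with a single worker each for low-res (-tl) and full-res (-tf) data
  nnUNet_plan_and_preprocess -t XXX -tl 1 -tf 1
  # inference with fewer background workers
  nnUNet_predict -i INPUT_FOLDER -o OUTPUT_FOLDER -t XXX -m 3d_fullres --num_threads_preprocessing 1 --num_threads_nifti_save 1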

nnU-Net training: RuntimeError: CUDA out of memory

This section deals with error messages such as this:

RuntimeError: CUDA out of memory. Tried to allocate 4.16 GiB (GPU 0; 10.76 GiB total capacity; 2.82 GiB already allocated; 4.18 GiB free; 4.33 GiB reserved in total by PyTorch)

This message appears when the GPU memory is insufficient. For most datasets, nnU-Net uses about 8 GB of video memory.
To ensure that you can run all trainings, we recommend using a GPU with at least 11 GB (this leaves some headroom).
If you are running other programs on the GPU you intend to train on (for example the GUI of your operating system),
the amount of VRAM available to nnU-Net is less than whatever your GPU nominally has. Please close all unnecessary programs or
invest in a second GPU. We, for example, like to use a low-cost GPU (GTX 1050 or slower) for the display outputs while
having the 2080 Ti (or equivalent) handle the training.

At the start of each training, cuDNN runs some benchmarks to figure out the fastest convolution algorithm
for the current network architecture (we use torch.backends.cudnn.benchmark=True). VRAM consumption will jump all over
the place while these benchmarks run and can briefly exceed the 8 GB nnU-Net typically requires. If you keep running into
RuntimeError: CUDA out of memory problems, you may want to disable the benchmarking. You can do so by setting the
--deterministic flag when using nnUNet_train. Setting this flag can slow down your training, so it is recommended
to only use it if necessary.
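
A minimal example, assuming a hypothetical task and fold 0:

  # --deterministic disables the cuDNN benchmarking described above (at the cost of training speed)
  nnUNet_train 3d_fullres nnUNetTrainerV2 TaskXXX_MYTASK 0 --deterministic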

nnU-Net training in Docker container: RuntimeError: unable to write to file </torch_781_2606105346>

Nvidia NGC (https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) is a great place to find Docker containers with
the most recent software (PyTorch, cuDNN, etc.) in them. When starting Docker containers with the command provided on the
Nvidia website, the container will crash with errors like this when running nnU-Net: RuntimeError: unable to write to file </torch_781_2606105346>. Please start the container with the --ipc=host flag to solve this.
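
A typical invocation might then look like this (the image tag and mounted path are placeholders; pick a current tag from the NGC catalog):

  docker run --gpus all --ipc=host -it --rm -v /path/to/data:/data nvcr.io/nvidia/pytorch:21.02-py3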

Downloading pretrained models: unzip: cannot find zipfile directory in one of /home/isensee/.nnunetdownload_16031094034174126

Sometimes downloading the large zip files containing our pretrained models fails, causing the error above. Please
make sure to use the most recent nnU-Net version (we constantly try to improve the downloading). If that does not fix it,
you can always download the zip file from our Zenodo record (https://zenodo.org/record/4003545) and use the
nnUNet_install_pretrained_model_from_zip command to install the model.
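
For example, assuming you downloaded a model zip from the Zenodo record (the file name is a placeholder):

  nnUNet_install_pretrained_model_from_zip /path/to/TaskXXX_MYTASK.zip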

Downloading pre-trained models: unzip: 'unzip' is not recognized as an internal or external command OR Command 'unzip' not found

On Windows systems and on a bare WSL2 system, the unzip command may not be present.
Either install it, manually unzip the pre-trained model downloaded from Zenodo, or update to a newer version of nnU-Net
that uses Python's built-in zipfile module (https://docs.python.org/3/library/zipfile.html).
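
On Debian/Ubuntu-based systems (including WSL2) you can install unzip via apt; alternatively, Python's standard zipfile module can extract the archive (file names below are placeholders):

  sudo apt-get install unzip
  # or, without unzip, using the Python standard library:
  python -c "import zipfile; zipfile.ZipFile('TaskXXX_MYTASK.zip').extractall('TaskXXX_MYTASK')"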

nnU-Net training (2D U-Net): High (and increasing) system RAM usage, OOM

There was an issue with mixed precision causing a system RAM memory leak. This is fixed when using cuDNN 8.0.2 or newer,
but the current PyTorch master comes with cuDNN 7.6.5. If you encounter this problem, please consider using Nvidia's NGC
PyTorch container for training (the PyTorch it comes with has a recent cuDNN version). You can also install the new
cuDNN version on your system and compile PyTorch yourself (instructions on the PyTorch website!). This is what we do at DKFZ.
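
To check which cuDNN version your PyTorch installation was built with, something like this should work:

  # prints e.g. 8002 for cuDNN 8.0.2
  python -c "import torch; print(torch.backends.cudnn.version())"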

nnU-Net training of cascade: Error seg from prev stage missing

You need to run all five folds of 3d_lowres first. Segmentations of the previous stage can only be generated from the
validation sets of those folds, otherwise we would overfit.
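
A sketch of what that looks like (the task name is a placeholder; nnUNetTrainerV2 is the default trainer):

  for FOLD in 0 1 2 3 4; do
      nnUNet_train 3d_lowres nnUNetTrainerV2 TaskXXX_MYTASK $FOLD
  done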

nnU-Net training: RuntimeError: CUDA error: device-side assert triggered

This error often goes along with something like void THCudaTensor_scatterFillKernel(TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, Real, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = -1]: block: [4770,0,0], thread: [374,0,0] Assertion indexValue >= 0 && indexValue < tensor.sizes[dim] failed..

This means that your dataset contains unexpected values in the segmentations. nnU-Net expects all labels to be
consecutive integers. So if your dataset has 4 classes (background and three foreground labels), then the labels
must be 0, 1, 2, 3 (where 0 must be background!). There cannot be any other values in the ground truth segmentations.

If you run nnUNet_plan_and_preprocess with the --verify_dataset_integrity option, this should never happen because
it will check for wrong values in the label images.
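
For example (the task id and file path are placeholders; the second command simply prints the unique values contained in one label file):

  nnUNet_plan_and_preprocess -t XXX --verify_dataset_integrity
  python -c "import numpy as np, nibabel as nib; print(np.unique(np.asanyarray(nib.load('labelsTr/case_0001.nii.gz').dataobj)))"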

nnU-Net training: Error: mmap length is greater than file size and EOFError

Please delete all .npy files in the nnUNet_preprocessed folder of the task you were trying to train, then try again.
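
A sketch of how to do that from the shell (the task folder name is a placeholder; nnUNet_preprocessed is the environment variable pointing to your preprocessed data):

  find "$nnUNet_preprocessed/TaskXXX_MYTASK" -name "*.npy" -delete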

Running nnU-Net on Azure instances

See https://github.com/MIC-DKFZ/nnUNet/issues/437 (thank you @Alaska47).
