Frequently Asked Questions (FAQ)¶
This page gathers known issues of the platform, along with possible solutions. If your issue does not appear here, please contact support.
Hardware issues¶
🔥 The Dashboard shows there are free GPUs but my deployment is still queued¶
This can happen sometimes when a GPU gets stuck in the system and is not correctly freed.
Please contact support if this happens to you!
🔥 I ran out of disk in my deployment¶
You are trying to download some data but the following error is raised:
RESOURCE_EXHAUSTED: Out of memory while trying to allocate ******** bytes
This means that you have consumed more disk than what you initially requested. You can see your current disk consumption using:
$ df -h | grep overlay
This will show you three values, respectively the Total | Used | Remaining disk space.
To solve this, first make sure to delete the files in the Trash (/root/.local/share/Trash/files). Files deleted from the JupyterLab UI end up there, so the disk space is not actually freed.
If you still do not have enough disk, you have two options:
create a new deployment, requesting more disk in the configuration,
access your Nextcloud dataset files via a virtual filesystem, in order to avoid overloading the disk.
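For reference, the whole check-and-clean sequence might look something like the following sketch (it assumes the default JupyterLab Trash location mentioned above):
$ df -h | grep overlay                       # check current disk usage (Total | Used | Remaining)
$ rm -rf /root/.local/share/Trash/files/*    # empty the JupyterLab Trash
$ df -h | grep overlay                       # confirm the space has been freed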
🔥 My deployment does not correctly list my resources¶
The deployments in the platform are created as Docker containers. Therefore some resources might not be properly virtualized like in a traditional Virtual Machine. This means that standard commands for checking resources might give you higher numbers than what is really available (i.e. they give you the resources of the full Virtual Machine where Docker is running, not the resources available to your individual Docker container).
Standard commands:
CPU:
lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
RAM memory:
free -h
Disk:
df -h
Real available resources can be found with the following commands:
CPU:
printenv | grep NOMAD_CPU
will show both the reserved cores (NOMAD_CPU_CORES) and the maximum CPU limit in MHz (NOMAD_CPU_LIMIT).
RAM memory:
echo $NOMAD_MEMORY_LIMIT
or
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
Disk:
df -h | grep overlay
will show you three values, respectively the Total | Used | Remaining disk space.
It is your job to program your application to make use of these real resources (e.g. load smaller models, load less data, etc). Failing to do so could lead to your process being killed for surpassing the available resources. For example, check how to limit CPU usage in Tensorflow or Pytorch.
More info
For example trying to allocate 8GB in a 4GB RAM machine will lead to failure.
root@2dc9e20f923e:/srv# stress -m 1 --vm-bytes 8G
stress: info: [69] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [69] (415) <-- worker 70 got signal 9
stress: WARN: [69] (417) now reaping child worker processes
stress: FAIL: [69] (451) failed run completed in 6s
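On the CPU side, one lightweight way to stay within the reserved cores is to export standard threading environment variables before launching your application. This is only a sketch: it assumes your framework build honours OMP_NUM_THREADS / OPENBLAS_NUM_THREADS (common for Pytorch and OpenMP/OpenBLAS-backed builds), and my_app.py is a hypothetical entrypoint standing in for your own code:
$ export OMP_NUM_THREADS=$NOMAD_CPU_CORES        # cap OpenMP thread pools to the reserved cores
$ export OPENBLAS_NUM_THREADS=$NOMAD_CPU_CORES   # same for OpenBLAS-backed math libraries
$ python my_app.py                               # hypothetical entrypoint for your application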
🔥 My GPU just disappeared from my deployment¶
You try to list the GPU and it doesn’t appear:
$ nvidia-smi
Failed to initialize NVML: Unknown Error
This is due to a known upstream issue, which we are working on fixing. If this is happening to you, please contact support.
In the meantime, your best option is to backup your data, delete your deployment and create a new one.
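If rclone is configured in your deployment, a minimal backup sketch could look like the following (rshare: is the Nextcloud remote used elsewhere on this page; /srv/my-data and backups/my-data are placeholder paths to adapt to your case):
$ rclone copy /srv/my-data rshare:backups/my-data   # copy local data to your Nextcloud storage
$ rclone ls rshare:backups/my-data                  # check that the files made it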
Storage issues¶
🔥 I cannot access /storage¶
You try to access “/storage” and you get the message:
root@226c02330e9f:/srv# ls /storage
ls: reading directory '/storage': Input/output error
This probably means that you have entered the wrong credentials when configuring your deployment in the Dashboard.
You will need to delete the current deployment and make a new one. Follow our guidelines on how to get an RCLONE user and password to fill the deployment configuration form.
🔥 Accessing /storage runs abnormally slow¶
This happens from time to time due to connectivity issues. If this behavior persists for more than a few days, try creating a new deployment.
If latency is still slow in the new deployment, please contact support.
🔥 I cannot find my dataset under /storage/ai4-storage¶
This can happen if you are accessing the dataset from several deployments at the same time, and the ls command hasn’t properly refreshed its index.
To fix this, cd into the folder and run cd . so that ls refreshes its index (ref). Now you should be able to see your dataset.
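For example, assuming your dataset folder is /storage/ai4-storage (as above), the refresh sequence would be:
$ cd /storage/ai4-storage
$ cd .    # forces the directory listing to be refreshed
$ ls      # your dataset should now appear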
🔥 rclone fails to connect¶
You tried to connect to the AI4OS Nextcloud and the following error message is returned:
root@b79d4b9279f6:/srv# rclone about rshare:
2024/02/05 13:26:56 Failed to create file system for "rshare:": the remote url looks incorrect. Note that nextcloud chunked uploads require you to use the /dav/files/USER endpoint instead of /webdav. Please check 'rclone config show remotename' to verify that the url field ends in /dav/files/USERNAME
This is due to a change in endpoints introduced in RCLONE 1.63.X:
old endpoint: https://share.services.ai4os.eu/remote.php/webdav/
new endpoint: https://share.services.ai4os.eu/remote.php/dav/files/<USER>
So you are experiencing this because you are running RCLONE with a version higher than 1.62.
To fix this run the following command which will overwrite your endpoint:
$ echo export RCLONE_CONFIG_RSHARE_URL=${RCLONE_CONFIG_RSHARE_URL//webdav\/}dav/files/${RCLONE_CONFIG_RSHARE_USER} >> /root/.bashrc
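You can then reload the configuration and re-run the rclone about command from above as a quick sanity check:
$ source /root/.bashrc
$ echo $RCLONE_CONFIG_RSHARE_URL    # should now end in /dav/files/<USER>
$ rclone about rshare:              # should no longer fail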
More info on how to configure rclone.
Other issues¶
🔥 Service X is not working¶
Check the Status page to see if there is any maintenance action going on. If you don’t see anything, wait a couple of hours to make sure it is not a temporary issue.
If the issue persists, please contact support.
🚀 I would like to suggest a new feature¶
We are always happy to improve our software based on user feedback.
Please open an issue in the Github repo of the component you are interested in:
If you think the documentation itself can be improved, don’t hesitate to open an issue or submit a Pull Request.
You can always check that your suggested feature is not on the Upcoming features list.