Frequently Asked Questions (FAQ)ΒΆ

This page gathers known issues of the platform, along with possible solutions. If your issue does not appear here, please contact support.

πŸ”₯ Service X is not workingΒΆ

Check the Status page to see if any maintenance is going on. If you don’t see anything, wait a couple of hours to make sure it is not a temporary issue.

If the issue persists, please contact support.

πŸ”₯ The Dashboard says I only have 500 MB of disk in my deploymentΒΆ

In your deployment information, you might see that under Disk memory your deployment has 500 MB assigned, which is much less than what you initially requested.

For the time being, this number is meaningless, because we are not correctly enforcing disk limits. Users have access to all the resources of the node, so they might conflict with other users’ disk space. This is why we kindly ask users to respect a maximum of 20 GB of disk usage per deployment.

We are planning to fix this issue in the new cluster we are setting up.

If you need more than 20 GB, please check the provided option of accessing your dataset via a virtual filesystem, in order to avoid overloading the disk.

πŸ”₯ I ran out of disk in my deploymentΒΆ

The current Nomad cluster is not able to properly isolate disk space between different users sharing the same physical machine. So it might happen that some users use more than their fair share, consuming the disk space of other users on the same node. We are planning to fix this issue in the new cluster we are setting up.

First, make sure to delete the files in the Trash (/root/.local/share/Trash/files). Files end up there when deleted from the JupyterLab UI, so removing them from the UI does not actually free up space.
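
For example, a quick way to empty the Trash from a terminal (assuming the default Trash location above) is:

$ rm -rf /root/.local/share/Trash/files/*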

In the meantime, if you are sure that you are using less than 20 GB of disk, but you still find that there is no disk space left, please contact support.

If you need more than 20 GB, please check the provided option of accessing your dataset via a virtual filesystem, in order to avoid overloading the disk.
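
As an illustration only (the exact procedure is described in the virtual filesystem option mentioned above; the remote name rshare and the paths below are placeholders), a remote folder can be mounted with rclone instead of copying the data locally:

$ mkdir -p /srv/mydata
$ rclone mount rshare:/my-dataset /srv/mydata --daemon --vfs-cache-mode minimal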

πŸ”₯ I cannot access /storageΒΆ

You try to access β€œ/storage” and you get the message:

root@226c02330e9f:/srv# ls /storage
ls: reading directory '/storage': Input/output error

This probably means that you have entered the wrong credentials when configuring your deployment in the Dashboard.

You will need to delete the current deployment and make a new one. Follow our guidelines on how to get an RCLONE user and password to fill the deployment configuration form.
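
If you want to double-check which credentials your current deployment was configured with, and assuming the deployment exposes its rclone configuration through RCLONE_CONFIG_RSHARE_* environment variables (as in the rclone examples below), you can run:

$ printenv | grep RCLONE_CONFIG_RSHARE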

πŸ”₯ rclone fails to connectΒΆ

You tried to connect to the AI4OS Nextcloud and you get the following error message:

root@b79d4b9279f6:/srv# rclone about rshare:
2024/02/05 13:26:56 Failed to create file system for "rshare:": the remote url looks incorrect. Note that nextcloud chunked uploads require you to use the /dav/files/USER endpoint instead of /webdav. Please check 'rclone config show remotename' to verify that the url field ends in /dav/files/USERNAME

This is due to a change in endpoints introduced in RCLONE 1.63.X:

old endpoint: https://share.services.ai4os.eu/remote.php/webdav/
new endpoint: https://share.services.ai4os.eu/remote.php/dav/files/<USER>

You are experiencing this because you are running an RCLONE version higher than 1.62.
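
You can check which endpoint your deployment is currently using, for example by printing the corresponding environment variable; if it still ends in /webdav/, you are affected:

$ echo $RCLONE_CONFIG_RSHARE_URL
https://share.services.ai4os.eu/remote.php/webdav/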

To fix this, run the following command, which will overwrite your endpoint:

$ echo export RCLONE_CONFIG_RSHARE_URL=${RCLONE_CONFIG_RSHARE_URL//webdav\/}dav/files/${RCLONE_CONFIG_RSHARE_USER} >> /root/.bashrc
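
The new value only takes effect in new shells. To apply it in your current session and verify that the connection now works, you can run:

$ source /root/.bashrc
$ rclone about rshare: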

More info on how to configure rclone.

πŸ”₯ My deployment does not correctly list my resourcesΒΆ

The deployments in the platform are created as Docker containers. Therefore some resources might not be properly virtualized as in a traditional Virtual Machine. This means that standard commands for checking resources might give you higher numbers than what is really available (i.e. they report the resources of the full Virtual Machine where Docker is running, not the resources available to your individual Docker container).

Standard commands:

  • CPU: lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('

  • RAM memory: free -h

  • Disk: df -h

Real available resources can be found with the following commands:

  • CPU: printenv | grep NOMAD_CPU will show both the reserved cores (NOMAD_CPU_CORES) and the maximum CPU limit in MHz (NOMAD_CPU_LIMIT).

  • RAM memory: echo $NOMAD_MEMORY_LIMIT or cat /sys/fs/cgroup/memory/memory.limit_in_bytes

  • Disk: β³πŸ”§ we are working on properly limiting disk space, for the time being we ask you to kindly stick to the 20-25 GB quota .

It is your job to program your application to make use of these real resources (e.g. load smaller models, load less data, etc.). Failing to do so could get your process killed for exceeding the available resources. For example, check how to limit CPU usage in Tensorflow or Pytorch.
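
As a minimal sketch (OMP_NUM_THREADS is a common convention honoured by PyTorch and by OpenMP/MKL builds of TensorFlow, not something specific to the platform; the entry point below is hypothetical), you could cap the number of CPU threads before launching your application:

$ export OMP_NUM_THREADS=4    # match the cores reserved for your deployment
$ python train.py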

γ…€γ…€ More info

For example, trying to allocate 8 GB of RAM on a machine with only 4 GB available will lead to failure:

root@2dc9e20f923e:/srv# stress -m 1 --vm-bytes 8G
stress: info: [69] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [69] (415) <-- worker 70 got signal 9
stress: WARN: [69] (417) now reaping child worker processes
stress: FAIL: [69] (451) failed run completed in 6s

πŸ”₯ My GPU just disappeared from my deploymentΒΆ

You try to list the GPU and it doesn’t appear:

$ nvidia-smi
Failed to initialize NVML: Unknown Error

This is due to this issue. It should get fixed when we upgrade the GPU drivers, and this is planned for the next Nomad cluster we are setting up.

In the meantime, your best option is to delete your deployment and create a new one.

πŸ”₯ I delete my deployment but it keeps reappearingΒΆ

This happens from time to time, for unknown reasons. We remove those dangling deployments daily. If your deployment remains undeleted for more than a day, please contact support.

Hopefully this will be magically fixed in the new cluster we are setting up with the upgraded Nomad version.

πŸš€ I would like to suggest a new featureΒΆ

We are always happy to improve our software based on user feedback.

Please open an issue in the GitHub repo of the component you are interested in.

If you think the documentation itself can be improved, don’t hesitate to open an issue or submit a Pull Request.

You can always check that your suggested feature is not already on the Upcoming features list.