Useful Machine Learning resources

This page suggests tools for common problems that users who are not ML experts might face.

AI4EOSC webinars

The AI4EOSC project has organized a series of webinars on the use of the platform (based on the AI4OS software stack), AI, machine learning, deep learning, image processing, image segmentation and other relevant topics. These can be accessed on YouTube at the following links:


Here are some basic resources to get you quickly started in the Deep Learning / Machine Learning world.


  • Deep Learning with Python, F. Chollet

  • The FastAI book

  • Deep Learning Book, Ian Goodfellow



Dataset labeling

Some tools to help you get started creating your dataset.

  • CVAT - Image annotation tool (with integration with the Segment Anything Model)

  • LabelStudio - General annotation (text, images, etc.)

  • LabelImg - Image annotation

  • refinery - Labeling for NLP

  • superintendent - ipywidget-based interactive labelling tool for your data.

  • VGG Image Annotator (VIA) - Image annotation

  • Biigle - Web based annotation and exploration of images and videos

  • Roboflow - free only if your dataset is public

  • Labelbox - paid tool (free with educational license)

Find a dataset

If you don’t have any data, try to find an open dataset that suits your needs.

Explore your dataset

Let’s make sure the dataset does not contain errors.

  • Google’s Know your data - only valid for common TensorFlow Datasets

  • Sweetviz - explore and compare tabular data

  • cleanlab - dataset cleaning

  • FastDup - dataset cleaning. Finds anomalies, duplicate and near-duplicate images, clusters of similar images, broken images, image statistics and wrong labels.

  • deepchecks - checks related to various types of issues, such as model performance, data integrity, distribution mismatches, and more.

  • kangas - exploring, analyzing, and visualizing large-scale multimedia data

  • Impyute - missing data
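
As a rough illustration of the kind of checks these tools automate, here is a minimal, library-free sketch that looks for exact duplicate rows and missing values. The toy rows and the helper names (find_duplicates, count_missing) are made up for this example and are not part of any tool listed above.

```python
from collections import Counter

# Hypothetical toy dataset: each row is (feature_a, feature_b, label).
rows = [
    (1.0, 0.5, "cat"),
    (2.0, None, "dog"),
    (1.0, 0.5, "cat"),  # exact duplicate of the first row
    (3.0, 0.1, "dog"),
]


def find_duplicates(rows):
    """Return the rows that appear more than once, mapped to their counts."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


def count_missing(rows):
    """Return the number of None values for each column index."""
    n_cols = len(rows[0])
    return [sum(1 for row in rows if row[i] is None) for i in range(n_cols)]


duplicates = find_duplicates(rows)
missing = count_missing(rows)
print("duplicated rows:", duplicates)
print("missing values per column:", missing)
```

Real datasets usually need fuzzier checks (near-duplicates, label noise), which is where the dedicated tools become useful.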

Feature selection

Sometimes less is more. Learn how to select the most relevant features of your dataset.
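
One of the simplest selection criteria is to drop features that barely vary, since they carry little information. The sketch below shows the idea using only the standard library. The column names, the values and the select_by_variance helper are assumptions for illustration.

```python
from statistics import pvariance


def select_by_variance(columns, threshold=0.0):
    """Return the names of columns whose population variance exceeds threshold."""
    return [name for name, values in columns.items() if pvariance(values) > threshold]


# Hypothetical tabular data stored column by column.
columns = {
    "height": [1.6, 1.8, 1.7, 1.9],
    "constant": [1.0, 1.0, 1.0, 1.0],  # same value everywhere, so it is dropped
    "age": [20, 35, 50, 65],
}

selected = select_by_variance(columns)
print(selected)
```

Variance depends on the scale of each feature, so a non-zero threshold is only meaningful once features are on comparable scales.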

Imbalanced learning

Do you have too much data from one class and too little from others? Let’s balance things out!
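
A common baseline for balancing classes is random oversampling: minority-class samples are repeated at random until every class is as large as the biggest one. The following is a minimal sketch of that idea. The random_oversample helper and the toy data are hypothetical, and dedicated libraries offer more refined strategies.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw (with replacement) as many extra samples as this class is missing.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical imbalanced dataset: four samples of class "a", one of class "b".
samples = ["a1", "a2", "a3", "a4", "b1"]
labels = ["a", "a", "a", "a", "b"]

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, otherwise copies of the same sample can leak into the validation or test data.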

Data augmentation

Do you have little data? Make the most of it!

  • Augly - General augmentation (text, images, etc.)

  • imgaug - Image augmentation
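
To give an idea of what augmentation does, the sketch below generates flipped variants of a tiny image represented as a nested list of pixel values. This is only an illustration with made-up helper names. The libraries above provide many more transformations (rotations, crops, noise, color changes) and operate on real image formats.

```python
def horizontal_flip(image):
    """Mirror the image left to right (reverse every row)."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror the image top to bottom (reverse the row order)."""
    return [row[:] for row in image[::-1]]


def augment(image):
    """Return the original image plus its flipped variants."""
    return [image, horizontal_flip(image), vertical_flip(image)]


# Hypothetical 2x3 grayscale image.
image = [
    [1, 2, 3],
    [4, 5, 6],
]

for variant in augment(image):
    print(variant)
```

Only use transformations that preserve the label: flipping a photo of a cat is harmless, while flipping an image of a handwritten digit may not be.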

Dataset shift

Is the quality of your data likely to degrade over time (e.g. the camera gets dirty)? Keep an eye on it!
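
One simple way to watch for shift is to compare the distribution of a feature in new data against a reference sample. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distributions) with the standard library. The brightness values and the alert threshold of 0.5 are assumptions chosen for illustration, not recommended settings.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the empirical CDFs of the two samples."""
    ref = sorted(reference)
    cur = sorted(current)
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in ref + cur
    )


# Hypothetical mean image brightness per day, before and after the lens got dirty.
clean_camera = [0.80, 0.82, 0.79, 0.81, 0.83]
dirty_camera = [0.60, 0.62, 0.58, 0.61, 0.59]

shift = ks_statistic(clean_camera, dirty_camera)
print("KS statistic:", shift)
if shift > 0.5:  # hypothetical alert threshold
    print("The input distribution has changed, check the data source.")
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap at all).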


Model development

If you want to develop a model from scratch, don’t try to be a hero! Papers with Code gathers top-performing models for multiple tasks, together with their corresponding code. Reuse them for your use cases! Rather than looking for the top-ranked model, look for the one with the cleanest code.

Training monitoring

Let’s keep an eye on the training status.
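
Even without a dedicated dashboard, a lot can be learned by logging the validation loss after every epoch and checking whether it is still improving. The following minimal sketch assumes you already collect such a list. The loss values, the patience of 3 epochs and the helper name are hypothetical.

```python
def epochs_without_improvement(val_losses, min_delta=0.0):
    """Count the epochs elapsed since the validation loss last reached a new best."""
    best = float("inf")
    stale = 0
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
    return stale


# Hypothetical validation losses, logged once per epoch.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68]

stale = epochs_without_improvement(val_losses)
patience = 3  # hypothetical number of stale epochs tolerated
if stale >= patience:
    print(f"No improvement for {stale} epochs, consider stopping the training.")
```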

Training debugging

Is your training failing for some reason?

Model optimization

Do you need your model to go faster?

  • VoltaML - accelerate ML models with a single line of code

  • sparse-ml

  • deep-sparse

  • PyTorch quantization

  • AITemplate - transforms deep neural networks into CUDA (NVIDIA GPU) / HIP (AMD GPU) C++ code for lightning-fast inference serving

  • Hummingbird - transforms traditional ML models (e.g. Random Forest) into neural networks, so they can benefit from hardware acceleration
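
Quantization, one of the techniques listed above, speeds up inference by storing weights as small integers instead of 32-bit floats. The sketch below illustrates the general idea with symmetric int8 quantization in plain Python. It is not the PyTorch API, and the function names and weights are made up for this example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in quantized]


# Hypothetical layer weights.
weights = [0.5, -1.0, 0.25, 0.0]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("quantized:", quantized)
print("max reconstruction error:", max_error)
```

The reconstruction error per weight is bounded by half the scale, which is the accuracy traded for a smaller and faster model.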