Useful Machine Learning resources¶

This page suggests tools for solving common problems that users who are not ML experts might face.

AI4EOSC webinars¶

The AI4EOSC project has organized a series of webinars on the use of the platform (based on the AI4OS software stack), AI, machine learning, deep learning, image processing, image segmentation and other relevant topics. These can be accessed on YouTube at the following links:

Warning

Please be aware that video demos can quickly become outdated. When in doubt, always refer to the written documentation.

Tutorials¶

Here are some basic resources to get you started quickly in the Deep Learning / Machine Learning world.

Books¶

Deep Learning with Python, F. Chollet
The FastAI book
Deep Learning Book, Ian Goodfellow

Courses¶

Datasets¶

Dataset labeling¶

Some tools to help you get started creating your dataset.

CVAT - Image annotation tool (with integration with the Segment Anything Model)
LabelStudio - General annotation (text, images, etc.)
LabelImg - Image annotation
refinery - Labeling for NLP
superintendent - ipywidget-based interactive labelling tool for your data.
VGG Image Annotator (VIA) - Image annotation
Biigle - Web-based annotation and exploration of images and videos
Roboflow - free only if your dataset is public
Labelbox - paid tool (free with educational license)

Find a dataset¶

If you don’t have any data, try to find an open dataset that suits you.

Explore your dataset¶

Let’s make sure the dataset does not contain errors.

Google’s Know your data - only valid for common TensorFlow Datasets
Sweetviz - explore and compare tabular data
cleanlab - dataset cleaning
FastDup - dataset cleaning. Find anomalies, duplicate and near duplicate images, clusters of similarity, broken images, image statistics, wrong labels.
deepchecks - checks related to various types of issues, such as model performance, data integrity, distribution mismatches, and more.
kangas - exploring, analyzing, and visualizing large-scale multimedia data
Impyute - missing data
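
Before reaching for one of the tools above, a quick sanity check can already reveal problems. The following is a minimal sketch using only the Python standard library, assuming the dataset is a small list of row tuples with `None` marking missing values. The helper names `find_duplicates` and `count_missing` are hypothetical and do not come from any of the listed libraries.

```python
from collections import Counter


def find_duplicates(rows):
    """Return the rows that appear more than once, mapped to their counts."""
    counts = Counter(tuple(row) for row in rows)
    return {row: n for row, n in counts.items() if n > 1}


def count_missing(rows):
    """Count None values per column index."""
    missing = Counter()
    for row in rows:
        for col, value in enumerate(row):
            if value is None:
                missing[col] += 1
    return dict(missing)


# Toy dataset (made up for illustration): two features and a label.
rows = [
    (5.1, 3.5, "setosa"),
    (4.9, None, "setosa"),
    (5.1, 3.5, "setosa"),  # exact duplicate of the first row
    (6.2, 2.9, None),
]

duplicates = find_duplicates(rows)
missing = count_missing(rows)
print("duplicated rows:", duplicates)
print("missing values per column:", missing)
```

The dedicated tools go much further (near duplicates, wrong labels, distribution checks), so treat this only as a first pass.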

Feature selection¶

Sometimes less is more. Learn how to select the appropriate features of your dataset.
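
As one simple illustration of the idea, features that never change carry no information and can be dropped. This is a minimal sketch in plain Python, assuming the dataset is a list of numeric rows. The function names and the default threshold of `0.0` are illustrative choices, not a recommendation from this documentation.

```python
from statistics import pvariance


def low_variance_columns(rows, threshold=0.0):
    """Return indices of columns whose population variance is <= threshold."""
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns) if pvariance(col) <= threshold]


def drop_columns(rows, to_drop):
    """Return a copy of rows without the given column indices."""
    drop = set(to_drop)
    return [[v for i, v in enumerate(row) if i not in drop] for row in rows]


# Toy data (made up): the middle feature is constant.
rows = [
    [1.0, 0.0, 10.0],
    [2.0, 0.0, 12.0],
    [3.0, 0.0, 11.0],
]

constant = low_variance_columns(rows)
reduced = drop_columns(rows, constant)
print("dropped columns:", constant)
print("remaining data:", reduced)
```

Variance filtering is only the most basic criterion. It says nothing about how useful a feature is for predicting the target.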

Imbalanced learning¶

Do you have too much data from one class and too little from others? Let’s balance things out!

Sklearn imbalanced
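
To make the idea concrete, here is a minimal sketch of random oversampling, one of the simplest balancing strategies: minority-class samples are duplicated at random until every class is as large as the biggest one. It uses only the standard library. The function `random_oversample` is hypothetical and is not the API of any library.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes have equal size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data (made up): four samples of class "a", one of class "b".
samples = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["a", "a", "a", "a", "b"]

balanced_samples, balanced_labels = random_oversample(samples, labels)
class_counts = Counter(balanced_labels)
print(class_counts)
```

Oversample only the training split. Duplicating samples before splitting leaks copies into the validation set.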

Data augmentation¶

Do you have little data? Make the most of it!

Augly - General augmentation (text, images, etc.)
imgaug - Image augmentation
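
The principle behind image augmentation can be sketched in a few lines: each transformation yields a new, plausible training sample from an existing one. The example below is a rough illustration in plain Python, assuming an image is a nested list of pixel rows. The function names are hypothetical, and real libraries offer many more transformations (rotation, noise, color shifts, etc.).

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in image[::-1]]


def augment(image):
    """Return the original image plus its flipped variants."""
    return [image, flip_horizontal(image), flip_vertical(image)]


# A tiny 2x3 "image" (made up) with one value per pixel.
image = [
    [1, 2, 3],
    [4, 5, 6],
]

augmented = augment(image)
for variant in augmented:
    print(variant)
```

Only use transformations that keep the label valid: flipping a photo of a cat is fine, flipping a handwritten digit may not be.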

Dataset shift¶

Is your dataset likely to degrade over time (e.g. the camera gets dirty)? Keep an eye on it!

Frouros
Alibi-detect
Avalanche - Continual Learning library based on PyTorch
River - Online learning
Cinnamon
Eurybia
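
One common way to detect shift in a single numeric feature is to compare its distribution in a reference window against a recent window. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) with the standard library only. The function name and the alert threshold of `0.5` are assumptions made for illustration. The libraries above provide properly calibrated tests.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the empirical CDFs of two samples."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    for v in sorted(set(ref) | set(cur)):
        # Fraction of each sample that is <= v.
        cdf_ref = bisect_right(ref, v) / len(ref)
        cdf_cur = bisect_right(cur, v) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


# Toy feature values (made up), e.g. mean image brightness per day.
reference = [0.1, 0.2, 0.3, 0.4, 0.5]
unchanged = list(reference)
shifted = [v + 1.0 for v in reference]

THRESHOLD = 0.5  # hypothetical alert level, tune for your data
stat_unchanged = ks_statistic(reference, unchanged)
stat_shifted = ks_statistic(reference, shifted)
drift_detected = stat_shifted > THRESHOLD
print(stat_unchanged, stat_shifted, drift_detected)
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap at all).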

Model development¶

If you want to develop a model, don’t try to be a hero! Papers with Code gathers top-performing models for multiple tasks along with their corresponding code. Reuse them for your use cases! Try to look not for the top model but for the one with the cleanest code.

If you nevertheless want to develop your model from scratch, here are some recommendations.

Other¶

Computing¶

Some useful non-AI packages to run computations:

numba - see @jit decorator
cython
numpy - important: install OpenBLAS with NumPy to accelerate computation
pandas
xarray - work better with multidimensional arrays by labelling dimensions
numexpr - accelerate NumPy computations
intelex - Intel extension to accelerate sklearn
dask - parallel computation
fugue - execute Python, pandas, and SQL code on Spark, Dask and Ray without rewrites
FAISS - efficient similarity search and clustering of dense vectors
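
As an illustration of the numba `@jit` decorator mentioned in the list, the sketch below compiles a plain Python loop to machine code. It assumes numba may or may not be installed: the `try`/`except` fallback is an addition for this example only, so that the code still runs (uncompiled) without numba.

```python
try:
    from numba import jit
except ImportError:
    # Fallback (assumption for this sketch): a no-op decorator so the
    # example still runs, just without compilation, when numba is missing.
    def jit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@jit(nopython=True)
def sum_of_squares(n):
    """Sum i*i for i in range(n) with an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(10)
print(result)
```

The first call includes compilation time, so the benefit only shows on repeated calls or long-running loops.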

GPU acceleration¶

Some packages to accelerate non-AI operations with GPUs.

pycuda
triton - simple high performance GPU programming (OpenAI)

You can use GPU-based alternatives to common libraries for faster performance:

cudf - alternative to Pandas
cuml - alternative to sklearn
cusignal - alternative to scipy signal
cugraph - for graph algorithms
cuspatial - for geospatial operations
cuxfilter - accelerate visualization (Bokeh, DataShader, Panel, Falcon, Jupyter)

Training monitoring¶

Let’s keep an eye on the training status.

TensorBoard - built for TensorFlow (PyTorch can also log to it via torch.utils.tensorboard)
TensorboardX - framework agnostic
LabML

Training debugging¶

Is your training failing for some reason?

Netron - visualize DL models
Cockpit - debug training

Model optimization¶

Do you need your model to go faster?

VoltaML - accelerate ML models with a single line of code
sparse-ml
deep-sparse
PyTorch quantization
AItemplate - transforms deep neural networks into CUDA (NVIDIA GPU) / HIP (AMD GPU) C++ code for lightning-fast inference serving
Hummingbird - transform traditional ML models (e.g. Random Forest) to neural networks, and benefit from hardware acceleration
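
To give an intuition for what quantization does, the sketch below maps floating-point weights to 8-bit integers and back using a scale and a zero point (affine quantization). This is a plain Python illustration of the general idea, not the PyTorch API. The function names and the toy weights are made up for the example.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers in [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2**num_bits - 1
    lo, hi = min(values), max(values)
    # Width of one integer step; fall back to 1.0 for constant input.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    quantized = [
        min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the integers."""
    return [(q - zero_point) * scale for q in quantized]


# Toy weights (made up for illustration).
weights = [-1.0, -0.5, 0.0, 0.5, 1.0]

quantized, scale, zero_point = quantize(weights)
restored = dequantize(quantized, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a small rounding error. Whether this actually speeds up inference depends on hardware support for integer arithmetic.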