Course Overview: Most machine/deep learning systems require heavy computational resources and long training and deployment times. Many of these algorithms can be accelerated using GPU programming. The goal of this course is to cover the state of the art in GPU-accelerated machine learning systems.
The microcredit course will first cover the theoretical basics of machine learning and deep learning algorithms, with emphasis on their prospects for GPU-based execution. Several CUDA-based GPU technologies will then be explained. The course will focus on practical aspects of implementation, with live demonstrations and worked examples. Hands-on sessions on the Paramshakti supercomputer will be conducted.
Syllabus:
Fundamentals of GPU Architecture & CUDA
Introduction to Accelerated Data Science: RAPIDS
Introduction to Machine Learning Algorithms
Case Study/Hands-on: Solving and Benchmarking an End-to-End Data Science Problem using RAPIDS (a minimal sketch follows this list)
Introduction to Deep Neural Networks & Deep Learning
NVIDIA CUDA-X Platform Overview: Accelerated Computing for Deep Neural Networks
Accelerating and Scaling Deep Neural Networks using DALI, Mixed Precision and Multi-GPU Scaling
Optimization and Deployment of Neural Networks using TensorRT & Triton Inference Server
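For the RAPIDS case study referenced above, a minimal illustrative sketch in Python (assuming cuDF and cuML are installed and a CUDA-capable GPU is available; the file name and column names are placeholders, not course materials):

    import cudf
    from cuml.linear_model import LinearRegression

    # Read a CSV straight into GPU memory (file and column names are placeholders)
    df = cudf.read_csv("data.csv")
    X = df[["feature_a", "feature_b"]]
    y = df["target"]

    # GPU-accelerated linear regression from cuML
    model = LinearRegression()
    model.fit(X, y)
    pred = model.predict(X)
    print(pred[:5])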
Sparse matrices: discretization of differential equations, storage schemes for sparse matrices, permutations and reorderings, direct solution methods
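As a concrete illustration of a sparse storage scheme, the sketch below builds the compressed sparse row (CSR) representation of a small finite-difference matrix with scipy.sparse; this is an illustrative example, not part of the official course material:

    import numpy as np
    from scipy.sparse import csr_matrix

    # Small sparse matrix arising, e.g., from a 1-D finite-difference discretization
    A = np.array([[ 2., -1.,  0.,  0.],
                  [-1.,  2., -1.,  0.],
                  [ 0., -1.,  2., -1.],
                  [ 0.,  0., -1.,  2.]])

    # CSR stores only the nonzeros: values, their column indices, and row pointers
    A_csr = csr_matrix(A)
    print(A_csr.data)     # nonzero values, row by row
    print(A_csr.indices)  # column index of each stored value
    print(A_csr.indptr)   # index into data/indices where each row starts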
Iterative methods and convergence: SOR, gradient search methods: steepest descent, conjugate gradient algorithm, Krylov subspace methods: Arnoldi's method, GMRES, symmetric Lanczos algorithm, convergence analysis, block Krylov methods, preconditioning techniques, ILU factorization preconditioners, multigrid methods
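For reference, a textbook (unpreconditioned) conjugate gradient solver in Python, added here purely as an illustration of the algorithm named above:

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
        """Solve A x = b for symmetric positive definite A (textbook CG, no preconditioner)."""
        x = np.zeros_like(b)
        r = b - A @ x          # residual
        p = r.copy()           # search direction
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    A = np.array([[4., 1.], [1., 3.]])
    b = np.array([1., 2.])
    print(conjugate_gradient(A, b))   # should match np.linalg.solve(A, b)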
Domain decomposition: Schwarz algorithms and the Schur complement, graph partitioning: geometric approach, spectral techniques
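To make the spectral partitioning idea concrete, the sketch below bisects a small graph using the Fiedler vector of its Laplacian; the example graph is made up for illustration:

    import numpy as np

    # Adjacency matrix of a small example graph (two triangles joined by one edge)
    A = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float)

    # Graph Laplacian L = D - A
    L = np.diag(A.sum(axis=1)) - A

    # Eigenvector of the second-smallest eigenvalue (Fiedler vector)
    eigvals, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]

    # Split vertices by the sign of the Fiedler vector
    part = fiedler > 0
    print("partition A:", np.where(part)[0], "partition B:", np.where(~part)[0])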
Parallel computing: architectures for parallel computing, shared and distributed memory, performance metrics, parallelization of simple algorithms
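For reference, the standard performance metrics alluded to above are usually written as follows (added here for illustration; $T_1$ is the serial runtime, $T_p$ the runtime on $p$ processors, and $f$ the non-parallelizable fraction in Amdahl's law):

    S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}, \qquad S_{\mathrm{Amdahl}}(p) \le \frac{1}{f + (1 - f)/p}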
MPI and OpenMP: basic MPI and OpenMP calls, parallelizing matrix solvers using domain decomposition
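The basic MPI calls mentioned above can be sketched in Python via mpi4py, shown here only as an illustration (the course itself may use C/Fortran MPI; OpenMP, being compiler-directive based, has no direct Python equivalent):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Each rank owns a slice of the index range (a toy domain decomposition)
    local_n = 1_000_000 // size
    local_sum = np.arange(rank * local_n, (rank + 1) * local_n, dtype=np.float64).sum()

    # Reduce the partial results onto rank 0
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)
    if rank == 0:
        print("global sum:", total)

    # Run with e.g.: mpirun -np 4 python this_script.py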
CUDA: GPGPU architecture, thread algebra for matrix operations, accelerating matrix solvers using CUDA
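A minimal CUDA-style kernel, written with Numba's CUDA JIT rather than native CUDA C to stay consistent with the other Python sketches, illustrating how the 2-D thread index maps onto matrix entries (assumes Numba and a CUDA-capable GPU; purely illustrative):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_matrices(a, b, out):
        # Each thread handles one (i, j) entry of the matrices
        i, j = cuda.grid(2)
        if i < out.shape[0] and j < out.shape[1]:
            out[i, j] = a[i, j] + b[i, j]

    n = 1024
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    out = np.zeros((n, n))

    threads_per_block = (16, 16)
    blocks = ((n + 15) // 16, (n + 15) // 16)
    add_matrices[blocks, threads_per_block](a, b, out)  # Numba copies host arrays to/from the GPU
    print(np.allclose(out, a + b))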
Introduction to HPC architecture and parallel programming: basic architecture and organization: memory hierarchy, shared and distributed memory architectures, multiprocessor architecture, introduction to thread-level parallelism, accelerators (GPU, Xeon Phi), performance prediction and evaluation; parallel programming/computing: introduction to MPI/OpenMP, basics of CUDA programming; optimizing cluster operation: running jobs in an HPC environment, job scheduler, cluster-level load balancing
Special methods for studying complex systems: basics of statistical mechanics, potential energy surface, introduction to molecular mechanics, simulation methods: molecular dynamics and Monte Carlo simulations, enhanced sampling methods, coarse-grained modeling
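To make the molecular dynamics idea concrete, here is a tiny velocity-Verlet integrator for a 1-D harmonic oscillator; this is purely illustrative and far simpler than the production MD packages discussed below:

    # 1-D harmonic oscillator: force F = -k x, analytic period 2*pi*sqrt(m/k)
    k, m, dt, steps = 1.0, 1.0, 0.01, 1000
    x, v = 1.0, 0.0

    def force(x):
        return -k * x

    for _ in range(steps):
        # Velocity Verlet: half-kick, drift, half-kick
        v += 0.5 * dt * force(x) / m
        x += dt * v
        v += 0.5 * dt * force(x) / m

    energy = 0.5 * m * v**2 + 0.5 * k * x**2
    print(f"x = {x:.4f}, v = {v:.4f}, total energy = {energy:.6f}")  # energy stays near 0.5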
Applications to complex systems: open-source software: MD and MC simulation packages, parallelization in software: domain/spatial decomposition, distribution of non-bonded interactions, dynamic load balancing, multiprocessor communication, modeling of soft-matter systems such as biomolecules, polymers, carbon nanostructures, etc.; computation of thermodynamic, kinetic and mechanical properties of different complex systems