**Course Overview:** Most machine learning and deep learning systems require heavy computational resources and a long time to train and deploy. Many of these algorithms can be accelerated using GPU programming. The goal of this course is to cover the state of the art in GPU-accelerated machine learning systems.

The microcredit course will first cover the theoretical basics of machine learning and deep learning algorithms, with emphasis on their suitability for GPU-based execution. Several CUDA-based technologies for GPUs will then be explained. The course will focus on practical aspects of implementation, with demonstrations and worked examples. A hands-on session on the Paramshakti supercomputer will be conducted.

**Syllabus:**
Fundamentals of GPU Architecture & CUDA
Introduction to Accelerated Data Science: RAPIDS
Introduction to Machine Learning Algorithms
Case Study/Hands-on: Solving and Benchmarking an End-to-End Data Science Problem using RAPIDS
Introduction to Deep Neural Networks & Deep Learning
NVIDIA CUDA-X Platform Overview: Accelerated Computing for Deep Neural Networks
Accelerating and Scaling Deep Neural Networks using DALI, Mixed Precision and Multi-GPU Scaling
Optimization and Deployment of Neural Networks using TensorRT & Triton Inference Server

**Sparse matrices:** discretization of differential equations, storage schemes for sparse matrices, permutations and reorderings, direct solution methods
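As an illustration of the storage-scheme topic, the following is a minimal sketch of the compressed sparse row (CSR) format in plain Python. The function names `dense_to_csr` and `csr_matvec` and the small 1D second-difference test matrix are assumptions chosen for this example, not part of the course material.

```python
def dense_to_csr(dense):
    """Convert a dense matrix (list of rows) into CSR arrays.

    Returns (values, col_idx, row_ptr): the nonzero entries, their column
    indices, and the offset at which each row starts in `values`.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, entry in enumerate(row):
            if entry != 0:
                values.append(entry)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row = start of the next
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x touching only the stored nonzeros of A."""
    y = []
    for i in range(len(row_ptr) - 1):
        total = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            total += values[k] * x[col_idx[k]]
        y.append(total)
    return y


# Hypothetical example: tridiagonal matrix from a 1D second-difference stencil.
A = [
    [2, -1, 0, 0],
    [-1, 2, -1, 0],
    [0, -1, 2, -1],
    [0, 0, -1, 2],
]
values, col_idx, row_ptr = dense_to_csr(A)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0, 1.0])
```

Only the 10 nonzeros are stored instead of all 16 entries; for large discretized problems the saving grows with the matrix size.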

**Iterative methods and convergence:** SOR, gradient search methods: steepest descent, conjugate gradient algorithm, Krylov subspace methods: Arnoldi's method, GMRES, symmetric Lanczos algorithm, convergence analysis, block Krylov methods, preconditioning techniques, ILU factorization preconditioners, multigrid methods
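To make the conjugate gradient topic concrete, here is a minimal unpreconditioned sketch for a symmetric positive definite system, written in plain Python. The helper names (`dot`, `tridiag_matvec`, `conjugate_gradient`), the tolerance, and the small test system are assumptions made for this illustration.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def tridiag_matvec(p):
    """Apply the tridiagonal matrix with stencil (-1, 2, -1) without storing it."""
    n = len(p)
    out = []
    for i in range(n):
        left = p[i - 1] if i > 0 else 0.0
        right = p[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * p[i] - left - right)
    return out


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only x -> A x."""
    x = [0.0] * len(b)
    r = list(b)           # residual b - A x for the initial guess x = 0
    p = list(r)           # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)            # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old                 # keeps directions A-conjugate
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


b = [1.0, 0.0, 0.0, 1.0]
x = conjugate_gradient(tridiag_matvec, b)
```

The solver only needs a matrix-vector product, which is why it pairs naturally with sparse storage and with parallel or GPU implementations of that product.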

**Domain decomposition:** Schwarz algorithms and the Schur complement, graph partitioning: geometric approach, spectral techniques

**Parallel computing:** architectures for parallel computing, shared and distributed memory, performance metrics, parallelization of simple algorithms

**MPI and OpenMP:** basic MPI and OpenMP calls, parallelizing matrix solvers using domain decomposition

**CUDA:** GPGPU architecture, thread algebra for matrix operations, accelerating matrix solvers using CUDA

**Introduction to HPC architecture and parallel programming:** basic architecture and organization: memory hierarchy, shared and distributed memory architectures, multiprocessor architecture, introduction to thread-level parallelism, accelerators (GPU, Xeon Phi), performance prediction and evaluation, parallel programming/computing: introduction to MPI/OpenMP, basics of CUDA programming, optimizing cluster operation: running jobs in an HPC environment, job scheduler, cluster-level load balancing

**Special methods for studying complex systems:** basics of statistical mechanics, potential energy surface, introduction to molecular mechanics, simulation methods: molecular dynamics and Monte Carlo simulations, enhanced sampling methods, coarse-grained modeling
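As a small illustration of the Monte Carlo simulation topic, the sketch below applies the Metropolis acceptance rule to a single particle in a one-dimensional harmonic potential. The potential U(x) = x²/2, the reduced temperature kT = 1, the step size, the seed, and all function names are assumptions chosen for this example; at kT = 1 the sampled average of x² should be close to 1.

```python
import math
import random


def potential(x):
    """Harmonic potential energy U(x) = x^2 / 2 in reduced units."""
    return 0.5 * x * x


def metropolis_mean_x2(n_steps, kT=1.0, step=1.0, seed=0):
    """Estimate <x^2> under the Boltzmann distribution exp(-U(x)/kT)."""
    rng = random.Random(seed)
    x = 0.0
    energy = potential(x)
    total = 0.0
    for _ in range(n_steps):
        trial = x + rng.uniform(-step, step)      # symmetric trial move
        delta = potential(trial) - energy
        # Metropolis rule: always accept downhill moves, accept uphill
        # moves with probability exp(-delta / kT).
        if delta <= 0.0 or rng.random() < math.exp(-delta / kT):
            x = trial
            energy += delta
        total += x * x                            # rejected moves count again
    return total / n_steps


mean_x2 = metropolis_mean_x2(100_000)
```

Counting the current state again after a rejected move is required for the averages to follow the Boltzmann distribution.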

**Applications to complex systems:** open-source software: MD and MC simulation packages, parallelization in software: domain/spatial decomposition, distribution of non-bonded interactions, dynamic load balancing, multiprocessor communication, modeling of soft matter systems such as biomolecules, polymers, carbon nanostructures etc., computation of thermodynamic, kinetic and mechanical properties of different complex systems