Developing Peak Memory Shaving Heuristics for GPGPU Systems


We are developing a novel co-scheduling algorithm for servicing multiple DNN inference requests (and similar AI pipelines) simultaneously on a heterogeneous CPU-GPU cluster. In the current state of the art, an optimizer produces a kernel task activation schedule for multiple inference pipelines, consisting of the start time of each operation within each DNN job. Such a schedule optimizes the response time of each job but takes no account of the memory budget of the underlying system. This problem focuses on how to modify such a kernel task activation schedule so that, at any point in time, the amount of memory in use does not exceed the user-specified memory budget.

Input: A DNN job schedule giving the start time of each kernel operation of the jobs of multiple inference pipelines, the deadline of each job, and a memory budget.

Output: A new schedule with peak memory shaving applied.
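
To make the input and output concrete, the following is a minimal sketch of one possible greedy shaving heuristic. It is not the project's algorithm, only an illustration under stated assumptions: each kernel operation has a known duration and a fixed memory footprint held from its start to its end, each job is a simple chain of operations that do not overlap in the original schedule, and all names (Op, peak_memory, shave, missed_deadlines) are hypothetical. The heuristic visits operations in their original start order and delays any operation that would push memory usage over the budget until the earliest moment at which it fits.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """One kernel operation (hypothetical model, not from the source)."""
    job: str         # identifier of the DNN job this operation belongs to
    start: float     # scheduled start time
    duration: float  # assumed known execution time
    mem: float       # assumed memory held from start until start + duration


def peak_memory(ops):
    """Return the maximum memory in use at any time under the schedule."""
    events = []
    for op in ops:
        events.append((op.start, 1, op.mem))                 # allocation
        events.append((op.start + op.duration, 0, -op.mem))  # release
    # At equal times, releases (kind 0) sort before allocations (kind 1).
    events.sort()
    current = peak = 0.0
    for _, _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def shave(ops, budget):
    """Greedily delay operations so that peak memory stays within budget."""
    placed = []
    job_ready = {}  # earliest time the next operation of each job may start
    for op in sorted(ops, key=lambda o: o.start):
        if op.mem > budget:
            raise ValueError(f"operation of job {op.job} alone exceeds the budget")
        earliest = max(op.start, job_ready.get(op.job, 0.0))
        # Memory usage only drops when a placed operation ends, so those end
        # times are the only later start times worth trying.
        ends = {p.start + p.duration for p in placed}
        candidates = sorted({earliest} | {e for e in ends if e > earliest})
        for t in candidates:
            trial = Op(op.job, t, op.duration, op.mem)
            if peak_memory(placed + [trial]) <= budget:
                break
        placed.append(trial)
        job_ready[op.job] = trial.start + trial.duration
    return placed


def missed_deadlines(ops, deadlines):
    """Return the sorted list of jobs whose last operation ends too late."""
    finish = {}
    for op in ops:
        end = op.start + op.duration
        finish[op.job] = max(finish.get(op.job, 0.0), end)
    return sorted(j for j, end in finish.items() if end > deadlines[j])


if __name__ == "__main__":
    schedule = [
        Op("A", 0, 2, 6), Op("B", 0, 2, 6),
        Op("A", 2, 2, 4), Op("B", 2, 2, 4),
    ]
    shaved = shave(schedule, budget=10)
    print(peak_memory(schedule), peak_memory(shaved))
    print(missed_deadlines(shaved, {"A": 4, "B": 6}))
```

In this toy example the original schedule peaks at 12 memory units; with a budget of 10 the sketch delays job B, which brings the peak down to 10 while lengthening B's finish time from 4 to 6. This illustrates the central trade-off: shaving the peak costs response time, so a practical heuristic would also need to decide which job to delay based on deadline slack, which this sketch does not attempt.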