Tomasz Kloda
Real-time scheduling of dynamic neural networks on multi-Edge TPU
"Integrating deep neural network (DNN) accelerators in real-time systems, where tasks are subject to timing constraints and must be executed at a specific rate or in response to recurrent events, needs a scheduling model that leverages their parallelism and limits the reprogramming overhead.
The Edge Tensor Processing Unit (TPU) is a DNN accelerator, designed by Google, that exploits a systolic array architecture to accelerate matrix multiplications and convolutions, which are core DNN operations. Several Edge TPUs can be connected into a pipeline: when one TPU finishes processing a sample, it writes the intermediate result to the input buffer of the next TPU and starts processing a new incoming sample. Besides the higher throughput, the benefit of the TPU pipeline is that large DNN models can be divided into multiple segments and spread across multiple TPUs, making better use of their on-chip memories and thus avoiding data fetches from external memory. Our benchmarks show that as the number of TPUs in the pipeline increases, the inference time decreases up to the point where most of the model fits in the on-chip memories and the inter-TPU communication overhead becomes significant.
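This trade-off can be sketched with a simple back-of-the-envelope model. The Python snippet below is purely illustrative: the function name, cost parameters, and all numeric constants are invented for this example and are not measurements from the benchmarks mentioned above; it only reproduces the qualitative trend (external-memory fetches shrink as more of the model fits on chip, while per-hop transfer costs grow with the number of segments).

```python
# Hypothetical latency model for an Edge TPU pipeline. All names and constants
# below are invented for illustration; they are not measured values.

def pipeline_latency_ms(model_mb, n_tpus, onchip_mb=8.0, compute_ms=10.0,
                        offchip_ms_per_mb=0.8, transfer_ms_per_hop=0.5):
    """Rough per-inference latency when a model is split across n_tpus.

    compute_ms          -- systolic-array compute time for the whole model
    offchip_ms_per_mb   -- penalty for parameters streamed from external memory
    transfer_ms_per_hop -- cost of passing activations between adjacent TPUs
    """
    spill_mb = max(0.0, model_mb - n_tpus * onchip_mb)  # parameters not fitting on chip
    fetch_ms = spill_mb * offchip_ms_per_mb             # external-memory fetch penalty
    hops_ms = (n_tpus - 1) * transfer_ms_per_hop        # inter-TPU communication
    return compute_ms + fetch_ms + hops_ms

if __name__ == "__main__":
    # Latency first drops as more of the 40 MB model fits on chip,
    # then creeps back up as communication hops accumulate.
    for n in range(1, 9):
        print(f"{n} TPU(s): {pipeline_latency_ms(40.0, n):5.1f} ms")
```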
In this talk, I will present our gang scheduling techniques designed to run a set of different DNN workloads on multiple Edge TPUs. I will also describe our strategy to avoid unbounded priority inversion and to set task parallelism levels so that deadlines are guaranteed to be met.
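For readers unfamiliar with the task model, the sketch below illustrates what gang scheduling means in this setting: each inference job must hold all of its assigned TPUs simultaneously for its whole execution. It is a toy, non-preemptive fixed-priority simulator with invented task parameters, not the scheduling techniques or the analysis presented in the talk; note that letting a lower-priority gang start on idle TPUs, as this greedy version does, is exactly how a waiting higher-priority gang can suffer priority inversion.

```python
# Toy non-preemptive fixed-priority gang scheduler for multiple Edge TPUs.
# Task parameters are invented; this is a generic illustration of the rigid
# gang model, not the techniques presented in the talk.

from dataclasses import dataclass

@dataclass
class GangTask:
    name: str
    period: int  # release period = relative deadline (time units)
    wcet: int    # execution time at the chosen parallelism level
    tpus: int    # parallelism level: TPUs required simultaneously

def simulate(tasks, total_tpus, horizon):
    """Simulate the schedule up to `horizon`; return False on a deadline miss."""
    # Jobs as (priority, release, absolute deadline, task); lower index = higher priority.
    pending = [(prio, r, r + t.period, t)
               for prio, t in enumerate(tasks)
               for r in range(0, horizon, t.period)]
    running, time = [], 0  # running holds (finish_time, tpus_held)
    while pending:
        running = [(f, c) for (f, c) in running if f > time]
        free = total_tpus - sum(c for _, c in running)
        released = [j for j in pending if j[1] <= time]
        # Highest-priority released job whose TPUs are all idle right now.
        # Greedily starting lower-priority gangs here is a source of priority inversion.
        fit = sorted((j for j in released if j[3].tpus <= free),
                     key=lambda j: (j[0], j[1]))
        if not fit:
            events = [f for f, _ in running] + [j[1] for j in pending if j[1] > time]
            if not events:
                break  # a job asks for more TPUs than exist
            time = min(events)
            continue
        job = fit[0]
        pending.remove(job)
        _prio, _rel, deadline, t = job
        finish = time + t.wcet
        if finish > deadline:
            print(f"{t.name}: deadline miss (finishes at {finish}, deadline {deadline})")
            return False
        running.append((finish, t.tpus))
    print("no deadline miss within the simulated horizon")
    return True

if __name__ == "__main__":
    simulate([GangTask("detector",   30, 10, 2),
              GangTask("classifier", 50, 15, 3),
              GangTask("segmenter", 100, 25, 4)],
             total_tpus=4, horizon=300)
```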
Biography
"Dr. Tomasz Kloda, INSA Toulouse / LAAS-CNRSTomasz Kloda is an Assistant Professor at INSA Toulouse, Department of Electrical and Computer Engineering and Laboratory for Analysis and Architecture of Systems (LAAS-CNRS) in France. His research focuses on real-time scheduling and embedded systems. He received his PhD from the University of Toulouse and was a postdoc at the Technical University of Munich, the University of Modena and Reggio Emilia, and Inria Paris."