Model compression techniques have led to reductions in the size and computational cost of Deep Learning models. However, techniques such as pruning mostly lack true co-optimization with hardware platforms. For instance, implementing unstructured pruning in dedicated hardware is not straightforward: it increases memory overhead and reduces effective bandwidth usage. Moreover, such pruning algorithms should be adapted to hardware requirements, such as the use of tiling. Therefore, in this work, we leverage Gumbel-Softmax relaxation sampling to structurally prune tiles, which benefits subsequent hardware implementations and additionally allows joint optimization with quantization. We also show that combining different pruning scenarios leads to greater sparsity. Finally, we demonstrate the benefit of applying structured pruning to fine-grained elements (weights) in an FPGA design.
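The idea of sampling per-tile keep/prune decisions via the Gumbel-Softmax relaxation can be illustrated with a minimal NumPy sketch. This is a hypothetical, simplified illustration, not the paper's implementation: the tile size, matrix shapes, and the two-column [prune, keep] logits layout are assumptions, and the hard decision shown here would use a straight-through estimator inside a real training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    # Sample from the Gumbel-Softmax (Concrete) distribution:
    # add Gumbel noise to the logits, then apply a temperature-scaled softmax.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = np.exp((logits + g) / tau)
    return y / y.sum(axis=-1, keepdims=True)

def tile_mask(logits, tau=0.5):
    # logits: (num_tiles, 2), columns = [prune, keep] scores (assumed layout).
    y = gumbel_softmax(logits, tau)
    # Hard 0/1 decision; training would combine this with a
    # straight-through gradient from the soft samples.
    return (y.argmax(axis=-1) == 1).astype(float)

def prune_tiles(W, keep, tile=4):
    # Zero out entire (tile x tile) blocks whose keep flag is 0,
    # so the sparsity pattern stays hardware-friendly.
    out_tiles, in_tiles = W.shape[0] // tile, W.shape[1] // tile
    m = keep.reshape(out_tiles, in_tiles)
    m = np.kron(m, np.ones((tile, tile)))  # expand mask to element level
    return W * m

W = rng.standard_normal((8, 8))      # toy weight matrix
logits = np.zeros((4, 2))            # 2x2 grid of 4x4 tiles, uniform scores
keep = tile_mask(logits)
Wp = prune_tiles(W, keep)
```

Because whole tiles are either kept or zeroed, the resulting mask maps directly onto tiled hardware dataflows, unlike unstructured element-wise pruning.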
David Atienza Alonso, Miguel Peon Quiros, Pasquale Davide Schiavone, Rubén Rodríguez Álvarez, Denisa-Andreea Constantinescu, Dimitrios Samakovlis, Stefano Albini