Publications related to 128-bit processor

ColTraIn: Co-located DNN training and inference

Deep neural network inference accelerators are deployed at scale to accommodate online services, but face low average load because of service demand variability, leading to poor resource utilization. Unfortunately, reclaiming inference idle cycles is diffi ...

EPFL, 2020

Training DNNs with Hybrid Block Floating Point

Babak Falsafi, Martin Jaggi, Tao Lin, Mario Paulo Drumond Lages De Oliveira

The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them. These accelerators typically employ densely packed full-precision floating-point arithmetic t ...

Neural Information Processing Systems (NIPS), 2018

Cyme: A Library Maximizing SIMD Computation on User-Defined Containers

Felix Schürmann, Timothée Ewart

This paper presents Cyme, a C++ library aiming at abstracting the usage of SIMD instructions while maximizing the usage of the underlying hardware. Unlike similar efforts such as Boost.simd or VC, Cyme provides generic high level containers to the users wh ...

Springer International Publishing, 2014

Low-Latency Elliptic Curve Scalar Multiplication

Joppe Willem Bos

This paper presents a low-latency algorithm designed for parallel computer architectures to compute the scalar multiplication of elliptic curve points based on approaches from cryptographic side-channel analysis. A graphics processing unit implementation u ...

Springer Verlag, 2012

High-Performance Modular Multiplication on the Cell Processor

Joppe Willem Bos

This paper presents software implementation speed records for modular multiplication arithmetic on the synergistic processing elements of the Cell broadband engine (Cell) architecture. The focus is on moduli which are of special interest in elliptic curve ...

Springer-Verlag New York, New York, NY, USA, 2010

High-Performance Modular Multiplication on the Cell Processor

Joppe Willem Bos

This paper presents software implementation speed records for modular multiplication arithmetic on the synergistic processing elements of the Cell broadband engine (Cell) architecture. The focus is on moduli which are of special interest in elliptic curve ...

Springer Berlin Heidelberg, 2010

A Fast Parallel Matrix Multiplication Reconfigurable Unit Utilized In Face Recognitions Systems

In this paper we present a reconfigurable device which significantly improves the execution time of the most computational intensive functions of three of the most widely used face recognition algorithms; those tasks multiply very large dense matrices. The ...

IEEE Service Center, Piscataway, NJ, USA, 2009

Thermal Balancing Policy for Streaming Computing on Multiprocessor Architectures

Giovanni De Micheli, David Atienza Alonso, Luca Benini

As feature sizes decrease, power dissipation and heat generation density exponentially increase. Thus, temperature gradients in Multiprocessor Systems on Chip (MPSoCs) can seriously impact system performance and reliability. Thermal balancing policies base ...

2008

CORDIC-Based MMSE-DFE Coefficient Computation

Ali H. Sayed

A modular parallel architecture for a MMSE-DFE coefficient computation processor is presented. The architecture is based on QR factorization of a channel-and-noise-dependent data matrix and is implemented using CORDIC processors within a systolic array arc ...

Academic Press, 1999
