Are you an EPFL student looking for a semester project?
Work with us on data science and visualisation projects, and deploy your project as an app on top of Graph Search.
Applications vary in the degree of instruction level parallelism (ILP) available to be exploited by a superscalar processor. The ILP can also vary significantly within an application. On one end of the microarchitecture space are monolithic superscalar designs that exploit parallelism within an application. At another end of the spectrum are clustered architectures having many simple cores that can be clocked at a higher frequency than a comparable monolithic design. A disadvantage of the clustered design is the cost to transmit results between clusters which potentially limits the performance even in high ILP applications if instructions are not mapped carefully to minimize cross communication. In this paper, we propose an approach that incorporates the strengths of the prior two by clustering multiple integer ALUs in an asymmetric fashion. In one cluster, a 6-way out-of-order pipeline efficiently executes code having high ILP. In another cluster, a simpler, but deeper, 2-way pipeline running at twice the frequency speeds up regions of code having little ILP. We use the parallelism metrics gathered from the dynamic Data Dependence Tracking (DDT) mechanism to dynamically steer instruction windows to a cluster. We demonstrate that the heterogeneous cluster design can improve performance by up to 30% over a monolithic 8- way superscalar processor.
Aurélien François Gilbert Bloch
David Atienza Alonso, Marina Zapater Sancho, Luis Maria Costero Valero, Darong Huang, Ali Pahlevan