HardCilk: Cilk-like Task Parallelism for FPGAs

High-level synthesis (HLS) helps develop hardware accelerators for field-programmable gate arrays (FPGAs) from C/C++ descriptions. HLS is tailored to exploit instruction-level parallelism and, where available, data-level parallelism in applications. Yet some applications mostly display another type of parallelism, known as task-level parallelism (TLP): they may have massive amounts of available parallelism, but their parallelizable execution threads follow starkly different control paths or make completely independent memory accesses. Alas, HLS tools offer very limited support for TLP (and much of that support targets statically scheduled, coarse-grained tasks), whereas TLP is widely supported on conventional CPU platforms via libraries like OpenCilk, Intel Threading Building Blocks, and OpenMP. In this paper, we introduce a framework for supporting software-like TLP on FPGAs. The framework provides a parameterized architectural template that implements all hardware modules needed to support TLP primitives and task management. The emphasis is on providing programmers with a software-like experience: limited hardware resources should only impact performance and never functionality. To this end, all queues held in the limited BRAMs are virtually extended to the larger HBM/DDR FPGA memory and potentially beyond. We provide an open-source Chisel hardware generator that creates a dedicated task management system from a given Cilk-like application description. The generated system is then straightforward to integrate with HLS-based processing elements to realize full TLP-enabled applications on FPGAs. In the evaluation, we focus on the efficiency and scalability of our hardware task management system and compare it with OpenCilk, a recent software TLP framework. The results show that the architectural building blocks consume a small percentage of the resources on modern FPGAs designed for data centres and achieve linear hardware scalability. For a reasonable range of parameters, the system shows near-perfect efficiency, and speedup scales linearly with the number of processing elements.