Reconfigurable Acceleration of Deep Learning Workloads with FPGA-Based Architectures in Edge and Embedded Systems


Abstract

As the proliferation of edge computing reshapes the landscape of artificial intelligence deployment, the demand for efficient, low-latency, and energy-conscious deep learning inference has never been more acute. Traditional acceleration platforms such as CPUs, GPUs, and ASICs face increasing challenges in meeting the unique constraints of edge environments, including limited power budgets, thermal envelopes, physical space, and the need for adaptive, context-aware operation. In this context, Field-Programmable Gate Arrays (FPGAs) have emerged as a compelling solution, offering a uniquely reconfigurable architecture that enables tailored hardware acceleration for diverse deep learning workloads. This survey provides a comprehensive and technically rigorous exploration of the design, deployment, and evaluation of scalable deep learning accelerators implemented on FPGAs, with a specific focus on edge and embedded intelligence. We begin by categorizing FPGA-based architectures into distinct paradigms, including streaming dataflow, systolic arrays, overlay-based accelerators, coarse-grained reconfigurable architectures, and hybrid heterogeneous systems, highlighting their respective strengths, limitations, and suitability for different inference scenarios. The discussion proceeds to analyze the critical design methodologies used to transform high-level neural network descriptions into efficient, synthesizable hardware representations. These include high-level synthesis (HLS), tiling strategies, loop unrolling, memory scheduling, and automated compilation flows that align with the unique spatial and temporal characteristics of FPGA fabrics. Special emphasis is placed on the role of quantization, pruning, and precision scaling in reducing hardware complexity while preserving model accuracy. We further examine runtime reconfiguration and adaptive execution techniques, in which FPGA fabrics dynamically reshape their logic and compute paths in response to changing workloads, power conditions, or functional requirements; this discussion covers partial reconfiguration, adaptive precision, workload-aware scheduling, and emerging methods of using machine learning itself to optimize the reconfigurability of hardware at deployment time. Benchmarking methodologies are also discussed in detail, with a focus on multi-dimensional evaluation metrics (throughput, energy per inference, latency, model fidelity, and reconfiguration overhead) under realistic operational constraints. A diverse set of real-world case studies is presented, encompassing applications from autonomous drones and wearable health monitors to industrial vision systems and quantized AI inference engines, showcasing the breadth of FPGA applicability and the concrete engineering challenges involved. Finally, the review synthesizes these findings to outline future research directions and systemic challenges, including the need for end-to-end toolchains, intelligent hardware-software co-design, support for dynamic and sparse models such as transformers, and the integration of FPGA fabrics into chiplet-based and 3D-stacked architectures. We argue that the evolution of FPGAs from peripheral accelerators to central elements of embedded AI systems marks a significant shift in how intelligence is architected and deployed at the edge.
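To make the quantization and precision-scaling discussion concrete, the sketch below illustrates symmetric per-tensor int8 quantization, a common first step when mapping a trained model onto the fixed-point arithmetic favored by FPGA DSP blocks. This is a minimal sketch under simple assumptions: the function name quantize_int8, the per-tensor symmetric scheme, and the sample values are all illustrative and do not come from any specific toolchain surveyed here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// Symmetric per-tensor int8 quantization: maps float weights into
// [-127, 127] with a single scale factor, reducing the bit width of
// multiplies and on-chip weight storage on an FPGA target.
struct QuantizedTensor {
    std::vector<int8_t> data;
    float scale;  // real value ~= data[i] * scale
};

QuantizedTensor quantize_int8(const std::vector<float>& weights) {
    // Scale is chosen so the largest-magnitude weight maps to +/-127.
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
    const float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

    QuantizedTensor q;
    q.scale = scale;
    q.data.reserve(weights.size());
    for (float w : weights) {
        const int v = static_cast<int>(std::lround(w / scale));
        q.data.push_back(static_cast<int8_t>(std::clamp(v, -127, 127)));
    }
    return q;
}

int main() {
    const std::vector<float> weights = {0.42f, -1.30f, 0.05f, 0.98f};
    const QuantizedTensor q = quantize_int8(weights);
    for (size_t i = 0; i < weights.size(); ++i) {
        std::cout << weights[i] << " -> " << int(q.data[i])
                  << " (dequantized " << q.data[i] * q.scale << ")\n";
    }
    return 0;
}
```

In practice, accelerator flows pair such post-training quantization with calibration or quantization-aware training to bound the accuracy loss the abstract alludes to.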
Through detailed technical analysis, practical case exploration, and a forward-looking perspective, this work provides a foundational reference for researchers, system architects, and developers aiming to harness the full potential of FPGAs for scalable, efficient, and adaptive deep learning at the intelligent edge.
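As a companion illustration of the HLS design methodologies mentioned above (tiling, loop unrolling, and memory scheduling), the following Vitis-HLS-style C++ kernel sketches a tiled matrix-vector product. The dimensions, buffer sizes, loop labels, and pragma placements are illustrative assumptions, not a tuned design from any system cited in this survey; the pragmas shown (UNROLL, PIPELINE, ARRAY_PARTITION) are standard Vitis HLS directives.

```cpp
// Tiled matrix-vector kernel in the style of Vitis HLS C++.
// Each tile of weights and inputs is staged into on-chip buffers,
// the inner loops are fully unrolled into parallel MACs, and the
// tile loop is pipelined to keep the datapath busy every cycle.
constexpr int N = 256;    // input features (illustrative)
constexpr int TILE = 32;  // tile width matched to on-chip buffer size

void mv_tile(const float weight[N], const float x[N], float* out) {
    float acc = 0.0f;

    float w_buf[TILE];
    float x_buf[TILE];
    // Partition the buffers into registers so all TILE lanes are
    // accessible in parallel within one iteration.
#pragma HLS ARRAY_PARTITION variable=w_buf complete
#pragma HLS ARRAY_PARTITION variable=x_buf complete

tiles:
    for (int t = 0; t < N; t += TILE) {
#pragma HLS PIPELINE II=1
    load:
        for (int i = 0; i < TILE; ++i) {
#pragma HLS UNROLL
            w_buf[i] = weight[t + i];
            x_buf[i] = x[t + i];
        }
    mac:
        for (int i = 0; i < TILE; ++i) {
#pragma HLS UNROLL
            acc += w_buf[i] * x_buf[i];
        }
    }
    *out = acc;
}
```

The tile width is the central knob in such designs: it trades DSP and BRAM usage against latency, which is exactly the kind of spatial/temporal co-optimization the compilation flows discussed in the survey automate.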
