Real-time Machine Learning Inference
Machine learning is the force behind new services that leverage natural voice interaction and image recognition to deliver seamless social media or call-center experiences. Moreover, with their ability to identify patterns or exceptions in vast quantities of data spanning large numbers of variables, trained deep-learning neural networks are also transforming scientific research, financial planning, smart-city operation, industrial robot programming, and digital business transformation through services such as digital twins and predictive maintenance.
Whether trained networks are deployed for inference in the Cloud or in embedded systems at the network edge, users typically expect deterministic throughput and low latency. Achieving both simultaneously, within practical size and power constraints, requires an efficient, massively parallel compute engine at the heart of a system architected to move data efficiently in and out. This calls for features such as a flexible memory hierarchy and adaptable high-bandwidth interconnects.
In contrast to these demands, the Graphics Processing Unit (GPU) based engines typically used for training neural networks (a process that consumes considerable time and many teraFLOPS of compute) have rigid interconnect structures and memory hierarchies that are not well suited to real-time inference. Problems such as data replication, cache misses, and blocking are common. A more flexible and scalable architecture is needed to achieve satisfactory inference performance.
Leading Projects Leverage Configurability
Field Programmable Gate Arrays (FPGAs) that integrate optimized compute tiles, distributed local memory, and adaptable, non-blocking shared interconnects can overcome these traditional limitations to deliver deterministic throughput and low latency. Indeed, as machine learning workloads become more demanding, cutting-edge projects such as Microsoft’s Project BrainWave are using FPGAs to execute real-time calculations cost-effectively, at extremely low latencies that have proved unachievable with GPUs.
Another advanced machine learning project, by global compute-services provider Alibaba Cloud, chose FPGAs as the foundation for a Deep Learning Processor (DLP) for image recognition and analysis. FPGAs enabled the DLP to achieve low latency and high performance simultaneously, which the company’s Infrastructure Service Group believes could not have been realized with GPUs.
Figure 1 shows results from the team’s analysis of a ResNet-18 deep residual network: the FPGA-based DLP achieves a latency of just 0.174 seconds, 86% faster than a comparable GPU, while throughput measured in Queries Per Second (QPS) is more than seven times higher.
Projects such as Microsoft’s BrainWave and Alibaba’s DLP have successfully established new hardware architectures capable of accelerating AI workloads. This is just the beginning of the journey that will ultimately make machine learning acceleration widely available to Cloud-services customers, as well as industrial users and the automotive community who are more often seeking to deploy machine learning inference in embedded systems at the network edge.
Meanwhile, some service providers are keen to infuse machine learning into existing systems to enhance and accelerate established use cases. Examples include network security, where machine learning enhances pattern recognition to drive high-speed detection of malware and dangerous exceptions. Other opportunities include using machine learning applications such as facial recognition or disturbance detection to help smart cities run more smoothly.
AI Acceleration for Non-FPGA Experts
Xilinx has established an ecosystem of resources that let users take advantage of the potential of FPGAs to accelerate AI workloads in the Cloud or at the edge.
Among the tools available, ML-Suite (Figure 2) takes care of compiling neural networks to run on Xilinx FPGA hardware. It works with networks generated by common machine learning frameworks, including TensorFlow, Caffe, and MxNet. A Python API makes interacting with ML-Suite easy.
Because machine learning frameworks tend to generate neural networks based on 32-bit floating-point arithmetic, ML-Suite includes a quantizer tool that converts these networks to fixed-point equivalents better suited to FPGA implementation. The quantizer is part of a set of middleware, compilation and optimization tools, and a runtime, collectively called xfDNN, which ensures the neural network delivers the best possible performance in FPGA silicon.
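As a generic illustration of what such a quantizer does, the sketch below maps a float32 weight tensor to int8 using a single symmetric scale factor. This is a minimal teaching example of float-to-fixed-point conversion, not the actual xfDNN implementation.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric linear quantization: map float32 weights to int8
    by scaling the largest magnitude in the tensor to 127."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 values, e.g. for accuracy checks."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.003, 0.9], dtype=np.float32)
q, s = quantize_int8(w)
# per-weight quantization error is bounded by half a quantization step
assert np.max(np.abs(w - dequantize(q, s))) <= s / 2 + 1e-7
```

In practice, production quantizers calibrate scales per layer (or per channel) against representative activation data rather than weights alone, but the core float-to-integer mapping is the same.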
The ecosystem also leverages Xilinx’s acquisition of DeePhi Technology by utilizing the DeePhi pruner to remove near-zero weights and compress and simplify network layers. The DeePhi pruner has been shown to increase neural network speed by a factor of 10 and significantly reduce system power consumption without compromising accuracy.
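The effect of removing near-zero weights can be sketched with generic magnitude pruning; this is an illustrative stand-in for the idea, not the DeePhi pruner’s actual algorithm:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the given fraction of smallest-magnitude weights.
    The resulting sparse tensor is cheaper to store and compute;
    ties at the threshold may prune slightly more than requested."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.array([[0.01, -0.8], [0.002, 1.5]], dtype=np.float32)
p = magnitude_prune(w, sparsity=0.5)  # keeps only -0.8 and 1.5
```

Pruning tools typically follow removal with fine-tuning (retraining the surviving weights) so that accuracy is recovered at the target sparsity.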
When it comes to deploying the converted neural network, ML-Suite provides xDNN custom processor overlays that shield designers from the complexities of FPGA design and use on-chip resources efficiently. Each overlay typically comes with its own optimized instruction set for running various types of neural networks. Users can interact with the neural network via RESTful APIs while working within their preferred environment.
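Interaction through a RESTful API might look like the following sketch. The endpoint URL, field names, and payload layout are hypothetical illustrations, not ML-Suite’s documented interface:

```python
import base64
import json

# Hypothetical endpoint for a deployed inference service (assumption,
# not a documented ML-Suite URL).
INFERENCE_URL = "http://localhost:8080/predict"

def build_inference_request(image_bytes, model="resnet50"):
    """Package an image as a JSON payload for a REST inference service.
    Binary image data is base64-encoded so it can travel inside JSON."""
    return json.dumps({
        "model": model,
        "image": base64.b64encode(image_bytes).decode("ascii"),
    })

# A client would POST this payload to INFERENCE_URL (for example with the
# requests library) and read class probabilities from the JSON response.
payload = build_inference_request(b"raw image bytes")
```

The point of such an interface is that the caller never touches the FPGA: the overlay behind the endpoint handles scheduling and execution.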
For on-premises deployments, Xilinx Alveo™ accelerator cards remove hardware development challenges and simplify infusing machine learning with existing applications in the data center.
The ecosystem supports machine learning deployment in embedded or edge use cases, leveraging not only the pruner but also a quantizer, compiler, and runtime from DeePhi Technology to create high-performing, efficient neural networks suitable for resource-constrained embedded hardware (Figure 3). Turnkey hardware such as the Zynq™ UltraScale+™ card and Zynq 7020 System-on-Module simplifies hardware development and accelerates software integration.
There are also a number of innovative independent software vendors who have built CNN inference overlays that can be deployed to FPGAs.
Mipsology has built Zebra, a Convolutional Neural Network (CNN) inference accelerator that can easily replace a CPU or GPU. Zebra supports a number of standard networks (e.g., ResNet-50, Inception-v3, CaffeNet) as well as custom frameworks, and has demonstrated high throughput at low latency, such as ResNet-50 at 3,700 images/second.
The Omnitek DPU is another example of an inference overlay that runs high-performance Deep Neural Networks (DNNs) on an FPGA. On the GoogLeNet Inception-v1 CNN, for example, the Omnitek DPU performs inference on 224×224 images using 8-bit integer processing at over 5,300 inferences per second on a Xilinx Alveo data center accelerator card.
Reconfigurable Compute for Future Flexibility
In addition to the challenges associated with ensuring the required inferencing performance, developers deploying machine learning must also bear in mind that the entire technological landscape around machine learning and artificial intelligence is changing rapidly; today’s state-of-the-art neural networks could be quickly superseded by newer, faster networks that may not fit well with legacy hardware architectures.
At present, commercial machine learning applications tend to be focused on image handling and object or feature recognition, which are best handled using convolutional neural networks. This could change in the future as developers leverage the power of machine learning to accelerate tasks such as sorting through strings or analyzing unconnected data. Workloads like these are better served by other approaches, such as Random Forest models or recurrent architectures like Long Short-Term Memory (LSTM) networks. If the hardware must be updated to host the different model types needed to ensure fast compute times with low latency, this could take months or years.
Building an inference engine based on processors such as GPUs or custom Application-Specific Integrated Circuits (ASICs), which have a fixed architecture, leaves no easy or fast way to update the hardware. The pace of AI development is currently outstripping that of silicon, so a custom ASIC that represents the state of the art at the start of its development can be outdated before it is ready to deploy.
In contrast, the reconfigurability of FPGAs and the flexibility to customize their resources are key strengths that enable these devices to keep pace with the evolution of this exciting field. We already know that FPGAs are well suited to low-latency clustering used in unsupervised learning, another emerging branch of AI that is particularly applicable to tasks such as statistical analysis.
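To illustrate that clustering workload, here is a minimal NumPy sketch of k-means (illustrative host code, not an FPGA implementation): each iteration is a dense distance computation followed by reductions, a highly parallel pattern that maps naturally onto distributed compute tiles and local memory.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means clustering: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # dense distance computation: every point against every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # reduction: recompute each center as the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels
```

Both inner steps (all-pairs distances and per-cluster means) are independent arithmetic over disjoint data, which is why this kind of kernel benefits from massively parallel, low-latency hardware.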
Using a tool such as ML-Suite to optimize and compile the network for FPGA deployment allows developers to work at a high level in their own environment without needing FPGA expertise to direct the compiler’s decisions, while retaining the flexibility to reconfigure the hardware in the future to support later generations of neural networks.
FPGAs are known to provide the performance acceleration and future flexibility that machine learning practitioners need: not only to build high-performing, efficient inference engines for immediate deployment, but also to adapt to rapid changes in both the technology and market demands for machine learning. The challenge is to make the architectural advantages of FPGAs accessible to machine learning specialists while ensuring the best-performing and most efficient implementation. Xilinx’s ecosystem combines state-of-the-art FPGA tools with convenient APIs to let developers take full advantage of the silicon without having to learn the finer points of FPGA design.