AI applications in Emerging Smart Systems
"Artificial Intelligence in Computational Imaging Systems"
Prof. Edmund LAM (HKU)
Computational imaging systems involve the co-design of the imaging optics and computational algorithms, which significantly expands the design space and promises to deliver better imaging capabilities and quality. Image reconstruction is at the core of the computational algorithms. In recent years, artificial intelligence and learning-based methods have emerged as powerful complements to the traditional model-based iterative methods. In this talk, we will systematically consider how imaging systems, equipped with both model-based and AI-based computational algorithms, are good examples of “emerging smart systems”, and discuss specific scenarios such as the use of new sensor technologies and the capture of holographic or light field data for 3D reconstruction.
"Low Precision Inference and Training for Deep Neural Networks"
Prof. Philip LEONG (University of Sydney)
The computational complexity of deep learning has led to research efforts to reduce the computation required. The use of low precision is particularly effective on FPGAs, as they are not restricted to byte-addressable operations.
We first describe our finding that in inference applications, throughput matching with higher precision on certain layers can be used to recover accuracy in low-precision deep neural networks (DNNs). The work is applied to automatic modulation classification of radio signals leveraging the capabilities of the Xilinx ZCU111 RFSoC platform. On the open-source RadioML 2018.01A dataset, we demonstrate how to recover 4.3% in accuracy with the same hardware usage and achieve 488k classifications per second.
In the second part of the talk we introduce Block Minifloat (BM), a new spectrum of minifloat formats capable of training DNNs end-to-end with only 4- to 8-bit weight, activation and gradient tensors. While standard floating-point representations have two degrees of freedom, the exponent and the mantissa, BM exposes the exponent bias as an additional field for optimization. For ResNet trained on ImageNet, 6-bit BM shows almost no degradation relative to floating-point accuracy, with FMA units that are 4.1× (23.9×) smaller and consume 2.3× (16.1×) less energy than FP8 (FP32).
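To illustrate the idea of treating the exponent bias as a tunable field, here is a minimal Python sketch of a minifloat quantizer in which a block of values shares one bias. It is a simplified illustration and not the format definition from the talk: the function names, the rule for choosing the bias (align the largest magnitude in the block with the top exponent code), and the decision to reserve no codes for infinities or NaNs are all assumptions made for this example.

```python
import math


def quantize_minifloat(x, exp_bits, man_bits, bias):
    """Round x to the nearest value of a sign/exponent/mantissa minifloat.

    Assumed layout (hypothetical, for illustration): exponent codes
    1 .. 2**exp_bits - 1 are normal numbers, code 0 is subnormal, and
    no codes are reserved for infinity or NaN.
    """
    if x == 0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    e_min = 1 - bias                      # smallest normal exponent
    e_max = (2 ** exp_bits - 1) - bias    # largest exponent
    # frexp returns mag = m * 2**ex with 0.5 <= m < 1, so floor(log2(mag)) = ex - 1
    e = math.frexp(mag)[1] - 1
    e = min(max(e, e_min), e_max)         # below e_min we fall into subnormal spacing
    spacing = 2.0 ** (e - man_bits)       # gap between neighbouring representable values
    q = round(mag / spacing) * spacing
    largest = 2.0 ** e_max * (2.0 - 2.0 ** -man_bits)
    return sign * min(q, largest)         # saturate instead of overflowing


def quantize_block(block, exp_bits, man_bits):
    """Quantize a block of values with one shared exponent bias.

    Assumption: the bias is picked so that the largest magnitude in the
    block lands on the top exponent code.
    """
    peak = max(abs(v) for v in block)
    if peak == 0:
        return [0.0] * len(block), 0
    top = math.frexp(peak)[1] - 1
    bias = (2 ** exp_bits - 1) - top
    return [quantize_minifloat(v, exp_bits, man_bits, bias) for v in block], bias


if __name__ == "__main__":
    values = [0.75, 0.1, -0.02, 0.5]
    quantized, shared_bias = quantize_block(values, exp_bits=2, man_bits=1)
    print(shared_bias, quantized)
```

Because the bias is chosen per block, scaling every value in a block by a power of two only shifts the shared bias; the pattern of quantized values is unchanged. That is the sense in which the bias acts as a third degree of freedom next to the exponent and mantissa widths.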
"Edge Computing for Low-carbon Intelligence"
Prof. Hao YU (Southern University of Science and Technology)
In the low-carbon intelligent era, edge-computing applications require optimized models and energy-efficient hardware for deep neural network (DNN) computation. Multi-precision networks are being developed rapidly for DNN model optimization, and meeting their computing requirements with high energy efficiency calls for a multi-precision convolution accelerator. Further bottlenecks are the "flux wall" and the "power wall", both caused by massive data migration. Traditional architectures cannot handle deep-learning computation efficiently.
In this talk, multi-precision accelerators are introduced that support multi-precision fixed-point operations (2 to 8 bits) and floating-point operations (FP16 to FP64). Their reconfigurable design allows the system to achieve high energy efficiency. In addition, a computing-in-memory (CIM) systolic accelerator is presented. It improves throughput and power performance by combining the advantages of CIM for data migration with a systolic flow for data transfer.
"Differentiable Dynamic Quantization with Mixed Precision and Adaptive Resolution"
Prof. Ping LUO (HKU)
Model quantization is challenging because of its many tedious hyper-parameters, such as precision (bitwidth), dynamic range (minimum and maximum discrete values) and stepsize (interval between discrete values). Unlike prior art that carefully tunes these values, we present a fully differentiable approach that learns all of them, named Differentiable Dynamic Quantization (DDQ), which has several benefits. (1) DDQ is able to quantize challenging lightweight architectures like MobileNets, where different layers prefer different quantization parameters. (2) DDQ is hardware-friendly and can be easily implemented using low-precision matrix-vector multiplication, making it suitable for many hardware platforms such as ARM. (3) DDQ reduces training runtime by 25% compared to the state of the art. Extensive experiments show that DDQ outperforms prior art on many networks and benchmarks, especially when models are already efficient and compact. For example, DDQ is the first approach that achieves lossless 4-bit quantization for MobileNetV2 on ImageNet.
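To make the three hyper-parameters named above concrete, the following Python sketch shows a plain uniform quantizer in which bitwidth, dynamic range and stepsize are set by hand. This is not DDQ itself, which learns these quantities by gradient descent; the function name and its arguments are hypothetical and chosen only for illustration.

```python
def uniform_quantize(x, bitwidth, x_min, x_max):
    """Snap x to the nearest of 2**bitwidth evenly spaced levels in [x_min, x_max].

    bitwidth       -> precision (number of discrete levels = 2**bitwidth)
    x_min, x_max   -> dynamic range (smallest and largest discrete values)
    returned step  -> stepsize (interval between neighbouring levels)
    """
    levels = 2 ** bitwidth
    step = (x_max - x_min) / (levels - 1)
    clipped = min(max(x, x_min), x_max)      # values outside the range saturate
    index = round((clipped - x_min) / step)  # integer code in 0 .. levels - 1
    return x_min + index * step, step


if __name__ == "__main__":
    for value in (0.4, 5.0, -3.0):
        quantized, step = uniform_quantize(value, bitwidth=2, x_min=-1.5, x_max=1.5)
        print(value, "->", quantized, "(step", step, ")")
```

In a hand-tuned setting each layer would need its own choice of these three values, which is the tuning burden the abstract describes.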