Practical Edge Computing solutions increasingly rely on technologies such as Neural Networks, which require substantial computation to train and run accurately. Edge deployments therefore need optimization techniques that deliver best-in-class performance without sacrificing accuracy. This blog post discusses some of the optimization methods that can be used to maximize the performance of Neural Networks on Edge Computing devices, from both performance and accuracy perspectives.
Structural pruning is a Neural Network optimization technique in which unimportant weights and neurons are removed (pruned) from the network to make it more efficient. Because structural pruning modifies the structure of the network itself, it directly impacts the network's performance. Structural pruning methods can be divided into two categories:
Gradient-Based Pruning - Gradient-based pruning removes the weights that contribute least to the loss/accuracy of the network. This is done by calculating the gradient of the loss function with respect to the weights and iteratively pruning the weights with the smallest gradients (a code sketch follows below).
Manual Pruning - Manual pruning is a technique whereby weights or neurons are hand-selected and removed based on heuristics such as importance or frequency of use.
Both gradient-based and manual pruning have proven useful for optimizing Neural Networks for Edge Computing and can drastically reduce the size and computational cost of a network.
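As a rough illustration of the gradient-based approach, the PyTorch sketch below scores each neuron of a layer by the norm of its weight gradients and masks out the lowest-scoring ones. The toy model, layer sizes, and 30% pruning ratio are illustrative assumptions, not prescriptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

# Toy model and batch; sizes and the 30% ratio are illustrative assumptions.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))

# One backward pass gives the gradient of the loss w.r.t. every weight.
loss = F.cross_entropy(model(x), y)
loss.backward()

# Score each output neuron of the first layer by the norm of its weight
# gradients, then prune the 30% of neurons that matter least to the loss.
layer = model[0]
scores = layer.weight.grad.norm(dim=1)   # one score per output neuron
k = int(0.3 * scores.numel())
to_prune = scores.argsort()[:k]          # neurons with the smallest gradients
mask = torch.ones_like(layer.weight)
mask[to_prune] = 0.0
prune.custom_from_mask(layer, name="weight", mask=mask)

# In practice this score-and-prune step is repeated iteratively,
# with fine-tuning in between rounds.
```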
Another technique for Neural Network optimization for Edge Computing is Weight Quantization. This process involves quantizing the weights, i.e. reducing the number of bits used to represent them, from a floating-point representation (float32, etc.) to a fixed-point representation (uint8, int16, etc.). This allows for more efficient computation: with drastically fewer bits per weight, the memory and arithmetic required to store and manipulate the weights shrink accordingly.
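As a minimal sketch, PyTorch's dynamic quantization can convert a model's float32 Linear weights to int8 in a single call (the toy model and its sizes below are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Toy model; layer sizes are illustrative assumptions.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Convert the float32 weights of every Linear layer to int8.
# Activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
out = quantized(x)  # same interface, roughly 4x smaller weights
```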
In addition, Weight Quantization can improve performance for applications such as Object Detection: fixed-point arithmetic maps efficiently onto the integer units found in typical edge hardware, so a quantized network can meet tight latency budgets. And because the outputs are computed from fixed-point values, results are deterministic and consistent across devices.
Model Compression is another technique used for Neural Network optimization for Edge Computing. It involves compressing the model's weights while maintaining the same (or nearly the same) accuracy. Model Compression can be achieved by various methods, such as Model Distillation and Knowledge Distillation.
Model distillation is a technique wherein a large, complex model (the teacher) is used to train a smaller, more efficient model (the student) to reproduce its predictions. Knowledge distillation goes a step further: the student is trained on the teacher's full output distribution (its soft targets) rather than on hard labels alone, so it learns not just what the teacher predicts but how the teacher generalizes. This brings further improvements in accuracy relative to model size, making it ideal for Edge Computing.
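A common way to implement this, sketched below under the usual soft-target formulation, is a loss that blends hard-label cross-entropy with a KL-divergence term against the teacher's softened outputs. The temperature and mixing weight are illustrative defaults, not values from this post:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    # Hard-label loss: the student still learns from ground truth.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-target loss: match the teacher's softened output distribution,
    # which encodes how the teacher generalizes across classes.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # standard scaling to keep gradients comparable
    return alpha * hard + (1 - alpha) * soft
```

During training, the teacher runs in inference mode to produce `teacher_logits` for each batch, and only the student's parameters are updated.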
There are various techniques that can be used to optimize Neural Networks for Edge Computing. These techniques, such as Structural Pruning, Weight Quantization, and Model Compression, can reduce the size and computational latency of Neural Networks while keeping accuracy close to that of the original models. They are thus ideal for edge computing solutions that require real-time performance.