Machine learning algorithms demand high computing power and are usually accelerated with a GPU or NPU. DEBIX Model A uses the NXP i.MX 8M Plus processor, which can accelerate a variety of algorithms on its CPU, GPU and NPU. As the first i.MX applications processor with an integrated machine learning accelerator, the i.MX 8M Plus delivers strong performance for ML applications at the edge.
Key parameters of the i.MX 8M Plus (for industrial-grade products):
- CPU: 4 × Arm® Cortex®-A53, 1.6 GHz
- NPU: 2.3 TOPS
1. NXP TensorFlow Lite Test on DEBIX
1.1 TensorFlow Lite Test on DEBIX CPU
Test result:
Operation log:
debix@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples$ ./label_image
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: invoked
INFO: average time: 48.08 ms
INFO: 0.764706: 653 megalith
INFO: 0.121569: 907 wig
INFO: 0.0156863: 458 bookshop
INFO: 0.0117647: 466 broom
INFO: 0.00784314: 835 studio couch
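Each result line in the log pairs a score with a class index and label. If you want to post-process these logs in a script, a minimal parser (assuming the exact "INFO: score: index label" format shown above) could look like this:

```python
# Minimal parser for the "INFO: score: index label" lines printed by
# label_image. The format is inferred from the logs above and may vary
# between TensorFlow Lite versions.
import re

RESULT_LINE = re.compile(r"INFO: ([0-9.]+): (\d+) (.+)")

def parse_results(log_text):
    """Return (score, class_index, label) tuples from a label_image log."""
    return [(float(m.group(1)), int(m.group(2)), m.group(3))
            for m in RESULT_LINE.finditer(log_text)]

log = """INFO: average time: 48.08 ms
INFO: 0.764706: 653 megalith
INFO: 0.121569: 907 wig"""
top = parse_results(log)
print(top[0])  # (0.764706, 653, 'megalith')
```

The "average time" line is skipped automatically because it does not start with a numeric score.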
1.2 TensorFlow Lite Test on DEBIX CPU with XNNPACK Delegate
Note: a delegate is a TensorFlow Lite mechanism for handing specific operations off to optimized hardware libraries or backends. The XNNPACK delegate is one of these: a library that accelerates convolution, matrix and other deep learning computations on Arm CPUs, improving the inference performance of TensorFlow Lite models.
Test result:
Operation log:
debix@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples$ ./label_image --use_xnnpack=true
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
XNNPACK delegate created.
INFO: Applied XNNPACK delegate.
INFO: invoked
INFO: average time: 45.236 ms
INFO: 0.764706: 653 megalith
INFO: 0.121569: 907 wig
INFO: 0.0156863: 458 bookshop
INFO: 0.0117647: 466 broom
INFO: 0.00784314: 835 studio couch
1.3 TensorFlow Lite Test on DEBIX NPU
Test result:
Operation log:
debix@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples$ ./label_image --external_delegate_path=/usr/lib/libvx_delegate.so
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
EXTERNAL delegate created.
INFO: Applied EXTERNAL delegate.
W [HandleLayoutInfer:274]Op 162: default layout inference pass.
INFO: invoked
INFO: average time: 2.581 ms
INFO: 0.768627: 653 megalith
INFO: 0.105882: 907 wig
INFO: 0.0196078: 458 bookshop
INFO: 0.0117647: 466 broom
INFO: 0.00784314: 835 studio couch
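The three CLI runs above can also be reproduced from Python. The sketch below is a hedged example: it assumes the tflite_runtime package and numpy are installed on the DEBIX image, and the model and delegate paths simply mirror the commands above.

```python
# Sketch: timing the same model from Python via the TFLite interpreter.
# Assumes tflite_runtime and numpy are available on the DEBIX image;
# paths mirror the CLI examples above.
import os
import time

MODEL = "./mobilenet_v1_1.0_224_quant.tflite"
VX_DELEGATE = "/usr/lib/libvx_delegate.so"

def average_inference_ms(model_path, delegate_path=None, runs=10):
    import numpy as np
    from tflite_runtime.interpreter import Interpreter, load_delegate

    delegates = [load_delegate(delegate_path)] if delegate_path else None
    interp = Interpreter(model_path=model_path, experimental_delegates=delegates)
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    interp.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    interp.invoke()  # warm-up: the first NPU invoke compiles the graph
    start = time.monotonic()
    for _ in range(runs):
        interp.invoke()
    return (time.monotonic() - start) * 1000.0 / runs

if os.path.exists(MODEL):  # only runs on the target board
    print("CPU:", average_inference_ms(MODEL), "ms")
    print("NPU:", average_inference_ms(MODEL, VX_DELEGATE), "ms")
```

The warm-up invoke matters on the NPU path: the VX delegate compiles the graph on first use, so including it in the timed loop would inflate the average.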
1.4 Conclusion
As the results above show, when we run the label_image example under TensorFlow Lite, the XNNPACK delegate gives a slight speedup over the CPU-only run (45.236 ms vs 48.08 ms), while NPU acceleration gives a dramatic one, cutting the average inference time from 48.08 ms to 2.581 ms. It is also worth noting that the reported probabilities for megalith, wig, bookshop, broom and studio couch stay essentially unchanged across the three runs, with only tiny differences on the NPU (e.g. 0.768627 vs 0.764706 for megalith).
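The speedups implied by the averaged times in the logs above can be computed directly:

```python
# Speedups implied by the averaged inference times reported above (ms).
cpu, xnnpack, npu = 48.08, 45.236, 2.581

print(f"XNNPACK vs CPU: {cpu / xnnpack:.2f}x")
print(f"NPU vs CPU:     {cpu / npu:.2f}x")
```

That is roughly a 1.06× gain from XNNPACK and an 18.6× gain from the NPU on this model.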
2. Self-made Fruit Classification Model Test on DEBIX
2.1 Fruit Classification Model Preparation
The dataset has three labels (apple, banana and pitaya) and 1,198 images in total. In the eIQ tool we select Classification Model's Performance and then NPU, and change the configuration to Input Size: 224,224,3, Batch Size: 100, Epochs To Train: Infinity, and Model Enhancement: Default Enhancement, leaving all other settings at their defaults. We then start training the model. Once training completes, we run validation to check the recognition rate of each label and see where the model needs optimization. By adjusting the eIQ training parameters, correcting the dataset and retraining, we improve the recognition rate.
Finally, the model meets our requirements and is exported from eIQ to DEBIX. Some information about the exported model is as follows:
It can be seen that the model reaches a training accuracy of 97.6% and a validation accuracy of 93.24%.
Next, we can use some photos to test its recognition accuracy and running speed.
2.2 Apple Recognition
Photo for test:
2.2.1 Using DEBIX CPU for Apple Recognition
Test result:
Operation log:
root@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples# ./label_image -m mobilenet_v1_1.2_npu_224_fruit.tflite -i apple.bmp
INFO: Loaded model mobilenet_v1_1.2_npu_224_fruit.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 136.018 ms
INFO: 0.992188: 0 apple
INFO: 0.0078125: 1 banana
Running the classification model on the DEBIX CPU to detect an apple, the recognition rate reaches 0.992188 and the runtime is 136.018 ms.
2.2.2 Using XNNPACK Delegate for Apple Recognition
Test result:
Operation log:
root@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples# ./label_image -m mobilenet_v1_1.2_npu_224_fruit.tflite -i apple.bmp --use_xnnpack=true
INFO: Loaded model mobilenet_v1_1.2_npu_224_fruit.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
XNNPACK delegate created.
INFO: Applied XNNPACK delegate.
INFO: invoked
INFO: average time: 67.031 ms
INFO: 0.992188: 0 apple
INFO: 0.0078125: 1 banana
With the XNNPACK delegate, the recognition rate remains 0.992188 while the runtime is nearly halved, to 67.031 ms.
2.2.3 Using DEBIX NPU for Apple Recognition
Test result:
Operation log:
root@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples# ./label_image -m mobilenet_v1_1.2_npu_224_fruit.tflite -i apple.bmp --external_delegate_path=/usr/lib/libvx_delegate.so
INFO: Loaded model mobilenet_v1_1.2_npu_224_fruit.tflite
INFO: resolved reporter
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
EXTERNAL delegate created.
INFO: Applied EXTERNAL delegate.
INFO: invoked
INFO: average time: 3.682 ms
INFO: 0.988281: 0 apple
INFO: 0.0078125: 1 banana
INFO: 0.00390625: 2 pitaya
With NPU acceleration, the recognition rate is almost unchanged at 0.988281, and inference takes only 3.682 ms!
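The scores in these logs fall on a fixed grid (0.988281, 0.0078125, 0.00390625, ...) because the model's output tensor is uint8-quantized. As an illustration, assume an output scale of 1/256 and zero point 0, which is a common choice for a quantized softmax; the actual values are stored in the model's output tensor details and may differ:

```python
# The printed probabilities lie on a fixed grid because the output tensor
# is uint8-quantized. Scale 1/256 and zero point 0 are assumed here for
# illustration; the real parameters come from the model's output metadata.
def dequantize(q, scale=1.0 / 256, zero_point=0):
    return (q - zero_point) * scale

print(f"{dequantize(253):.6f}")  # 0.988281 -- the apple score above
print(f"{dequantize(255):.6f}")  # 0.996094 -- the top score in later logs
```

Under this assumption, the raw uint8 value 253 maps exactly to the 0.988281 printed for the apple, which is why small score differences between CPU and NPU runs show up in discrete steps.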
2.3 Banana Recognition
Photo for test:
2.3.1 Using DEBIX CPU for Banana Recognition
Test result:
Operation log:
root@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples# ./label_image -m mobilenet_v1_1.2_npu_224_fruit.tflite -i banana.bmp
INFO: Loaded model mobilenet_v1_1.2_npu_224_fruit.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 136.19 ms
INFO: 0.996094: 1 banana
2.3.2 Using XNNPACK Delegate for Banana Recognition
Test result:
Operation log:
root@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples# ./label_image -m mobilenet_v1_1.2_npu_224_fruit.tflite -i banana.bmp --use_xnnpack=true
INFO: Loaded model mobilenet_v1_1.2_npu_224_fruit.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
XNNPACK delegate created.
INFO: Applied XNNPACK delegate.
INFO: invoked
INFO: average time: 69.795 ms
INFO: 0.996094: 1 banana
2.3.3 Using DEBIX NPU for Banana Recognition
Test result:
Operation log:
root@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples# ./label_image -m mobilenet_v1_1.2_npu_224_fruit.tflite -i banana.bmp --external_delegate_path=/usr/lib/libvx_delegate.so
INFO: Loaded model mobilenet_v1_1.2_npu_224_fruit.tflite
INFO: resolved reporter
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
EXTERNAL delegate created.
INFO: Applied EXTERNAL delegate.
INFO: invoked
INFO: average time: 3.805 ms
INFO: 0.996094: 1 banana
2.4 Pitaya Recognition
Photo for test:
2.4.1 Using DEBIX CPU for Pitaya Recognition
Test result:
Operation log:
root@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples# ./label_image -m mobilenet_v1_1.2_npu_224_fruit.tflite -i pitaya.bmp
INFO: Loaded model mobilenet_v1_1.2_npu_224_fruit.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: invoked
INFO: average time: 136.404 ms
INFO: 0.996094: 2 pitaya
2.4.2 Using XNNPACK Delegate for Pitaya Recognition
Test result:
Operation log:
root@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples# ./label_image -m mobilenet_v1_1.2_npu_224_fruit.tflite -i pitaya.bmp --use_xnnpack=true
INFO: Loaded model mobilenet_v1_1.2_npu_224_fruit.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
XNNPACK delegate created.
INFO: Applied XNNPACK delegate.
INFO: invoked
INFO: average time: 60.709 ms
INFO: 0.996094: 2 pitaya
2.4.3 Using DEBIX NPU for Pitaya Recognition
Test result:
Operation log:
root@imx8mpevk:/usr/bin/tensorflow-lite-2.9.1/examples# ./label_image -m mobilenet_v1_1.2_npu_224_fruit.tflite -i pitaya.bmp --external_delegate_path=/usr/lib/libvx_delegate.so
INFO: Loaded model mobilenet_v1_1.2_npu_224_fruit.tflite
INFO: resolved reporter
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
EXTERNAL delegate created.
INFO: Applied EXTERNAL delegate.
INFO: invoked
INFO: average time: 3.764 ms
INFO: 0.996094: 2 pitaya
2.5 Conclusion
After testing apple, banana and pitaya recognition on the DEBIX CPU, with the XNNPACK delegate, and on the NPU, we find the recognition rates very stable, and in most cases identical, across the three configurations, all reaching 0.98 or above.
The XNNPACK delegate cuts the runtime roughly in half, while the NPU has by far the strongest acceleration effect, completing each recognition test in just a few milliseconds.
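The conclusion can be quantified from the runtimes measured in the logs above:

```python
# Average runtimes (ms) measured above for the fruit model,
# in order: apple, banana, pitaya.
times = {
    "CPU":     [136.018, 136.19, 136.404],
    "XNNPACK": [67.031, 69.795, 60.709],
    "NPU":     [3.682, 3.805, 3.764],
}
mean = {name: sum(v) / len(v) for name, v in times.items()}
for name in times:
    print(f"{name}: {mean[name]:.1f} ms, {mean['CPU'] / mean[name]:.1f}x vs CPU")
```

On average, the XNNPACK delegate is about 2.1× faster than the plain CPU run, and the NPU is about 36× faster.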