11.3.1 Software Environment
The primary API supported by the NPU is OpenVX 1.2. Figure 1 presents the software stack for the two inference engines, eIQ TF Lite and Arm NN, which currently support NPU acceleration on i.MX 8M Plus.

Figure 1. Software stack for the eIQ TF Lite and Arm NN inference engines (Source: NXP)
Note:
NN RT: common runtime library that connects the inference engines to ovxlib.
ovxlib: a wrapper around the OpenVX driver that exposes the NN functionality.
OpenVX driver: Khronos-defined driver for acceleration of computer vision and NN functionality.
For the purpose of this application note, the following software environment was used:
Yocto BSP release: i.MX 8M Plus Beta 1 release L5.4.24_2.1.0_MX8MP
For details of the eIQ support in the imx-image-full build image, see the i.MX Yocto Project User's Guide (document IMXLXYOCTOUG).
This Yocto BSP release includes TF Lite 2.1.0, which supports hardware acceleration through the Neural Networks API (NNAPI) delegate (a minimal usage sketch is given after the note below).
eIQ TF Lite applications (pre-installed in Yocto images containing eIQ):
TF Lite benchmarking application (/usr/bin/tensorflow-lite-2.1.0/examples/benchmark_model)
TF Lite image classification example (/usr/bin/tensorflow-lite-2.1.0/examples/label_image), used as a starting point and modified to demonstrate the warmup time impact.
Note:
For more details on the benchmark_model and label_image applications, refer to i.MX Linux® User's Guide (document IMXLUG).
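To show how an application such as label_image enables the accelerated path, the following is a minimal sketch of applying the NNAPI delegate through the TF Lite C++ API. It assumes the upstream TensorFlow Lite 2.1 headers and reuses the MobileNet model from the examples in Section 11.3.3; include paths and build details may differ in the eIQ BSP.

#include <memory>
#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  // Load the quantized MobileNet model used throughout this application note.
  auto model = tflite::FlatBufferModel::BuildFromFile(
      "mobilenet_v1_1.0_224_quant.tflite");
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);

  // Create and apply the NNAPI delegate; on i.MX 8M Plus the graph is
  // dispatched to the vsi-npu accelerator reported by benchmark_model.
  tflite::StatefulNnApiDelegate delegate;
  interpreter->ModifyGraphWithDelegate(&delegate);

  interpreter->AllocateTensors();
  // Fill interpreter->typed_input_tensor<uint8_t>(0) with image data here,
  // then run inference.
  interpreter->Invoke();
  return 0;
}

Without the delegate (label_image run without -a 1), the same graph runs entirely on the Cortex-A53 cores using the TF Lite CPU kernels.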
The following table lists the hardware features relevant for the use case described in this application note.
CPU      | 4 x Cortex-A53 @ 1.8 GHz
DDR      | 16/32-bit LPDDR4/DDR4/DDR3L
AI/ML    | NN accelerator, 2.3 TOPS
L2 cache | 512 KB with ECC
NPU
An SRAM of 256 KB is available to the neural network engine. Its main features are:
Provides an intelligent caching mechanism for kernels and input tensors.
Pre-determines the best caching allocation.
Guarantees no cache thrashing.
Stores multiple kernel layers at the same time.
Stores intermediate tensors.
Intermediate tensors are often broken down into smaller tiles to reduce the memory footprint.
For more details, refer to i.MX 8M Plus Applications Processor Family (document IMX8MPLUSFS).
11.3.3 CPU versus NPU Performance
Image Classification:
cd /usr/bin/tensorflow-lite-2.1.0/examples
Run on CPU
./label_image -m mobilenet_v1_1.0_224_quant.tflite -i grace_hopper.bmp -l labels.txt
Results:
INFO: Loaded model mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: invoked
INFO: average time: 47.289 ms
INFO: 0.764706: 653 military uniform
INFO: 0.121569: 907 Windsor tie
INFO: 0.0156863: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit
Run with GPU/NPU acceleration
./label_image -m mobilenet_v1_1.0_224_quant.tflite -i grace_hopper.bmp -l labels.txt -a 1
Results:
INFO: Loaded model mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite delegate for NNAPI.
NNAPI delegate created.
INFO: Applied NNAPI delegate.
INFO: invoked
INFO: average time: 2.807 ms
INFO: 0.768627: 653 military uniform
INFO: 0.105882: 907 Windsor tie
INFO: 0.0196078: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit
This comparison shows that the average inference time is 47.3 ms on the CPU, while with NPU acceleration it drops to 2.8 ms.
Benchmarks:
CPU single core
./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite
Result:
STARTING!
Log parameter values verbosely: [0]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
Loaded model mobilenet_v1_1.0_224_quant.tflite
Going to apply 0 delegates one after another.
The input model file size (MB): 4.27635
Initialized session in 2.282ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=3 first=184806 curr=179480 min=179480 max=184806 avg=181606 std=2303
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=179468 curr=179393 min=179220 max=179672 avg=179464 std=108
Inference timings in us: Init: 2282, First inference: 184806, Warmup (avg): 181606, Inference (avg): 179464
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=4.89062 overall=10.9961
CPU running on 4 cores
./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --num_threads=4
Result:
STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
#threads used for CPU inference: [4]
#threads used for CPU inference: [4]
Loaded model mobilenet_v1_1.0_224_quant.tflite
Going to apply 0 delegates one after another.
The input model file size (MB): 4.27635
Initialized session in 2.228ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=11 first=53805 curr=47928 min=47808 max=53805 avg=48793.2 std=1631
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=138408 curr=47839 min=47692 max=138408 avg=50667.9 std=14264
Inference timings in us: Init: 2228, First inference: 53805, Warmup (avg): 48793.2, Inference (avg): 50667.9
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=4.84375 overall=10.6602
GPU/NPU Acceleration
./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --num_threads=4 --use_nnapi=true
Result:
STARTING!
Log parameter values verbosely: [0]
Num threads: [4]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
#threads used for CPU inference: [4]
#threads used for CPU inference: [4]
Use NNAPI: [1]
NNAPI accelerators available: [vsi-npu]
Loaded model mobilenet_v1_1.0_224_quant.tflite
INFO: Created TensorFlow Lite delegate for NNAPI.
NNAPI delegate created.
Going to apply 1 delegates one after another.
Explicitly applied NNAPI delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 4.27635
Initialized session in 5.111ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=5146489
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=359 first=2797 curr=2717 min=2683 max=2797 avg=2720.98 std=16
Inference timings in us: Init: 5111, First inference: 5146489, Warmup (avg): 5.14649e+06, Inference (avg): 2720.98
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.625 overall=30.1836
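The very long first inference (about 5.1 s) is the NPU warmup, during which the graph is compiled for the accelerator; once warmed up, each inference takes roughly 2.7 ms. The sketch below is a hypothetical helper in the spirit of the modified label_image mentioned in Section 11.3.1: it times the first Invoke() separately from the steady-state average, assuming an interpreter that already has the NNAPI delegate applied and tensors allocated, as in the earlier sketch.

#include <chrono>
#include <cstdio>
#include "tensorflow/lite/interpreter.h"

// Hypothetical helper: separates warmup (first inference) from the
// steady-state average for an interpreter whose tensors are already allocated.
void ReportWarmupImpact(tflite::Interpreter* interpreter, int runs = 50) {
  using clock = std::chrono::steady_clock;
  auto to_us = [](clock::duration d) {
    return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
  };

  // First Invoke(): includes the one-time graph compilation in the NPU driver.
  auto t0 = clock::now();
  interpreter->Invoke();
  long long warmup_us = to_us(clock::now() - t0);

  // Subsequent Invoke() calls: steady-state inference on the accelerator.
  t0 = clock::now();
  for (int i = 0; i < runs; ++i) interpreter->Invoke();
  long long avg_us = to_us(clock::now() - t0) / runs;

  std::printf("Warmup (first inference): %lld us, average inference: %lld us\n",
              warmup_us, avg_us);
}

In the logs above, the CPU-only runs show a first inference close to their averages, so this warmup cost is specific to the accelerated path and is paid once per model load.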
Results Comparison:
Tests                | CPU       | CPU multi-core | NPU
Image classification | 47.289 ms | -              | 2.807 ms
Benchmarks           | 179464 us | 48793.2 us     | 2720.98 us