Neural Processing Unit (NPU) Explained

FREE-SKY (HK) ELECTRONICS CO.,LIMITED / 09-15 18:39

The embedded neural processing unit (NPU) adopts a "data-driven parallel computing" architecture and is particularly well suited to processing massive volumes of multimedia data such as video and images.

Neural Processing Unit (NPU), Artificial Intelligence (AI), and Machine Learning (ML) Explained

Catalog

Ⅰ What is an NPU?

Ⅱ Processor modules of NPU

Ⅲ NPU: the core carrier of cell phone AI

Ⅳ NPU vs. GPU

Ⅴ The characteristics of different processing units

Ⅵ Practical applications of NPU

Ⅶ Explanation of each type of processing unit

 

Ⅰ What is an NPU?

Demand for neural network and machine learning processing is only at the beginning of its explosion. Traditional CPUs and GPUs can handle these tasks, but an NPU optimized specifically for neural networks performs far better. Gradually, such neural network workloads will be taken over by dedicated NPUs.

An NPU (neural processing unit) is a specialized processor for neural network workloads, built around a "data-driven parallel computing" architecture that makes it especially good at processing massive volumes of multimedia data such as video and images.

An NPU is also an integrated circuit, but unlike a single-function application-specific integrated circuit (ASIC), neural network processing is more complex and more flexible. In general, an NPU can be programmed, in software or in hardware, to match the characteristics of the target workload.


The highlight of the NPU is its ability to run many threads in parallel. Hardware-level optimizations take this further, such as easily accessible caches provided for the distinct processing cores. These high-capacity cores are simpler than typical general-purpose processor cores because they do not need to perform many types of tasks. This set of optimizations makes NPUs more efficient, which is why so much R&D is being put into such application-specific chips.

One of the advantages of NPUs is that they devote most of their resources to low-precision arithmetic, novel dataflow architectures, and in-memory computing. Compared with GPUs, they are even more strongly oriented toward throughput over latency.
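To make the low-precision point concrete, here is a minimal NumPy sketch. The symmetric per-tensor quantization scheme is an illustrative assumption, not a description of any particular NPU: an int8 dot product with int32 accumulation closely approximates the float32 result at a fraction of the hardware cost.

```python
import numpy as np

# Illustrative sketch: NPUs favor low-precision arithmetic because an int8
# multiply-accumulate costs far less silicon and energy than a float32 one.
# The symmetric per-tensor quantization below is an assumed scheme.

def quantize_int8(x):
    """Map float32 values to int8 with a single per-tensor scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
a = rng.standard_normal(256).astype(np.float32)
b = rng.standard_normal(256).astype(np.float32)

qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)

# Accumulate in int32, as hardware MAC arrays typically do, then rescale.
int8_dot = int(qa.astype(np.int32) @ qb.astype(np.int32)) * sa * sb
fp32_dot = float(a @ b)

print(f"float32 dot product: {fp32_dot:.4f}")
print(f"int8 approximation:  {int8_dot:.4f}")
```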

Ⅱ Processor modules of NPU

This NPU is designed specifically for IoT AI: it accelerates neural network operations and addresses the inefficiency of traditional chips at such workloads. The processor includes modules for multiplication and addition, activation functions, 2D data operations, and decompression.

The multiplication and addition module computes matrix multiplication and addition, convolution, dot products, and similar functions. There are 64 MACs in the NPU and 32 in the SNPU.
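As a rough sketch of what this module computes, the loop below models a single MAC (multiply-accumulate) unit in Python; matrix multiplication, convolution, and dot products all reduce to this one primitive, which the hardware simply replicates 64 (or 32) times in parallel.

```python
import numpy as np

# Sketch: matrix multiply, convolution, and dot products all reduce to the
# MAC primitive acc += a * b. This loop models one MAC unit; the NPU runs
# 64 of them (32 in the SNPU) in parallel.

def mac_dot(a, b):
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y          # one multiply-accumulate per step
    return acc

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
print(mac_dot(a, b))          # 32.0, identical to np.dot(a, b)
```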


The activation function module implements neural network activation functions using polynomial fitting of up to 12th order, with 6 MACs in the NPU and 3 in the SNPU.
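The NPU's actual fitting procedure and coefficients are not public, but the idea can be sketched with NumPy: fit a polynomial of up to 12th order to an activation function offline, then at runtime evaluate only the polynomial, which again reduces to MACs.

```python
import numpy as np

# Sketch: approximating an activation function (tanh here) with a
# 12th-order polynomial, fitted once offline. The NPU's real fitting
# method and coefficients are not public; this only shows the principle.

x = np.linspace(-4, 4, 1000)
poly = np.polynomial.Polynomial.fit(x, np.tanh(x), deg=12)

# At runtime the hardware only evaluates the polynomial (pure MACs).
max_err = np.abs(poly(x) - np.tanh(x)).max()
print(f"max absolute error over [-4, 4]: {max_err:.5f}")
```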

The 2D data operation module implements operations on a plane, such as downsampling and plane data copying. There is 1 MAC in the NPU and 1 in the SNPU.
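A minimal sketch of the kind of downsampling such a module performs, here 2x2 max pooling over a tiny feature map:

```python
import numpy as np

# Sketch: downsampling as 2x2 max pooling, which halves each spatial
# dimension of a feature map.

fmap = np.arange(16, dtype=np.float32).reshape(4, 4)
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[ 5.  7.]
                #  [13. 15.]]
```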

The decompression module decompresses the weight data. To cope with the limited memory bandwidth of IoT devices, the NPU compiler compresses the network's weights, achieving a 6-10x compression ratio with almost no impact on accuracy.
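The compiler's compression scheme is proprietary, so the sketch below only shows one generic way such ratios can arise: prune near-zero weights, then store the survivors as int8 values plus indices. The threshold and storage format are illustrative assumptions.

```python
import numpy as np

# Sketch: a generic pruning + sparse-storage scheme. The NPU compiler's
# actual algorithm is proprietary; the threshold and format are assumed.

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
w[np.abs(w) < 1.5] = 0.0                        # prune ~87% of weights

nonzero = np.flatnonzero(w)
scale = np.abs(w).max() / 127.0
values = np.round(w[nonzero] / scale).astype(np.int8)   # 1 byte each
indices = nonzero.astype(np.uint16)                     # 2 bytes each

dense_bytes = w.nbytes                          # 4096 * 4 = 16384
packed_bytes = values.nbytes + indices.nbytes
print(f"compression ratio: {dense_bytes / packed_bytes:.1f}x")
```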

Ⅲ NPU: the core carrier of cell phone AI

As we all know, a cell phone cannot operate without its SoC, a chip only the size of a fingernail that nonetheless has all the essential "organs". Its integrated modules work together to support the phone's functions: the CPU is responsible for smooth switching between mobile applications, the GPU supports fast loading of game graphics, and the NPU is specifically responsible for AI computing and AI applications.

The story has to start with Huawei, the first company to put an NPU (neural-network processing unit) in a cell phone and the first to integrate an NPU into a cell phone processor.

In 2017, Huawei launched an NPU with its own architecture. In contrast to traditional scalar and vector computing modes, Huawei's self-developed NPU uses a 3D Cube to accelerate matrix computation. As a result, it processes more data per unit time and delivers more AI compute per unit of power consumption, an order-of-magnitude improvement over traditional CPUs and GPUs with better energy efficiency as well.
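A back-of-the-envelope sketch of why a matrix ("cube") unit wins: multiply-accumulates completed per cycle scale roughly as 1 for a scalar unit, N for an N-lane vector unit, and N^3 for a cube unit. N = 16 below is an assumed width for illustration, not a confirmed specification.

```python
# Sketch: relative throughput of scalar, vector, and matrix ("3D Cube")
# execution. N = 16 is an assumed lane width, used only for illustration.

N = 16
rates = {
    "scalar": 1,        # one MAC per cycle
    "vector": N,        # one N-wide dot-product step per cycle
    "cube":   N ** 3,   # an NxN by NxN matrix tile per cycle
}

total_macs = 10**9      # a billion-MAC workload
for name, macs_per_cycle in rates.items():
    print(f"{name:>6}: {total_macs // macs_per_cycle:>13,} cycles")
```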

Huawei first used Cambricon's NPU in the Mate 10 as an external block. A year later, it integrated Cambricon's NPU IP into the Kirin 980, and a year after that it dropped Cambricon in favor of its own Da Vinci NPU in the Kirin 990.

The NPU in Samsung's Galaxy devices is likewise built into the mobile processor to harness advanced neural networks, bringing a higher level of visual intelligence to the Galaxy S20/S20+/S20 Ultra and Z Flip. The NPU powers the Scene Optimizer, improving the camera's ability to recognize the content of a photo and adjust to settings that suit the subject, more accurately than on previous Galaxy models. It also enables the front-facing camera to blur the background of selfies and create bokeh effects, and it assists Bixby Vision on the device.

Ⅳ NPU vs. GPU

Although the GPU has the advantage in parallel computing capability, it cannot work alone: it needs the CPU as a co-processor, and the construction of neural network models and data streams still happens on the CPU. GPUs also suffer from high power consumption and large size. The higher the performance, the larger, more power-hungry, and more expensive the GPU becomes, which puts it out of reach for small and mobile devices. Hence the NPU: a dedicated chip that is small, low-power, and high in both computational performance and efficiency.

The NPU works by simulating human neurons and synapses at the circuit level and processing large numbers of them directly with a deep learning instruction set, in which one instruction completes the processing of a group of neurons. Compared with CPUs and GPUs, NPUs integrate storage and computation through synaptic weights, which improves operational efficiency.

CPUs and GPUs need thousands of instructions to complete neuron processing that an NPU finishes in one or a few, so the NPU has an obvious advantage in deep learning efficiency. Experimental results have shown NPU performance 118 times that of a GPU at the same power consumption.
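The instruction-count gap can be shown in miniature. Below, an explicit per-neuron loop stands in for CPU-style execution, while one fused NumPy call stands in for the NPU's one-instruction-per-neuron-group model; this is a software analogy, not the actual instruction set.

```python
import numpy as np

# Sketch: per-neuron scalar execution vs. one fused "instruction" for a
# whole neuron group. np.maximum(w @ x, 0) stands in for the NPU's single
# deep learning instruction; the nested loop mirrors CPU-style issue.

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
x = np.ones(256, dtype=np.float32)

# CPU-style: tens of thousands of multiply-add steps, issued one by one.
y_loop = np.empty(256, dtype=np.float32)
for i in range(256):
    acc = 0.0
    for j in range(256):
        acc += w[i, j] * x[j]
    y_loop[i] = max(acc, 0.0)        # ReLU activation

# NPU-style: the whole layer as one fused operation.
y_fused = np.maximum(w @ x, 0.0)

print(np.allclose(y_loop, y_fused, atol=1e-2))   # True
```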

Ⅴ The characteristics of different processing units

  1. CPU -- About 70% of its transistors are used to build caches, with much of the rest in the control unit. It has few computational units, making it suitable for logic and control operations.

  2. GPU -- Most of its transistors build computational units of low individual complexity, making it suitable for large-scale parallel computing. Mainly used in big data, back-end servers, and image processing.

  3. NPU -- Simulates neurons at the circuit level and integrates storage and computation through synaptic weights. One instruction completes the processing of a group of neurons, improving operational efficiency. Mainly used in communications, big data, and image processing.

  4. FPGA -- Programmable logic with high computational efficiency, sitting closer to the underlying I/O. Its logic is editable thanks to redundant transistors and interconnects. Essentially instruction-free and needing no shared memory, it can be more computationally efficient than a CPU or GPU. Mainly used in smartphones, portable mobile devices, and automobiles.

Ⅵ Practical applications of NPU

  • AI scene recognition when taking photos, with picture retouching handled by NPU computing (a dispatch sketch follows this list).

  • The NPU judges the light sources and dark-area detail to synthesize super night-scene shots.

  • Voice assistant operations are realized by the NPU.

  • The NPU, together with GPU Turbo, predicts the next frame so it can be rendered early, improving game smoothness.

  • The NPU predicts touch input to improve tracking and responsiveness.

  • The NPU, with Link Turbo, judges the differing network-speed demands of foreground and background tasks.

  • The NPU judges the game's rendering load and adjusts the resolution intelligently.

  • Power is saved during games by reducing the AI computing load on the NPU.

  • The NPU enables dynamic scheduling of the CPU and GPU.

  • The NPU assists with big-data advertising push.

  • The input method's AI word-association feature is implemented through the NPU.
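As a sketch of how an application reaches the NPU in practice: mobile frameworks usually expose the NPU through a "delegate" that reroutes supported operations to the accelerator. The delegate library name and model file below are hypothetical placeholders; only the TensorFlow Lite interpreter API itself is real.

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

# Hypothetical setup: "libvendor_npu_delegate.so" and the model file are
# placeholders; each vendor ships its own delegate for its NPU.
npu = load_delegate("libvendor_npu_delegate.so")
interpreter = Interpreter(model_path="scene_classifier.tflite",
                          experimental_delegates=[npu])
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Feed one camera frame (shape and dtype come from the model itself).
frame = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()                     # supported ops run on the NPU
scene_scores = interpreter.get_tensor(out["index"])
print(scene_scores.shape)
```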

Ⅶ Explanation of each type of processing unit

APU: Accelerated Processing Unit, AMD's chip combining a CPU with accelerated graphics processing.

BPU: Brain Processing Unit, the embedded processor architecture from Horizon Robotics.

CPU: Central Processing Unit, the core of mainstream PCs.

DPU: Dataflow Processing Unit, an AI architecture proposed by Wave Computing.

FPU: Floating-Point Unit, the floating-point module in general-purpose processors.

GPU: Graphics Processing Unit, a multi-threaded SIMD architecture built for graphics processing.

HPU: Holographic Processing Unit, Microsoft's holographic computing chip, used in the HoloLens.

IPU: Intelligence Processing Unit, the AI processor from Graphcore, a company backed by DeepMind investment.

MPU/MCU: Microprocessor Unit / Microcontroller Unit, RISC-architecture parts generally used for low-compute applications, such as the ARM Cortex-M series.

NPU: Neural-network Processing Unit, a general term for new processors built around neural network algorithms and their acceleration, such as the DianNao series from the Institute of Computing Technology, Chinese Academy of Sciences / Cambricon.

RPU: Radio Processing Unit, Imagination Technologies' design that combines Wi-Fi/Bluetooth/FM and a processor into a single chip.

TPU: Tensor Processing Unit, Google's dedicated processor for accelerating artificial intelligence algorithms. The first generation targeted inference; the second generation also handles training.

VPU: Vision Processing Unit, a dedicated chip for accelerating image processing and artificial intelligence from Movidius, a company acquired by Intel.

WPU: Wearable Processing Unit, Ineda Systems' wearable system-on-chip, integrating a GPU, a MIPS CPU, and other IP.

XPU: an FPGA-based intelligent cloud acceleration chip from Baidu and Xilinx, announced at Hot Chips 2017 and containing 256 cores.

ZPU: Zylin Processing Unit, a 32-bit open source processor from Norway's Zylin.

