EDGE AI POD

Powering Intelligence: Anaflash's Revolutionary AI Microcontroller with Embedded Flash Memory

EDGE AI FOUNDATION

Memory bottlenecks, not computational limitations, are the true barrier holding back Edge AI. This revelation lies at the heart of Anaflash's revolutionary approach to intelligent edge computing – a breakthrough AI microcontroller with embedded flash memory that transforms how we think about power efficiency and cost in smart devices.

The team has engineered a solution that addresses the two fundamental challenges facing Edge AI adoption: power efficiency and cost. Their microcontroller features zero-standby-power weight memory with 4-bit-per-cell embedded flash technology seamlessly integrated with computation resources. Unlike traditional non-volatile memory options that demand extra processing steps and offer limited storage density, this technology requires no additional masks and scales efficiently.

At the core of this innovation is the Near Memory Computing Unit (NMCU), which establishes a tight coupling with flash memory through a wide I/O interface on a single chip. This architecture eliminates the need to fetch data from external memory after booting or waking from deep sleep – a game-changing feature for battery-powered devices. The NMCU's sophisticated three-part design enhances parallel computations while minimizing CPU intervention: control logic manages weight addresses and buffer flow, 16 processing elements share weights through high-bandwidth connections, and a quantization block efficiently converts computational results.

Fabricated using Samsung Foundry's 28nm standard logic process in a compact 4 by 4.5 mm² die, the microcontroller delivers impressive results. Testing with MNIST and Deep Auto Encoder models demonstrates accuracy levels virtually identical to software baselines – over 95% and 0.878 AUC respectively. The overstress-free word-line (WL) driver circuit extends flash cell margins, further enhancing reliability and performance.

Ready to transform your Edge AI applications with technology that combines unprecedented efficiency, performance, and cost-effectiveness? Experience the future of intelligent edge computing with Anaflash's embedded flash microcontroller – where memory and computation unite to power the next generation of smart devices.


Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org

Speaker 1:

Our team at Anaflash has developed an AI microcontroller with embedded flash memory. We will discuss why we designed it and how it benefits Edge AI applications. This is the outline we're going to present today. I will start with a brief background before explaining our design. In particular, I will cover the Near-Memory Computing Unit, which I will refer to as the NMCU, and the word-line driver, which I will refer to as the WL driver, for embedded flash. Finally, I will present the silicon test results and conclude the presentation.

Speaker 1:

To power the intelligent edge, AI is indispensable. However, two key challenges stand in the way: power efficiency and cost. For example, customers want AI models with more computations and higher accuracy, while minimizing the need for frequent battery changes. Every intelligent edge device requires non-volatile memory to store AI parameters and a high-performance yet efficient AI processor. Our technology delivers both, enabling smarter, more sustainable edge computing. Traditional non-volatile memory, such as STT-MRAM, demands extra processing steps, increasing cost, while offering only 1-bit-per-cell capacity. Our solution overcomes these limitations with a seamless, logic-compatible embedded flash memory that requires no additional masks and scales efficiently, delivering a cost-efficient, high-performance alternative.

Speaker 1:

Let's go on to the architecture of the microcontroller. In most edge AI applications, memory bottlenecks are more common than computing bottlenecks. Advanced packaging technology combines GPUs and memory chips in a single package to provide a high-speed interface between them. To tackle these challenges, we present an AI microcontroller with zero-standby-power weight memory featuring 4-bit-per-cell embedded flash technology, seamlessly integrated with multiply-accumulate units. This is why we call it the Near-Memory Computing Unit. The key benefit of this architecture is the tightly coupled, wide I/O interface between the NMCU and flash memory on a single chip, enabling efficient AI processing. Since flash is non-volatile memory, there is no need to fetch data from external memory after booting or waking up from deep sleep. This feature makes it a strong candidate for many edge AI solutions. We implemented all of this using a standard logic process without any additional steps.

Speaker 1:

Next, let's take a closer look at how the NMCU operates. The NMCU, based on embedded flash memory, consists of three main components. First, the control logic manages weight addresses in the embedded flash memory and regulates the input-output buffer flow, allowing MAC operations to be executed in a single instruction without CPU intervention during processing. Second, 16 processing elements enhance parallel computation by sharing weights transmitted from each bank of the eight embedded flash macros through a high-bandwidth I/O interface. Finally, the last-stage block quantizes the 32-bit MAC results into 8-bit activation values to reduce the memory size. The control logic is responsible for controlling the flow of ML operations. It automatically manages the addresses of the embedded flash where the weight parameters are stored, as well as the initiation of read operations, thereby minimizing CPU intervention during ML operations. Additionally, it allows selecting input from the input buffer or the ping-pong buffer, and includes logic for storing the final activation values from the quantization logic into the ping-pong buffer.
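As a mental model of that single-instruction flow, here is a minimal Python sketch, purely illustrative since the chip implements this in hardware: the control logic auto-increments weight addresses and issues flash reads without CPU involvement. The names `flash`, `nmcu_run`, and `pe_mac`, and the address scheme, are all hypothetical.

```python
def nmcu_run(flash, base_addr, n_tiles, inputs, pe_mac):
    """One NMCU 'instruction': loop over weight tiles with no CPU help.
    flash  : maps a weight address to one tile of weights (hypothetical)
    pe_mac : the processing-element MAC operation applied to one tile"""
    outputs = []
    addr = base_addr
    for _ in range(n_tiles):
        weights = flash[addr]                  # control logic issues the read
        outputs.append(pe_mac(inputs, weights))
        addr += 1                              # address auto-increment
    return outputs
```

A host CPU would trigger `nmcu_run` once per layer and then stay idle, which is the point of the design.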

Speaker 1:

Here is the PE cluster, which consists of 16 vector processors called processing elements. Each PE is implemented with digital logic based on an adder tree and can perform an 8-by-1 vector operation in a single cycle. To support partial-sum operations, each PE includes a 32-bit accumulation buffer. The quantization logic is hardware designed to quantize the 32-bit computation results of the PE cluster into 8-bit format. It supports the ReLU activation function along with a 32-bit scale and bias, and it can perform shift operations to extract computation results in 8-bit format. By combining the 32-bit scale and shift operations, fixed-point arithmetic can be implemented. This is how the operations of each block described so far are represented in the time domain.
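The PE and quantization steps just described can be sketched in software as follows. This is an illustrative model, not the chip's RTL: the adder-tree reduction and 32-bit accumulator follow the talk, but the exact ordering of bias, scale, and shift in the quantizer is an assumption.

```python
def adder_tree_dot8(inputs, weights):
    """8-by-1 vector dot product via a 3-level adder tree (8 -> 4 -> 2 -> 1),
    mimicking the single-cycle parallel reduction inside one PE."""
    assert len(inputs) == len(weights) == 8
    level = [a * w for a, w in zip(inputs, weights)]   # 8 parallel multipliers
    while len(level) > 1:                              # 3 levels of adders
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

class PE:
    """Processing element with a 32-bit accumulation buffer for partial sums."""
    def __init__(self):
        self.acc = 0
    def mac(self, inputs, weights):
        self.acc = (self.acc + adder_tree_dot8(inputs, weights)) & 0xFFFFFFFF
        return self.acc

def quantize(acc, scale, bias, shift):
    """Requantize a 32-bit accumulator to an 8-bit activation:
    bias, scale, power-of-two shift (fixed-point rescale), ReLU, clamp.
    The operation order here is an assumed, typical arrangement."""
    x = ((acc + bias) * scale) >> shift
    return min(max(x, 0), 255)
```

Combining a multiplicative scale with a right shift is the standard way to approximate a real-valued rescaling factor in fixed-point arithmetic, which matches the talk's description.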

Speaker 1:

First, after transmitting the input data to the buffer, the computation begins. At this point the weight addresses are sent to each bank of the embedded flash where the weight data is stored. After 30 cycles the data reaches the processing elements. Since the MAC operations can start only when the required weight data arrives, the processing elements do not begin computation during the initial 30 cycles needed for weight fetching. However, given that the program time of embedded flash memory is significantly longer than its read time, making it effectively read-only during inference, we designed the system so that, if the same address has already been fetched, the MAC operations can start immediately. The logic for the MAC unit has been optimized to complete computations within 28 cycles, which, together with the one cycle needed for quantization and another cycle needed for write-back, fits within the 30 cycles for weight fetching.
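The cycle budget above can be captured with a small accounting sketch, assuming (as the talk implies) that each tile's 30 cycles of compute, quantization, and write-back overlap the next tile's 30-cycle weight fetch, so only the first fetch is exposed. The function name and tile abstraction are hypothetical.

```python
# Cycle budget from the talk: 30-cycle flash weight fetch; 28-cycle MAC
# pipeline plus 1 cycle quantization and 1 cycle write-back (= 30 total).
FETCH_CYCLES = 30
COMPUTE_CYCLES = 28
QUANT_CYCLES = 1
WRITEBACK_CYCLES = 1

def total_cycles(n_tiles, weights_already_fetched=False):
    """Estimate cycles to process n weight tiles back to back.
    Only the first fetch is exposed; each subsequent 30-cycle tile of
    compute + quantize + write-back hides under the next weight fetch.
    If the needed address was already fetched, compute starts at once."""
    per_tile = COMPUTE_CYCLES + QUANT_CYCLES + WRITEBACK_CYCLES  # 30
    startup = 0 if weights_already_fetched else FETCH_CYCLES
    return startup + n_tiles * per_tile
```

Under this model the fetch latency is a one-time cost per layer rather than a per-tile cost, which is why the 28-cycle MAC optimization matters: it keeps each tile inside the 30-cycle fetch window.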

Speaker 1:

In this study, the logic was designed to perform 128 MAC operations over 16 cycles. To minimize the memory movement of the input data, we introduced a ping-pong buffer. To start the DNN model's layer 1 calculation, the input buffer is used for the MAC operation and the output value is written back to ping-pong buffer A. In the next step, when performing MAC operations using the ping-pong buffer, the NMCU automatically treats the data in the buffer where the previous result was written back as the new input data and sends it to the processing elements. For layers with input sizes that do not exceed the ping-pong buffer size, computation can begin immediately without additional data transfers, enabling efficient execution. For the next layer's calculation, the ping-pong buffer connections are swapped. This makes it possible to continue MAC operations without stopping.
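The buffer-swapping scheme can be sketched as a pair of alternating buffers: each layer reads from one and writes to the other, then the roles swap, so activations never leave the chip between layers. This is a minimal illustrative model; the class name and list-based storage are assumptions, and the hardware uses fixed-size SRAM buffers.

```python
class PingPongBuffers:
    """Two activation buffers, A and B, with swapping read/write roles."""
    def __init__(self, initial_input):
        self.buf = {"A": list(initial_input), "B": []}
        self.read_name, self.write_name = "A", "B"

    def run_layer(self, layer_fn):
        out = layer_fn(self.buf[self.read_name])   # MAC + quantization
        self.buf[self.write_name] = out            # write-back
        # swap connections for the next layer's calculation
        self.read_name, self.write_name = self.write_name, self.read_name
        return out
```

After layer 1 writes to buffer A, layer 2 automatically reads A and writes B, and so on, matching the uninterrupted MAC flow described above.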

Speaker 1:

I will talk about the overstress-free word-line (WL) driver circuit, published in JSSC 2023. The conventional WL driver developed for embedded flash has an NMOS-only path to charge the WL from the read-voltage supply. Because of the VTH drop and body effect of the NMOS-only path, the verify-read level for state S15, among the 16 QLC embedded flash cell states, has to be much lower than VDDH. This results in a limited embedded flash cell margin. To address this issue, we propose an overstress-free WL driver specifically developed for QLC embedded flash. Differently from the conventional WL driver circuit, we have separate boosted supply lines, such as the VPS1-4 and VPS1-3 lines. Also, a PMOS charging circuit has been added to extend the S15 verify-read reference level up to VDDH. The different levels of VPS and VPP are generated separately by a standard-logic-compatible high-voltage generator. Now let's check the experimental results.

Speaker 1:

Here is the die photograph of our design, measuring 4 by 4.5 square millimeters and fabricated using Samsung Foundry's 28-nanometer standard logic process. It features two blocks of 4-bit-per-cell embedded flash to support the wide I/O interface. Additionally, a tightly coupled NMCU is positioned next to the flash to accelerate parallel computations, while a RISC-V processor handles additional operations, along with SRAM and other components. Compared to other works, this design features unique 4-bit-per-cell non-volatile memory without process overhead. This slide shows the measurement results of the overstress-free WL driver circuit for QLC embedded flash. Measured waveforms show that the selected PWL and WWL lines switch successfully to the different levels from 0.5 to 2.5 volts. This is the retention measurement result of the fabricated microcontroller chip running an AI inference task.

Speaker 1:

Both MNIST and Deep Auto Encoder models were tested. You can see the weight distribution according to the threshold voltage. For the MNIST case, the inference accuracy is over 95%, nearly the same as the software baseline. For the Deep Auto Encoder case, it achieves the same accuracy as the software baseline, 0.878 AUC. We did a benchmark test with the Deep Auto Encoder model from MLPerf Tiny. Our cycle-accurate simulation results show better performance compared to other designs. Let's go on to the summary. Embedded non-volatile memory is essential for smart edge devices to reduce power consumption. Conventional embedded non-volatile memory is expensive and its density is limited, so we propose an AI microcontroller with embedded flash, fabricated in a 28nm standard logic process. Thank you for all the support for this work.