EDGE AI POD

Tomorrow's Edge AI: Cutting-Edge Memory Optimization for Large Language Models with Seonyeong Heo of Kyung Hee University

EDGE AI FOUNDATION

Discover the cutting-edge techniques behind memory optimization for large language models with our guest, Seonyeong Heo from Kyung Hee University. Join us as we unlock the secrets of deploying 7-billion-parameter models on small devices with limited memory. This episode delves into the intricacies of key-value caching in decoder-only transformers, a crucial innovation that reduces computational overhead by efficiently storing and reusing outputs. Seonyeong shares insightful strategies that tackle the high demands of memory management, offering a glimpse into how these models can be made more feasible and energy-efficient.

Our conversation also ventures into the world of dynamic compression methods essential for optimizing memory usage. We unpack the challenges of compressing key-value arrays and explore the merits of techniques like quantization, pruning, and dimensionality reduction with autoencoders. Weighted quantization is highlighted as a standout method for achieving remarkable compression rates with minimal errors, provided it's fine-tuned effectively. This episode is a must-listen for those interested in the future of on-device LLMs, as we underscore the significance of efficient memory management in enhancing their performance, especially in resource-constrained settings. Tune in for this enlightening discussion paving the way for innovative advancements in the field.

Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org

Speaker 1:

Hi Seonyeong, it's your time. Thanks for joining us. The title of your presentation is Memory Optimization for On-Device LLM, which is a great topic. So the floor is yours.

Speaker 2:

Okay, hello, I'm Seonyeong from Kyung Hee University, and I'm very happy to share my recent investigation on memory optimization for on-device LLM. Before starting my presentation, I would like to briefly introduce myself. I'm an assistant professor at Kyung Hee University, which is located in South Korea. Before joining Kyung Hee University, I was a postdoctoral researcher at ETH and a research scholar at Virginia Tech. My research interests include tiny machine learning, especially memory optimization, as well as compiler optimization for machine learning and real-time embedded systems. So our goal is to enable large language models on small computing systems, and we all know large language models are large in size. The most popular models considered for on-device AI are LLMs with 7 billion parameters, which require around seven gigabytes for their model parameters when they are quantized to 8-bit precision. Since small computing systems have limited memory, around four to eight gigabytes, even 7-billion-parameter models hardly fit in their memory. So enabling on-device LLM is like putting a llama inside a refrigerator. However, LLMs actually need much more memory than that, so in the next slide I will explain exactly what they need more memory for. At first, I would like to start with the original transformer. This is the overall structure of the transformer first proposed in the NeurIPS 2017 paper. It basically consists of a pair of an encoder and a decoder, where the encoder extracts the features from the input and the decoder uses the features from both input and output to make a prediction. Unlike the original transformer, many popular generative LLMs, such as GPT and LLaMA, are decoder-only transformers, which directly take the input and make a prediction based on it.

Speaker 2:

In this talk, we will consider decoder-only transformers. Those decoder-only LLMs are often called autoregressive transformers, in that they generate a new token and append the new token to the input to generate the next token after it. So, for example, let's assume that the model is given an initial input, "What is". Then the model processes the input and generates the next token; here the token "tiny" is chosen as the next token. The next token is appended to the input, the model processes the input with the new token, and it then generates the token that comes after the new token. The model repeats this process until it generates the maximum number of tokens or encounters the end-of-text token. This recursive behavior is referred to as autoregressive. This autoregressive behavior requires one forward step per token generation, which means that when the model generates n tokens, the model has to be run n times in total. Therefore autoregressive transformers may incur high computational overhead, and that makes it difficult to deploy large language models on small computing systems.
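
A minimal sketch of the autoregressive loop described above, assuming a hypothetical `model` that maps a token sequence to per-position next-token logits and a hypothetical `tokenizer` with encode/decode and an end-of-text id (illustrative only, not code from the talk):

```python
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 32) -> str:
    tokens = tokenizer.encode(prompt)          # e.g. "What is"
    for _ in range(max_new_tokens):
        logits = model(tokens)                 # one full forward step per new token
        next_token = int(logits[-1].argmax())  # greedy choice of the next token
        if next_token == tokenizer.eos_id:     # stop at the end-of-text token
            break
        tokens.append(next_token)              # append and feed back into the model
    return tokenizer.decode(tokens)
```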

Speaker 2:

Then how do we reduce the high computational overhead of large language models? Nowadays most LLM frameworks support caching the intermediate outputs of the attentions. Looking into the decoder architecture, it consists of a series of multi-head attentions. As the input gradually grows over time, they try to avoid duplicate computations between different forward steps. The basic computation of attention is illustrated in this slide. First, the attention generates the query, key and value arrays by applying separate linear layers to the input, and then it multiplies the query and key arrays and applies softmax to the QK array to compute the attention score. Here softmax is omitted for simplicity, but it is applied to the QK array, and the output array is multiplied with the V array. The key-value cache is introduced to reduce the overhead of obtaining the key and value arrays. As explained earlier, the new token is appended to the input at every forward step. Therefore, we don't have to recalculate the K and V values for the previous tokens. So, instead of generating the whole key and value arrays every time, it stores the K and V values in memory and reuses them for the next forward step. Since the linear layers, which entail matrix multiplication, are generally computationally very expensive, using the KV cache can largely reduce the computational overhead.
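
A minimal NumPy sketch of a single attention head with a KV cache, showing how the K and V of previous tokens are stored and reused instead of recomputed (the weight matrices, shapes and names are illustrative assumptions, not the speaker's implementation):

```python
import numpy as np

def attend_with_kv_cache(x_new, W_q, W_k, W_v, cache):
    """One attention step for the newly appended token only.

    x_new: (d_model,) embedding of the new token.
    cache: dict with growing "k" and "v" arrays of shape (t, d_head).
    """
    q = x_new @ W_q                      # query for the new token
    k = x_new @ W_k                      # key and value are computed once ...
    v = x_new @ W_v
    cache["k"] = np.vstack([cache["k"], k])   # ... then stored for reuse
    cache["v"] = np.vstack([cache["v"], v])

    scores = cache["k"] @ q / np.sqrt(q.shape[-1])   # scores against all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over past tokens
    return weights @ cache["v"]                      # weighted sum of cached values

# Usage: initialise the cache with empty (0, d_head) arrays and call once per new token.
```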

Speaker 2:

The memory usage of the KV cache can be calculated with this formula: KV cache size = 2 × p × (number of layers) × (number of attention heads) × (token dimension) × w. Here 2 is multiplied because it has to store both K and V, and p indicates the number of bits for representing the K and V values. And w denotes the context window size, which is the number of tokens to be used for generating the next token; it is also called the sequence length. To show how large the KV cache is, we measured the memory usage of a couple of large language models on top of the llama.cpp framework. The graph shows the memory usage of each model for storing their weights (the blue bars), the KV cache (the orange bars) and the activations (the yellow bars). It shows that for 7-billion-parameter models the KV cache can be almost as large as the model parameter size. In the case of the Mistral 7-billion-parameter model, the size of the KV cache is 4 gigabytes. That is, if we quantize the model to lower precision, then the memory overhead of the KV cache becomes dominant compared with the other components.
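
As a worked example of that formula, here is a back-of-the-envelope calculation with parameters roughly in the range of a 7B-class model; the exact numbers are illustrative assumptions, not the measurements from the talk:

```python
def kv_cache_bytes(p_bits, n_layers, n_heads, d_head, window):
    # KV cache size = 2 * p * n_layers * n_heads * d_head * w  (in bits)
    return 2 * p_bits * n_layers * n_heads * d_head * window // 8

# Illustrative 7B-class configuration: 32 layers, 32 heads of dimension 128,
# 16-bit K/V values, 4096-token context window.
size = kv_cache_bytes(p_bits=16, n_layers=32, n_heads=32, d_head=128, window=4096)
print(f"{size / 2**30:.1f} GiB")   # 2.0 GiB for this configuration
```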

Speaker 2:

There have been various approaches to reducing the memory overhead of the KV cache. One popular approach is to redesign a model with a different type of attention, as illustrated in the figure. Multi-head attention is the typical type of attention used in transformers. Grouped-query attention and multi-query attention are newly introduced types of attention which can reduce the size of the KV cache by using a smaller number of key and value heads. This grouped-query attention is already applied to the LLaMA 2 model, which is one of the new features of the model.
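
A rough sketch of why grouped-query and multi-query attention shrink the cache: several query heads share one key/value head, so only the key/value heads, not every query head, need caching (the numbers below are illustrative assumptions):

```python
def kv_cache_ratio(n_heads: int, n_kv_heads: int) -> float:
    """KV-cache size relative to standard multi-head attention.

    Multi-head attention:    n_kv_heads == n_heads       (ratio 1.0)
    Grouped-query attention: 1 < n_kv_heads < n_heads
    Multi-query attention:   n_kv_heads == 1
    """
    assert n_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    # Each query head h reads K/V from head h // (n_heads // n_kv_heads),
    # so only n_kv_heads key/value arrays ever enter the cache.
    return n_kv_heads / n_heads

print(kv_cache_ratio(n_heads=32, n_kv_heads=8))  # 0.25, i.e. a 4x smaller KV cache
```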

Speaker 2:

Another approach is to manage the KV cache in a paged fashion. In general the KV cache is implemented as a ring buffer, but it may suffer from internal or external fragmentation when there are multiple users that share the KV cache, which can result in underutilization of the memory space. Therefore this SOSP paper proposes to bring the concept of paging to KV cache management. It swaps the cached values into GPU memory when they are needed. However, this approach assumes a very large system with multiple users and multiple batches, and also a dedicated GPU that has its own memory, so it is not appropriate for optimizing on-device LLMs.
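
A toy sketch of the paging idea: tokens are written into fixed-size blocks allocated on demand and located through a block table (a simplification under my own assumptions, not the actual design from the paper):

```python
class PagedKVCache:
    """Toy paged KV cache: tokens go into fixed-size blocks allocated on demand,
    so memory grows block by block instead of being reserved for the full window."""

    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.blocks = []        # physical blocks, each a list of (k, v) pairs
        self.block_table = []   # logical order of block indices for this sequence

    def append(self, k, v):
        # Allocate a new block only when the current one is full (or none exists yet).
        if not self.block_table or len(self.blocks[self.block_table[-1]]) == self.block_size:
            self.blocks.append([])
            self.block_table.append(len(self.blocks) - 1)
        self.blocks[self.block_table[-1]].append((k, v))

    def all_kv(self):
        # Walk the block table to reassemble the logical K/V sequence.
        return [kv for idx in self.block_table for kv in self.blocks[idx]]
```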

Speaker 2:

To further reduce the memory usage of the KV cache, we tried dynamically compressing KV values before we put them into the cache, so that they consume less memory space. We investigated a variety of compression methods: dynamic quantization, dynamic pruning and dynamic dimensionality reduction. First I would like to discuss the dynamic quantization of KV values. There are three main types of quantization. The first one is linear quantization, which is the most common type; it chooses linearly spaced values for quantization, as illustrated in this graph. The second one is log quantization, where the gap between two values grows on a log scale. And the last one is weighted quantization, which is a more recent quantization method than the others. This method clusters the values to reflect the distribution of the values and then uses the clustering information for quantization. The histogram on the left side illustrates the distribution of KV values for some sample inputs.
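
A small sketch contrasting the three styles on a flat array of values: linear quantization uses evenly spaced levels, log quantization uses levels whose spacing grows exponentially, and weighted quantization derives its levels from a k-means-style clustering of the data (an illustrative reimplementation under those assumptions, not the speaker's code):

```python
import numpy as np

def linear_levels(x, bits):
    # Evenly spaced levels spanning the value range.
    return np.linspace(x.min(), x.max(), 2 ** bits)

def log_levels(x, bits):
    # Levels whose spacing grows exponentially, mirrored around zero.
    m = np.abs(x).max()
    pos = m / (2.0 ** np.arange(2 ** (bits - 1)))
    return np.concatenate([-pos, pos])

def weighted_levels(x, bits, iters=20):
    # k-means-style clustering: levels follow the data distribution.
    centers = np.random.default_rng(0).choice(x, 2 ** bits, replace=False)
    for _ in range(iters):
        assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for c in range(len(centers)):
            if np.any(assign == c):
                centers[c] = x[assign == c].mean()
    return centers

def quantize(x, levels):
    # Snap each value to its nearest level.
    return levels[np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)]
```

With these helpers, something like `quantize(kv, weighted_levels(kv, 4))` would approximate 4-bit weighted quantization of a flattened K or V array.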

Speaker 2:

First, we can see that the KV values have a very wide range, from minus 20 to 20. While convolutional neural network models usually have a much smaller value range, from 0 to 1 or minus 1 to 1, the deviation of the values in KV is relatively large. Also, there are not many small values in KV: we can only see a very thin bar in the histogram around zero. We measured the average quantization error to compare the different quantization methods. The results show that the quantization error decreases when the data precision becomes higher: with 5-bit quantization we obtain a high quantization error, and with 8-bit quantization we obtain a much lower quantization error. When the precision is low, weighted quantization obtains a much lower quantization error than the other methods. However, with higher data precision, weighted quantization does not perform very well. We guess that with high data precision it has to have more clusters for quantization, so maybe it has some difficulty in finding a good clustering from the data.
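
One way to reproduce this kind of comparison is to measure the mean absolute error between the original and quantized values at different bit widths; the snippet below does this for linear quantization on synthetic stand-in data (purely illustrative, not the experiment from the talk):

```python
import numpy as np

def mean_quantization_error(x, levels):
    # Average absolute difference between original and quantized values.
    quantized = levels[np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)]
    return np.mean(np.abs(x - quantized))

# Synthetic stand-in for KV values with a wide value range.
rng = np.random.default_rng(0)
kv = rng.normal(0.0, 6.0, size=4096).clip(-20, 20)

for bits in (5, 8):
    lin = np.linspace(kv.min(), kv.max(), 2 ** bits)
    print(bits, "bits, linear:", round(float(mean_quantization_error(kv, lin)), 4))
```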

Speaker 2:

We also tried unstructured pruning and structured pruning to compress KV values. In unstructured pruning, the values that are smaller than the pruning threshold are pruned individually. In structured pruning, entire tokens with small values are pruned from the token array. Compared to unstructured pruning, structured pruning has an advantage in storing the sparse data because it does not need a complex data structure to record the location of each value. For measuring the compression error we set a target sparsity which gives a similar compression rate to quantization. As expected, structured pruning obtained a much higher compression error than unstructured pruning, but overall both unstructured and structured pruning failed to outperform the quantization methods. As we saw in the previous slide, the KV arrays have very few small values around 0, so pruning may not work very effectively for compressing their values.
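
A small sketch of the two pruning styles on a (tokens × dimension) K or V array, reading "structured" as dropping whole tokens with small magnitude (an illustrative interpretation, not the speaker's code):

```python
import numpy as np

def unstructured_prune(kv, sparsity):
    # Zero out the individually smallest values; their positions must be
    # tracked separately (e.g. with a sparse format) to actually save memory.
    threshold = np.quantile(np.abs(kv), sparsity)
    return np.where(np.abs(kv) < threshold, 0.0, kv)

def structured_prune(kv, sparsity):
    # Drop entire tokens (rows) with the smallest average magnitude, so the
    # result stays a dense, smaller array with no per-value bookkeeping.
    n_keep = max(1, int(round(kv.shape[0] * (1.0 - sparsity))))
    keep = np.argsort(np.abs(kv).mean(axis=1))[-n_keep:]
    return kv[np.sort(keep)]

kv = np.random.default_rng(0).normal(size=(128, 64))
print(unstructured_prune(kv, 0.5).shape, structured_prune(kv, 0.5).shape)
# (128, 64) (64, 64)
```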

Speaker 2:

Lastly, we tried reducing the dimension of KV tokens using an autoencoder. An autoencoder model consists of an encoder and a decoder, where the encoder encodes the input data, lowering its dimension, and the decoder decodes the encoded data to recover the original dimension. We can then reduce the memory usage of the KV cache by storing the encoded data in the middle into the KV cache. Applying the dimensionality reduction dynamically means that we have to run the encoder and decoder in the middle of the model execution. Therefore we have to make sure that the autoencoder is lightweight enough not to incur large computation and memory overheads.
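
A conceptual sketch of the encode-on-write, decode-on-read flow, with the `encoder` and `decoder` left as abstract callables (hypothetical names, not a specific model):

```python
def cache_token(cache, k, v, encoder):
    # Encode K and V to a lower dimension before they enter the cache.
    cache["k"].append(encoder(k))
    cache["v"].append(encoder(v))

def read_cache(cache, decoder):
    # Decode back to the original dimension when attention needs the values.
    keys = [decoder(ek) for ek in cache["k"]]
    values = [decoder(ev) for ev in cache["v"]]
    return keys, values
```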

Speaker 2:

Then we designed a very small autoencoder which applies 1D convolutions to each token of KV, so it can halve the dimension of the token. We can formally define the compression overhead of the autoencoder as follows. Assuming that the encoder consists of two one-dimensional convolution layers, the number of floating-point operations is approximately the token dimension times the kernel size times the number of input and output channels. This compression overhead has to be lower than the overhead of generating the K and V values, because if the compression overhead is higher, then there is no point in compressing the KV values. Considering the compression overhead, we designed a very small autoencoder with three one-dimensional convolution layers and measured the compression error for comparison. Compared to quantization and pruning, the autoencoder suffers from high compression error, although some more optimization may be possible by tuning the hyperparameters of the autoencoder. Here is the comparison table of the dynamic compression methods. In terms of compression error, we saw that weighted quantization can obtain a much lower compression error than the other methods. Conversely, dynamic pruning obtained a relatively higher compression error because, as we saw in the histogram, the original sparsity of the KV arrays is not that high.
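
As a rough worked example of that overhead estimate, with an illustrative token dimension, kernel size and channel counts (my own assumptions, not the talk's actual autoencoder):

```python
def conv1d_flops(token_dim, kernel_size, in_channels, out_channels):
    # Approximate multiply-accumulates for one 1-D convolution over a token:
    # token_dim * kernel_size * in_channels * out_channels.
    return token_dim * kernel_size * in_channels * out_channels

# Hypothetical two-layer encoder halving a 128-dimensional token.
encode_flops = (conv1d_flops(128, 3, 1, 4)      # 128 values -> 4 feature maps
                + conv1d_flops(64, 3, 4, 1))    # downsampled 64 -> encoded token
print(encode_flops)  # 2304, to be compared against the cost of regenerating K/V
```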

Speaker 2:

And then, regarding the computational overhead, dynamic quantization incurs very low overhead because it only has to find the nearest value among the possible values. Dynamic pruning requires some computational overhead for processing the sparse data, because it has to format the sparse data with a sparse data format such as compressed sparse row, so it incurs some overhead for handling that format. And dynamic dimensionality reduction incurs more overhead because it has to encode and decode the data in the middle of the execution. Next, the memory overhead of dynamic quantization is almost none, because we only need to store a few quantization parameters.
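
To make the sparse-format bookkeeping concrete, here is a minimal example that stores an unstructured-pruned array in compressed sparse row form using SciPy (purely illustrative; the talk does not specify the implementation):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Besides the surviving values, CSR must keep column indices and row pointers,
# which is the bookkeeping overhead mentioned above.
kv = np.random.default_rng(0).normal(size=(8, 16))
kv[np.abs(kv) < 1.0] = 0.0                  # unstructured pruning by threshold
sparse_kv = csr_matrix(kv)
print(sparse_kv.data.nbytes, sparse_kv.indices.nbytes, sparse_kv.indptr.nbytes)
```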

Speaker 2:

In terms of compression rate control, dynamic pruning allows fine-grained control of sparsity by changing the pruning threshold, and it also does not need fine-tuning because it only prunes values that are less than the threshold. However, dynamic quantization, especially weighted quantization, needs fine-tuning to find the optimal clustering for the KV values. So, in conclusion, the KV cache is introduced to reduce the computational overhead of autoregressive transformers. However, the KV cache can be as large as the model size, which is not ideal for small computing systems with limited memory. This talk explored various methods for reducing the memory overhead of the KV cache for on-device LLMs, and showed that weighted quantization can obtain a high compression rate with small errors compared with the other dynamic compression methods. This is the end of my talk. Thank you for your attention, and I will be happy to take any questions.

Speaker 1:

Thank you so much for your great talk, Seonyeong. You know, optimizing the memory for LLMs and gen AI is a super important topic, especially with respect to energy efficiency, even if I'm not a great expert. I would like to let you comment a bit on two aspects. One is about the KV cache and, in particular, the accesses. During your studies, did you observe correlations that can be exploited in both the accesses and the usage of the cache?

Speaker 2:

What do you mean by the correlation?

Speaker 1:

The correlation between accesses over time, or even the values which are accessed.

Speaker 2:

Ah, so you mean correlation over time of the addresses generated?

Speaker 1:

For example, during the operation of the LLM, you tend to access a part of the cache which is very close to another part of the cache, because of the natural, let's say, operations of the LLM. Whether there are patterns which are privileged.

Speaker 2:

Yeah, the KV cache is not a hardware cache, so it's stored in memory. It is actually accessed sequentially when the model generates its tokens, so I think the access pattern of the KV cache is quite regular.

Speaker 1:

So the accesses can even be predicted ahead of time?

Speaker 2:

Yes. So it can accelerate the memory accesses if we can prefetch the KV cache for future processing.

Speaker 1:

There is a question from Chris. Do you have an example of compression on a common LLM?

Speaker 2:

Compression on a common LLM.

Speaker 1:

Yeah, maybe models from Hugging Face on which you tested, more or less extensively, the effect of compression with the techniques you mentioned during the presentation.

Speaker 2:

Yeah, I tested the dynamic compression methods with the LLaMA 2 7-billion-parameter model. I am considering opening my code to the public so that you can easily apply the dynamic compression to any of your large language models.

Speaker 1:

That would be very interesting for the community, I guess. When do you plan to do that? In the near future?

Speaker 2:

In the near future, maybe before 2024 ends.

Speaker 1:

Okay, thank you. In terms of implementation, did you also study the KV cache with respect to in-memory computing? Are there operations which could be nicely accelerated within the memory?

Speaker 2:

Okay. Dynamic quantization, dynamic pruning?

Speaker 1:

You know, the type of operations that could have a place in some hardware tightly coupled with or inserted within the memory itself, even if it's large.

Speaker 2:

That's a good point, because the KV cache is for reducing the overhead of computing the K and V values. So I think if the dimension is very large, then it can be processed with in-memory computing, without heavy memory accesses between the CPU and memory. So I think there are many opportunities for processing very large data inside memory. As far as I know, there are many approaches to accelerating matrix and vector multiplication with processing-in-memory. And also, I think the dynamic compression can be done inside the memory if we need to reduce the computational overhead even more.

Speaker 1:

I'm trying, you know, to look ahead. Do you think there are opportunities, maybe for other types of LLMs, to push the dynamic quantization to the extreme? Which means one bit, one bit per sample, per data point.

Speaker 2:

Because I tested 5-bit quantization, and it already suffered from a large quantization error even at five bits. I think for the weights we can apply lower-precision quantization, but for activations it is more difficult to apply quantization because of the range of the data: it depends on the input data, so it is more difficult to predict the characteristics of the data. Yeah, I want to investigate the data distribution more to optimize the quantization method, because I think maybe the weighted quantization is still suboptimal for large language models. So I think that is the next step to push the research forward.

Speaker 1:

Thank you. Another question from Chen Liu. I just ask Arniban for one further minute of patience, please. Chen Liu is asking: how do people manage the memory of the KV cache in hardware like a GPU? For example, do they reserve a maximum token size and update the KV cache token by token, or allocate a new memory section for each new token of the cache? That's the last question for you.

Speaker 2:

Yeah, that's a good question. I measured the memory usage on top of the llama.cpp framework. In the case of that framework, it allocates the KV cache for the maximum token size, so it reserves a large space at the beginning and gradually puts tokens inside the KV cache. And I think the paper I introduced in the presentation wants to move from that fixed allocation to paged allocation to reduce internal memory fragmentation.

Speaker 1:

Thank you so much, Seonyeong. Your time and your contribution today were really appreciated. Thanks a lot also for your availability to answer some questions. Really appreciated. Bye.