Technical Insights
June 4, 2025
LLM Architecture Design and Algorithm Optimization for Edge AI
On May 24, at the AICon Global AI Development & Application Conference's forum on "LLM Architecture Innovation & Edge AI Application," RockAI CEO Liu Fanping delivered a keynote speech titled "LLM Architecture Design and Algorithm Optimization for Edge AI," analyzing the key issues facing edge AI. He highlighted the current mainstream design directions for edge large model architectures and foundational algorithm improvements tailored for edge-side scenarios, while also looking ahead to the development trends of edge AI towards collective intelligence.
The Rise and Challenges of Edge AI
I am pleased to have this closed-door meeting today to discuss and share insights on edge large models and architectural innovations. I am Liu Fanping, the producer of this forum and CEO of RockAI. RockAI is a leading non-transformer architecture large model manufacturer in China, empowering various industries.
First, I would like to talk about the original intention behind planning this thematic event. When the organizing committee invited me, I was actually quite hesitant, but then I decided this kind of exchange was genuinely necessary: the industry's understanding of edge large models has deviated, sometimes severely, and we believe edge AI today has drifted away from its original concept.
Regarding edge AI, I will discuss it from the software perspective, while other speakers will approach it from the hardware side later. This is because edge computing has a characteristic of combining software and hardware, and the model architecture design and algorithm improvement are more software-oriented, which is also the main focus of my research.
Many people currently view the evolution from cloud AI to edge AI as a continuous process: the cloud has developed well, so it gradually extends to the edge. But is this true? Has the full development of the cloud spurred the emergence of edge AI? Or have the limitations of cloud AI driven research on edge AI? Or are edge AI and cloud AI actually two quite independent technological directions? This question is worth reconsidering.
If we do not see the essence behind this question, it is difficult to predict the direction of technological evolution. For example, if we simply believe that cloud AI has matured and therefore develop toward the edge, we may fall into a misconception: that the cloud model is so good it can be deployed and run on the edge simply by scaling it down. Does edge AI then become a mere concept? It should not.
Thus, we need to re-examine: Why focus on edge AI? What is the technological direction for edge AI? Based on these reflections, we have defined edge large models, which may differ from everyone's understanding.
In my opinion, there is a tendency to underestimate edge-side models in the industry. Many practitioners simply equate edge-side large models with smaller parameter versions of cloud large models. When it comes to the cloud, we talk about tens of billions, even hundreds of billions (where "B" refers to "billion parameters"), but when discussing edge-side large models, it becomes only 1B, 2B, 3B, or even smaller parameter counts. Is this the correct understanding of edge-side large models? Clearly not. The edge-side large models we define must be capable of private deployment on terminal devices, with their core capability being multimodal perception that enables autonomous learning and memory. Currently, many edge-side models never mention autonomous learning and memory; if the purpose of edge-side deployment is merely to load a smaller parameter version of a model, that is unreasonable.
The edge AI we hope to develop can provide personalized services for each individual while ensuring privacy and security, with the core being autonomous learning and memory capabilities. Taking mobile phones as an example, do the models currently deployed on mobile phones possess autonomous learning and memory capabilities? No, essentially, they are just a smaller parameter version of the cloud-based large model, or a product of the edge-cloud integration, and not a true edge model in the real sense.
It is quite challenging to further achieve autonomous learning and memory capabilities while also meeting the requirements of low latency, data privacy, offline availability, cost reduction, and personalization. Currently, even cloud large models cannot accomplish this. Why? Because of the transformer architecture. For instance, ChatGPT recently introduced memory capabilities based on RAG technology: it records user characteristics during conversations (such as the user living in Shanghai and preferring sweet flavors) and incorporates them into prompts, then relies on a long-context mechanism to achieve what is referred to as memory. Nevertheless, this approach to memory has significant flaws, because it cannot guarantee the accuracy of memory retrieval: it may pick the right memory, or the wrong one.
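To make the "plugin" nature of this kind of memory concrete, here is a minimal Python sketch of prompt-injected memory (an illustrative assumption, not ChatGPT's actual implementation): facts live outside the model and are retrieved into the prompt, so the model's parameters never change, and retrieval can select the wrong fact.

```python
# Minimal sketch of "plugin" memory: user facts are stored outside the model
# and injected into the prompt via retrieval; the weights are never updated.

class PromptMemory:
    def __init__(self):
        self.facts = []  # e.g. "the user lives in Shanghai"

    def remember(self, fact: str):
        self.facts.append(fact)

    def retrieve(self, query: str, k: int = 2):
        # naive keyword-overlap scoring; real systems use embedding similarity,
        # but the failure mode is the same: retrieval may pick the wrong memory
        q = set(query.lower().split())
        scored = sorted(self.facts,
                        key=lambda f: len(q & set(f.lower().split())),
                        reverse=True)
        return scored[:k]

    def build_prompt(self, query: str) -> str:
        context = "\n".join(self.retrieve(query))
        return f"Known user facts:\n{context}\n\nUser: {query}\nAssistant:"

memory = PromptMemory()
memory.remember("the user lives in Shanghai")
memory.remember("the user prefers sweet flavors")
print(memory.build_prompt("recommend a dessert shop near me"))
```

Because memory lives entirely in the prompt, nothing persists inside the model itself, which is exactly the limitation the speaker contrasts with learning and memory stored in the parameters.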
How to achieve autonomous learning and memory capabilities on the edge side? We should integrate these capabilities into the parameters rather than having them exist as "plugins." Nowadays, in the domestic edge model field, apart from RockAI, it is difficult for other manufacturers to achieve this, but it is a crucial step for edge AI's development.
As the scale of the edge AI industry continues to expand, do consumers only expect a smaller parameter version of the cloud large model? Certainly not. Practice shows that deploying a smaller parameter version of a cloud-based large model on mobile phones is actually not very effective. Take a simple example: how long do users keep using today's large models continuously? A week, two weeks, three weeks, a month... very short. For most people, the model is just a tool. In that case, what everyone wants is not just a smaller parameter version of a large model. Beyond basic needs such as privacy and security, there is a greater desire for edge-side devices to possess true autonomous learning and memory capabilities, just as humans shape their sense of identity through memory and continuously grow through learning. That is the most meaningful aspect.
The current edge models face multiple challenges such as energy consumption of computing resources, real-time requirements, memory loss, data privacy, and the inability to learn autonomously. However, I think these issues are not daunting, as all problems arise from contradictions and are resolved within them. It is okay that the current models do not possess autonomous learning capabilities; they will progress step by step.
Traditional cloud-based large models are hard to deploy on the edge side, with computing power, memory, runtime latency, and power consumption being typical issues. Simply pruning and quantizing cloud-based large models for edge deployment is not a good solution.
What will large models become once they achieve autonomous learning and memory capabilities? "Make Every Device Its Own Intelligence" has been our vision since we started the company and has accompanied us for many years. Why emphasize "its own intelligence"? Because true intelligence must be one's own: human intelligence exists in the brain, each person's brain is different, and so is their intelligence.
Why must cloud-based large models dominate the physical world instead of edge-side devices? Let's make a vivid analogy: every person can be seen as a "terminal," and the brain is the model. Will this model be exactly the same as anyone else's? Of course not. That is why we hope to "Make Every Device Its Own Intelligence." It is worth noting that OpenAI has recently proposed the idea of deploying cloud models to devices, which is actually aligned with our vision.
I want to emphasize that all models will ultimately return to devices. If AI does not target physical devices, how much of it can survive merely on smartphones? Even if it does, it remains merely a tool. When every device has its own intelligence, we will welcome a new "iPhone moment." By then, new devices may emerge, and new intelligence will create new products that iterate on today's smartphones and PCs.
There is a fatal flaw in simply transplanting existing models onto smartphones and PCs: it merely supplements application scenarios on existing hardware, while intelligence keeps advancing and hardware stagnates. This leads to an irreconcilable contradiction: either we cannot do more on existing devices, or we can, but it is not the kind of true breakthrough that changes the future.
Architecture Design Directions for Edge-Side Large Models
The basic principles of traditional model optimization are: less computation, faster computation, and energy saving. For instance, improvements in activation functions, data sparsity, model fine-tuning, robustness enhancement, and alternatives to attention mechanisms. Regarding alternatives to attention mechanisms, I researched Transformers relatively early, when systematic studies on Attention had not taken shape, and some papers were introducing related content. We completed practical explorations of linear Attention a long time ago, but the actual results were not as expected. Nowadays, many manufacturers are still improving the attention mechanism.
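As a concrete illustration of the linear Attention direction mentioned above, here is a minimal sketch following the kernel-feature-map idea of Katharopoulos et al. (my own illustration, not RockAI's exploration): reordering the matrix products removes the quadratic score matrix, so cost grows linearly with sequence length.

```python
import torch

def softmax_attention(q, k, v):
    # standard attention: the (n x n) score matrix makes cost quadratic in n
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # kernelized variant with feature map elu(x)+1: computing phi(K)^T V first
    # gives a (d x d) summary, so the cost is linear in sequence length n
    phi_q = torch.nn.functional.elu(q) + 1
    phi_k = torch.nn.functional.elu(k) + 1
    kv = phi_k.transpose(-2, -1) @ v                                    # (d, d)
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps  # (n, 1)
    return (phi_q @ kv) / z

n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```

The two functions are not numerically equivalent; the kernel trick trades the exact softmax weighting for linear cost, which is also why, as the speaker notes, results in practice did not always meet expectations.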
It is certainly beneficial to study these optimization methods carefully, such as the improvement strategies for linear Attention. But if we step back from these methods, is it possible we have been making a mistake? It is like patching a torn garment: it may seem increasingly perfect, but it has actually become somewhat unrecognizable. This reflects a crucial point: we have overlooked the fact that technological progress lies not only in these technical details but also at the macro level. So why does almost no one point out that the transformer architecture itself has problems? RockAI has, and we have improved the architecture.
Many of you may know the underlying optimization techniques of the transformer. In practice, have you thought about what these optimizations actually target? Do they really improve the transformer? If you think deeply, you will find many issues worth discussing.
There is a common assumption in the industry that "large models" should run on "small devices," and the mainstream solutions are nothing more than the trilogy of pruning, quantization, and distillation. These technologies are not new; before the rise of deep learning, they already existed in the era of traditional multilayer perceptrons. Hinton's papers from the early 2000s, and even research from the 1990s, discussed distillation techniques. Nowadays, the threshold for these technologies has dropped significantly, and common recruitment requirements such as "pruning experience" and "quantization ability" can be met through systematic learning.
I believe this situation is worth discussing. Why must models undergo pruning, quantization, and distillation? Is it to balance accuracy and memory? Where exactly is the problem, and why must FP32 be converted to FP16 or even lower precision? The issue actually lies within the model itself.
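To make the FP32-to-lower-precision trade-off concrete, here is a minimal sketch of symmetric post-training quantization for a single weight tensor (a deliberately simplified illustration; production toolchains use per-channel scales, calibration data, and more):

```python
import torch

# FP32 weights are mapped to int8 plus one scale, trading precision for memory.
w_fp32 = torch.randn(4096, 4096)                    # ~64 MB in FP32

scale = w_fp32.abs().max() / 127                    # one scale for the whole tensor
w_int8 = torch.clamp((w_fp32 / scale).round(), -127, 127).to(torch.int8)  # ~16 MB
w_dequant = w_int8.float() * scale                  # what the kernel "sees" at run time

print("mean absolute error:", (w_fp32 - w_dequant).abs().mean().item())
print("memory ratio:", w_int8.element_size() / w_fp32.element_size())     # 0.25
```

The memory shrinks by 4x, but every weight is now an approximation, which is exactly the accuracy-versus-memory tension the speaker is questioning.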
The essence of AI is the study of the human brain, and when the brain processes information, it does not perform operations like pruning, quantization, or distillation. What truly limits us is the algorithm. Among algorithms, data, and computing power, computing power is a core factor, and so is data, but the most important is the algorithm. Because our algorithms have not yet approached anything like the intelligent processes of the human brain, we must put a lot of effort into improving them.
There are various directions for lightweight design, including lightweight edge-side attempts with MobileNet and the transformer, neural architecture search, and the fusion of CNN and transformer architectures. These attempts can serve as individual pieces of work or be applied in business-scenario solutions, and the fusion of Mamba and transformer architectures also reflects this trend. The RockAI model differs from these directions, because we believe this kind of fusion is not how the human brain itself works.
Operator fusion is relatively mature, so there's no need to talk more about it. The mixture of experts (MoE) mechanism is a widely used technical direction today, which was actually proposed in the 1991 paper "Adaptive Mixtures of Local Experts." Hinton's papers in the early 2000s also discussed distillation techniques and the MoE architecture; this approach is not novel, but it has gained acceptance because it provides greater value in the field of large models. One can find theoretical roots for many AI-related technologies and design ideas in historical research; it is not that someone suddenly created something.
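For readers less familiar with the mechanism, here is a minimal sketch of a sparse, top-k gated mixture-of-experts layer (a simplified illustration of the general idea, not any particular production design):

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Sparse MoE sketch: a gate scores the experts and only the top-k
    experts are evaluated for each input, so most parameters stay idle."""
    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)
        self.k = k

    def forward(self, x):                                      # x: (batch, dim)
        weights = torch.softmax(self.gate(x), dim=-1)          # (batch, n_experts)
        topw, topi = weights.topk(self.k, dim=-1)              # keep only k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in topi[:, slot].unique().tolist():          # route inputs to expert e
                mask = topi[:, slot] == e
                out[mask] += topw[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(16, 64)).shape)                          # torch.Size([16, 64])
```

The experts themselves are fixed modules; the gate only chooses among them, which is the contrast the next paragraphs draw with a dynamically formed network.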
The technologies mentioned above are not particularly difficult; with time and effort, one can achieve results. The real challenge lies in the architecture design itself. Next, I would like to share with you the mature application and commercialized model developed by our team, and that is RockAI's Yan architecture large model. Many vendors' models have not been commercialized and remain at the conceptual and promotional level, while our large model has already been commercialized and is currently advancing mass production cooperation with downstream clients.
The Yan architecture large model is based on two fundamental principles. The first is MCSD, which achieves efficient feature extraction; this module completely replaces the Attention mechanism of the transformer architecture. The replacement process is not complex, and there are many current studies on linear Attention.
The other fundamental principle is the brain-like activation mechanism, which we have studied for a long time. For example, when thinking about problems, the logical areas of the brain are heavily activated, while they are suppressed during rest; when watching television, the visual cortex is heavily activated, processing a lot of information. Neuroscience has long studied that not all neurons in the human brain are activated simultaneously when processing information.
One of the authors of "Attention Is All You Need" once pointed out at the NVIDIA GTC conference that even simple questions like "1+1=?" require massive parameters to compute, which is unreasonable. When humans process such questions, they selectively activate neurons. For instance, when watching a horror movie, one might keep their eyes open and cover their ears; the visual stimuli evoke fear in the brain. If they instead close their eyes and listen to the sounds, the screams also evoke fear, activating the same group of neurons. This indicates that different modalities, or different sensory channels, entering the brain may be processed in the same area.
The MoE architecture cannot achieve this; it is quite sparse. The Yan architecture simulates the working mechanism of the human brain: all neurons are randomly initialized at the start, and the network dynamically selects which neurons participate in processing information, forming a dynamic network structure to compute "1+1=?" and "2+2=?". This significantly reduces the amount of computation, and especially the number of parameters involved, since it does not require activating all neurons.
Let me offer a metaphor: there are many bridges from Puxi to Pudong. MoE selects a suitable bridge from the existing ones to cross, while Yan decides which neurons participate based on actual needs, dynamically assembling a suitable vessel for the crossing; the more complex the task, the more neurons participate, and vice versa.
This dynamic neuron scheduling greatly reduces computing power requirements, and it can activate the same neurons to handle tasks from different modalities, avoiding complex modality alignment. Based on this, our Yan architecture multimodal large model can be deployed and run on a Raspberry Pi, which the transformer architecture cannot achieve.
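Yan's internals are not public, so purely as an illustration of the general idea of input-dependent neuron selection (a hypothetical sketch of my own, not RockAI's implementation), one could imagine a router picking k neurons per input, with k growing as the task becomes more complex:

```python
import torch
import torch.nn as nn

class DynamicNeuronLayer(nn.Module):
    """Illustrative only: an input-dependent router selects a small subset of
    hidden neurons, and only the selected neurons contribute to the output."""
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.router = nn.Linear(dim, hidden)   # scores each hidden neuron for this input
        self.w_in = nn.Linear(dim, hidden)
        self.w_out = nn.Linear(hidden, dim)

    def forward(self, x, k=64):                # k could scale with task complexity
        scores = self.router(x)                                  # (batch, hidden)
        topk = scores.topk(k, dim=-1).indices
        mask = torch.zeros_like(scores).scatter_(1, topk, 1.0)   # 1 for selected neurons
        h = torch.relu(self.w_in(x)) * mask                      # unselected neurons stay silent
        return self.w_out(h)

layer = DynamicNeuronLayer()
print(layer(torch.randn(8, 256), k=32).shape)  # torch.Size([8, 256])
```

Note that this naive version still computes the full matrix multiply and then masks it; real compute savings require gathering only the selected weights, which is where the engineering difficulty of such an approach would lie.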
We said long ago that "the transformer architecture is like a gasoline car, while the new architecture is like a new energy vehicle." Today, the non-transformer approach has gradually become a consensus. This does not deny the excellence of the transformer; it has shown advantages in many fields, but it also has many limitations when solving practical problems. When we first proposed the non-transformer approach, many people were confused, amazed by the excellence of the transformer and questioning why we would abandon it. But there have been many such moments in the history of AI. Remember the era when SVM (Support Vector Machine) was all the rage? It was almost the default solution for every problem. Later, new technologies such as RNNs, CNNs, LSTMs, and attention mechanisms emerged in the field of deep learning.
Every major technological breakthrough receives widespread recognition. As practitioners, we are willing and eager to embrace new technologies, but at the same time, we must maintain critical thinking. We need to recognize its advantages, analyze its shortcomings, and discuss how to improve it. If the problem cannot be solved, we need to redesign the architecture.
For all models in the field of AI, we must maintain critical thinking. We are not denying the technology; rather, we recognize that current models may not necessarily represent the ultimate form of AGI. Since 2023, we have been researching non-transformer architectures, and this year many people have begun suggesting that transformers are not the path to AGI. In fact, we should have recognized this issue from the beginning and not been constrained by the transformer. Once we fall into a technological dependence on the transformer architecture, our thinking becomes limited, and we mistakenly view what we are doing as universal when in fact it is not.
Basic Algorithm Improvement Ideas for Edge-Side Scenarios
Who is really holding back edge-side inference? We, as algorithm developers, would never say it is a chip problem. I started researching large models early on and was very familiar with the BERT model when it came out in 2018. Throughout our long exploration, we have kept asking: who is really holding us back? Backpropagation. Many people might be surprised that Liu Fanping from RockAI claims backpropagation is the biggest problem at this stage.
Backpropagation is a classic algorithm proposed by Turing Award and Nobel Prize winner Geoffrey Hinton; how could there be a problem? Everyone knows that backpropagation is used during the training phase and doesn't involve the inference phase. Forward propagation computes the results, while backpropagation calculates gradients based on errors, updating parameters through chain rule differentiation. It seems that not many in the AI field question backpropagation, but we have been questioning it for a long time and are researching improvements to backpropagation.
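To make the forward/backward split concrete, here is a tiny worked example (my own illustration): the forward pass computes the result, and the backward pass applies the chain rule to obtain the gradients that update the parameters.

```python
import torch

# Two parameters, one input, one target: forward computes the loss,
# backward applies the chain rule to get dloss/dw1 and dloss/dw2.
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-1.5, requires_grad=True)
x, target = torch.tensor(2.0), torch.tensor(1.0)

h = torch.tanh(w1 * x)           # forward pass
y = w2 * h
loss = (y - target) ** 2

loss.backward()                  # backward pass via the chain rule

# the same gradients written out by hand
dy = 2 * (y - target)                                         # dloss/dy
print(torch.isclose(w2.grad, dy * h))                         # dloss/dw2 = dy * h
print(torch.isclose(w1.grad, dy * w2 * (1 - h ** 2) * x))     # dloss/dw1, chained through tanh
```

Every gradient must flow backward through the whole computation before any parameter can change, which is the cost the speaker keeps returning to.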
This research is quite niche; even many deep learning practitioners have not deeply studied the backpropagation process. For example, in CNNs or traditional MLP models, many may not fully understand backpropagation. I believe backpropagation is currently the biggest obstacle for large models; it significantly affects their development. Once you delve into the underlying technology, you will have mixed feelings about backpropagation: you love it for its effectiveness, but you also hate it for being too slow and taking too long to compute.
For the same task, traditional optimization methods such as gradient descent and quasi-Newton methods (e.g., L-BFGS) differ: gradient descent is based on first-order derivatives, while quasi-Newton methods approximate second-order derivatives (Newton's method uses them directly), leading to different parameter-update trajectories. On the MNIST handwritten-digit dataset, the training curves of the different algorithms show significant differences (as shown on the slide).
From this perspective, second-order methods actually converge more efficiently than first-order ones, shortening training time. Their drawback is the need to compute or approximate the Hessian matrix, which consumes a lot of computing power, making it hard to strike a good balance between efficiency and compute. That is why, among Newton's method, quasi-Newton methods, and gradient descent, gradient descent remains the most widely used in industry. In PyTorch, quasi-Newton methods are well packaged, yet few people actually use them, essentially because of the cost of the backpropagation process.
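As a small, self-contained illustration of this trade-off (a toy least-squares problem rather than MNIST), PyTorch's built-in torch.optim.LBFGS can be compared against SGD: the quasi-Newton method typically reaches a lower loss in far fewer steps, at a higher cost per step.

```python
import torch

torch.manual_seed(0)
A, b = torch.randn(100, 10), torch.randn(100)   # toy least-squares problem

def run(opt_name):
    x = torch.zeros(10, requires_grad=True)
    if opt_name == "sgd":
        opt = torch.optim.SGD([x], lr=0.01)                 # first-order updates
    else:
        opt = torch.optim.LBFGS([x], max_iter=5)            # quasi-Newton updates
    for _ in range(20):
        def closure():
            opt.zero_grad()
            loss = ((A @ x - b) ** 2).mean()
            loss.backward()
            return loss
        opt.step(closure)       # both optimizers accept a closure
    return ((A @ x - b) ** 2).mean().item()

print("SGD final loss:  ", run("sgd"))
print("LBFGS final loss:", run("lbfgs"))
```

Note that both optimizers still rely on backpropagation inside the closure; the difference is only in how the gradients are used, which is why the speaker treats backpropagation itself as the deeper bottleneck.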
I mentioned earlier that the backpropagation algorithm was proposed by Hinton, but Hinton himself has also questioned it. Many people may not have noticed that he later proposed the Forward-Forward algorithm in a paper. It has certain theoretical flaws, so although its results are acceptable, it has not been widely applied. Backpropagation sends information forward and then carries the errors back; the Forward-Forward algorithm replaces one forward and one backward pass with two forward passes, using the idea of contrastive learning for its computation. Hinton also discusses the problems of backpropagation in that paper; take a look if you are interested.
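A rough sketch of the Forward-Forward idea, simplified from Hinton's paper: each layer is trained locally to push its "goodness" (the sum of squared activations) above a threshold for positive data and below it for negative data, with no error signal propagated between layers (details such as how negative data is constructed are omitted here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One layer trained locally: raise goodness for positive samples,
    lower it for negative samples; no gradient flows between layers."""
    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.SGD(self.fc.parameters(), lr=lr)

    def forward(self, x):
        # normalize the input so the previous layer's goodness cannot leak through
        x = x / (x.norm(dim=1, keepdim=True) + 1e-6)
        return torch.relu(self.fc(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)   # goodness of positive data
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)   # goodness of negative data
        loss = F.softplus(torch.cat([self.threshold - g_pos,    # want g_pos > threshold
                                     g_neg - self.threshold])).mean()  # want g_neg < threshold
        self.opt.zero_grad()
        loss.backward()                                  # gradient of a local objective only
        self.opt.step()
        # pass detached activations on, so layers never exchange gradients
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()

# toy usage: "positive" data is shifted, "negative" data is plain noise
layers = [FFLayer(20, 64), FFLayer(64, 64)]
x_pos, x_neg = torch.randn(32, 20) + 2.0, torch.randn(32, 20)
for _ in range(100):
    h_pos, h_neg = x_pos, x_neg
    for layer in layers:
        h_pos, h_neg = layer.train_step(h_pos, h_neg)
```

Each layer still differentiates its own local loss, but the global backward sweep through the network disappears, which is the property that makes the idea interesting for on-device learning.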
RockAI has also been conducting in-depth research on this. We have been quite successful in the non-transformer architecture direction (Yan architecture large model) and have entered into commercial cooperation. Beyond non-transformer, we are considering whether we can do something with backpropagation. Everyone knows that a major issue in machine learning is the "black box"; no one knows how the model arrives at its conclusions. It’s just a pile of network layers stacked together, fitting a complex expression that results in what we see. We have been thinking about whether we can combine backpropagation with visualization, designing an algorithm outside of backpropagation that can both converge the model and visualize the intermediate processes. This is a very challenging task in the field of machine learning.
We conducted an experiment based on MNIST to understand the reasons behind the model's predictions of specific digits.
For instance, if the predicted digit is 7, can we trace back through each layer of the network to see what each layer is "thinking"? The bottom layer is 7, the next layer is also 7, the layer above that looks somewhat like 7 or 9, and further up it becomes unclear (marked in red on the slide). Nevertheless, there is a gradual progression, and we get a rough idea of how the model arrives at this digit. This is quite difficult with the backpropagation algorithm, which is one reason Attention was developed; it can visualize some relationships. We are also exploring interpretable machine learning methods outside of backpropagation. For instance, tracing back from the digit 6 (marked in yellow on the slide), we can roughly understand what the preceding process looks like. Pushing further back, 6 actually resembles 8 somewhat, so we can infer that they might sometimes be grouped together in certain datasets.
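RockAI's own visualization method is not detailed here, but as a rough analogue using a standard technique (feature visualization by gradient ascent on the input, not RockAI's approach), one can ask what a unit at a given depth responds to, layer by layer. The sketch below assumes the small MLP has already been trained on MNIST; with random weights the images are meaningless, but the procedure is the same.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                    nn.Linear(256, 128), nn.ReLU(),
                    nn.Linear(128, 10))   # assumed pre-trained on MNIST

def visualize(prefix: nn.Module, unit: int, steps: int = 200, lr: float = 0.5):
    # gradient ascent on the input to maximize one unit's activation at this depth
    x = (0.1 * torch.rand(1, 784)).requires_grad_(True)
    for _ in range(steps):
        activation = prefix(x)[0, unit]
        grad, = torch.autograd.grad(activation, x)
        x = (x + lr * grad).clamp(0, 1).detach().requires_grad_(True)
    return x.detach().view(28, 28)        # an image this unit "wants to see"

img_hidden1 = visualize(mlp[:2], unit=0)  # what an early neuron responds to
img_hidden2 = visualize(mlp[:4], unit=0)  # one level up
img_class7  = visualize(mlp, unit=7)      # what the "7" output wants to see
print(img_class7.shape)                   # torch.Size([28, 28])
```

Comparing such images across depths gives a layer-by-layer picture in the spirit of the digit-7 example above, although it still relies on gradients rather than replacing backpropagation.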
This may seem like a small case: handwritten digits are a simple example, but it is vital for large models, since they lack interpretability. Large models can make nonsensical claims, and people might believe them to be true. Therefore, we need models that can explain themselves, which requires moving away from the backpropagation algorithm. I have always believed that backpropagation is a very "bad" algorithm, yet currently it is the most effective one.
Future Development Trends of Edge AI
What might the future development trends of edge AI be? We believe it is collective intelligence. This may differ from everyone's understanding: models like those from OpenAI and DeepSeek seem promising, so why are we considering an alternative path?
As mentioned earlier, edge large models need to possess autonomous learning and memory capabilities, which means the backpropagation process would have to run on edge devices, and that is certainly not feasible at this stage. To achieve these two capabilities, the algorithms must be improved. Once each device possesses autonomous learning and memory, it will shape a real and unique individual through learning and memory, just as humans do. Our definition of collective intelligence is: several intelligent units with autonomous learning capabilities, working together through environmental perception, self-organization, and interactive collaboration to solve complex problems, thereby achieving an overall enhancement of intelligence in a constantly changing environment. This process is similar to the evolution of human society.
I believe the history of human civilization is a history of collective intelligence. Everyone's intelligence, including that of every colleague in this room, is superior to that of primitive humans ten thousand years ago; we know more than they did. Why are we better than them? Because in this era we learn from each other within the group, enhancing each other's abilities through shared experience, which produces a genuine emergence of generalized intelligence. We believe the emergence of intelligence should be a collective process, rather than the process of knowledge collection and expression seen in today's large models. We have always hoped that collective intelligence will become a new consensus for AGI, a direction RockAI has pioneered in the industry.
Previously, some people questioned the idea of collective intelligence, but by 2025 more and more people are recognizing it and taking action. The difficulties faced by traditional large models are beyond imagination, including data collection, insufficient computing power, and so on. The data-wall problem will certainly arrive between 2026 and 2032. OpenAI has the most data and the most powerful data centers, and its model performance is acceptable, but that does not mean it can change the world; the problem lies in the algorithm.
Imagine that the books we read in elementary school may not even contain 1 million tokens, and the knowledge we learned was not extensive, yet we already exhibited good intelligence. Where's the problem? Isn't it a problem with the algorithm itself? Do we learn elementary knowledge only through books? Clearly not; we also learn through interactions with other children, where autonomous learning and memory capabilities come into play. This process is essentially the process of collective intelligence.
In 2024, we began running multimodal large models on the Raspberry Pi, which has extremely low computing power. Transformer-architecture models of the same parameter scale cannot run on the Raspberry Pi, yet we can run not just a model but a multimodal one. We are currently working on synchronous autonomous learning, meaning training and inference happen simultaneously: while the model performs inference and outputs results on the device, it is also learning. This enables the model's continuous evolution and environmental adaptability, allowing us to move step by step toward collective intelligence.
The Yan architecture large model can run directly on the CPU without using an NPU. It is deployed on mobile phones without pruning, quantization, or distillation, and it responds quickly. In China, only Doubao (a cloud large model) and RockAI can achieve this.
During this year's Two Sessions, our partners deployed the Yan architecture large model. The offline deployment of the model on devices is also very smooth and seamless.
Therefore, the brand-new architecture is very important for the model. "Make Every Device Its Own Intelligence" is not just a slogan; the new architecture makes it achievable. From this perspective, we believe an important development in the next stage is that intelligence will redefine hardware, and such a transformation will happen quickly. The hardware form of the intelligent era will be different from today's, and hardware and algorithms will be co-designed. Again, I want to emphasize that the traditional perception of edge large models as smaller parameter versions of cloud large models is definitely incorrect, because hardware with true intelligence will change our world. Without a deep understanding of this trend, one might think that distilling a 70B transformer-architecture large model down to 3B parameters and putting it into a device is sufficient. But that is not a true edge large model, as it lacks autonomous learning and memory capabilities.
At the same time, the combination of edge and cloud will also change significantly. There is a common misconception here: understanding the edge-cloud combination as terminal devices lacking computing power and relying on the cloud to make up the difference. This model of "insufficient edge, supplemented by cloud" will definitely be abandoned, just as we must abandon the dichotomy of large models and small models.
The human brain has approximately 86 billion neurons; a mouse has far fewer, yet it still possesses intelligence. A model's level of intelligence is related to its parameter count, but the relationship is nonlinear: fewer parameters does not equate to weaker capability. Today's models with hundreds of billions or trillions of parameters do not match the intelligence level of a mouse. Is this a problem of parameter quantity? No; it is the technological approach that needs to change.

