LLMs on iPhone
The research paper (first spotted by VentureBeat) is titled “LLM in a flash: Efficient Large Language Model Inference with Limited Memory”. It tackles a key challenge in running LLMs on-device, especially on devices with limited DRAM capacity. LLMs contain billions of parameters, so running them on devices with restricted DRAM is difficult. To solve this problem, the paper proposes storing the model parameters in flash memory and bringing them into DRAM on demand.
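To make the idea of on-demand loading concrete, here is a minimal sketch rather than the paper's implementation. It assumes a weight matrix stored row by row in an ordinary file that stands in for flash, and reads a single row into memory only when it is requested. The helper names `write_rows` and `read_row`, the file name, and the tiny matrix are all hypothetical.

```python
import array
import os
import tempfile

FLOAT_SIZE = array.array("f").itemsize  # bytes per 32-bit float


def write_rows(path, rows):
    """Store every row contiguously in a file that stands in for flash."""
    with open(path, "wb") as f:
        for row in rows:
            array.array("f", row).tofile(f)


def read_row(path, index, row_len):
    """Bring one row into memory on demand instead of loading the whole file."""
    with open(path, "rb") as f:
        f.seek(index * row_len * FLOAT_SIZE)  # jump straight to the row
        buf = array.array("f")
        buf.fromfile(f, row_len)
    return buf.tolist()


# Usage: keep three rows on "flash", then fetch only row 1.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
write_rows(path, [[0.0, 1.0, 2.0, 3.0],
                  [10.0, 11.0, 12.0, 13.0],
                  [20.0, 21.0, 22.0, 23.0]])
row = read_row(path, 1, row_len=4)
```

Only the requested row ever occupies memory here; the rest of the matrix stays in the file.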
Keivan Alizadeh, a Machine Learning Engineer at Apple and lead author of the paper, said, “Our method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.”
The team used two principal techniques: windowing and row-column bundling. Windowing reuses previously activated neurons to reduce data transfer, while row-column bundling increases the size of the data chunks read from flash memory. Together, these techniques led to a 4-5x increase in inference speed on the Apple M1 Max SoC compared with naive loading.
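The two techniques can be illustrated with a small sketch. This is a simplified model under stated assumptions, not Apple's code. For windowing, it assumes we already know which neurons are active for each token, and it keeps in DRAM only the neurons used by the last few tokens. For bundling, it assumes a feed-forward layer with an up-projection and a down-projection matrix, and stores column i of the first next to row i of the second so that one read fetches both. The names `SlidingWindowCache` and `bundle_rows_and_columns` are hypothetical.

```python
from collections import deque


class SlidingWindowCache:
    """Tracks which neurons are held in DRAM for the last `window` tokens."""

    def __init__(self, window):
        self.window = window
        self.recent = deque()  # one set of active neuron ids per recent token
        self.resident = set()  # neuron ids currently held in DRAM

    def step(self, active):
        """Process one token; return (ids to load from flash, ids to evict)."""
        active = set(active)
        to_load = active - self.resident  # only the neurons not already cached
        self.recent.append(active)
        if len(self.recent) > self.window:
            self.recent.popleft()  # the oldest token leaves the window
        needed = set().union(*self.recent)
        to_evict = self.resident - needed
        self.resident = needed
        return to_load, to_evict


def bundle_rows_and_columns(up_proj, down_proj):
    """Concatenate column i of up_proj with row i of down_proj for each neuron.

    up_proj is a list of d_model rows with d_ff entries each;
    down_proj is a list of d_ff rows with d_model entries each.
    """
    n_neurons = len(down_proj)
    return [
        [row[i] for row in up_proj] + list(down_proj[i])
        for i in range(n_neurons)
    ]


# Usage: with a window of two tokens, overlapping activations are reused.
cache = SlidingWindowCache(window=2)
first = cache.step([1, 2, 3])   # everything must be loaded
second = cache.step([2, 3, 4])  # only neuron 4 is new
third = cache.step([3, 4, 5])   # neuron 5 is loaded, neuron 1 is evicted
```

In this toy model, the second and third tokens each load one neuron instead of three, which is the kind of saving windowing targets. Bundling does not reduce how much data is read, but it turns two small reads per neuron into one larger contiguous read.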
In theory, this context-adaptive loading could pave the way for running LLMs on devices with limited memory such as iPhones and iPads.
Source: tech.hindustantimes.com