
EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices

Posted on 2025-03-27, 10:27, authored by Zheyu Shen

Large Language Models (LLMs) have gained significant attention due to their versatility across a wide array of applications. Fine-tuning LLMs with parameter-efficient adapters, such as Low-Rank Adaptation (LoRA), enables these models to efficiently adapt to downstream tasks without extensive retraining.

Deploying fine-tuned LLMs on multi-tenant edge devices offers substantial benefits, such as reduced latency, enhanced privacy, and personalized responses.

However, serving LLMs efficiently on resource-constrained edge devices presents critical challenges, including the complexity of selecting the right adapter for each task and the memory overhead of frequent adapter swapping. Moreover, with multiple concurrent requests in multi-tenant settings, processing requests sequentially underutilizes computational resources and incurs significant latency.

This paper introduces EdgeLoRA, an efficient system for serving LLMs on edge devices in multi-tenant environments.


EdgeLoRA incorporates three key innovations: (1) an adaptive adapter selection mechanism that streamlines the adapter configuration process; (2) heterogeneous memory management, which leverages intelligent adapter caching and pooling to mitigate memory operation overhead; and (3) batch LoRA inference, which enables efficient batch processing to significantly reduce computational latency.
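The batch LoRA inference idea can be illustrated with a minimal numpy sketch: the expensive base-model matmul is shared across the whole batch, while each request's low-rank adapter delta is applied separately. The function name, shapes, and per-request loop below are illustrative assumptions, not the system's actual implementation.

```python
import numpy as np

def batch_lora_forward(x, W, adapters, adapter_ids):
    """Illustrative batched LoRA forward pass (not the actual system code).

    x           : (batch, d_in) input activations
    W           : (d_in, d_out) shared, frozen base weight
    adapters    : dict mapping adapter id -> (A, B), with A (d_in, r), B (r, d_out)
    adapter_ids : list of length `batch`, adapter id for each request
    """
    # One shared matmul for the whole batch -- the key saving versus
    # running each tenant's request through the model sequentially.
    out = x @ W
    # Each request then adds only its own cheap low-rank correction.
    for i, aid in enumerate(adapter_ids):
        A, B = adapters[aid]
        out[i] += (x[i] @ A) @ B
    return out
```

Because the LoRA matrices are low-rank (r much smaller than d_in, d_out), the per-request corrections add little compute on top of the shared base matmul, which is what makes serving many adapters in one batch practical.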

Comprehensive evaluations using the Llama3.1-8B model demonstrate that EdgeLoRA significantly outperforms the status quo (i.e., llama.cpp) in both latency and throughput. EdgeLoRA achieves up to a 4× boost in throughput with lower energy consumption. Even more impressively, it serves several orders of magnitude more adapters simultaneously without sacrificing inference performance.

These results highlight EdgeLoRA's potential to transform edge deployment of LLMs in multi-tenant scenarios, offering a scalable and efficient solution for resource-constrained environments.
