ENHANCING RESOURCE UTILIZATION IN CLOUD-NATIVE CLUSTERS THROUGH CUSTOM SCHEDULING
Optimizing resource utilization and performance in cloud-native environments has become increasingly critical as applications grow in complexity and demand specialized resources such as graphics processing units (GPUs). This study explores how enhancements to the Kubernetes control plane, specifically through advanced schedulers and autoscalers, can improve resource efficiency and application performance. Focusing on environments that are sensitive to network conditions and require GPU resources, the research investigates how these control plane components can be optimized to meet the growing demands of modern applications.
The research focuses on two key areas: implementing network-aware scheduling and developing GPU autoscaling mechanisms based on real-time demand. Network-aware scheduling strategies have been proposed and tested to reduce communication overhead between microservices by considering factors such as latency and bandwidth. A comparative analysis between Kubernetes' default scheduler and custom extension plugins reveals the trade-offs between simplicity and efficiency in resource allocation decisions. In addition, custom autoscaling mechanisms have been developed to dynamically manage GPU resources, ensuring that workloads maintain performance under varying load conditions. The findings demonstrate that extending the Kubernetes control plane with custom scheduling and autoscaling techniques can significantly enhance resource utilization and application performance.
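The thesis's actual scheduler implementation is not reproduced here; purely as an illustration, the following Go sketch shows how a network-aware scoring criterion could be expressed as a Score plugin in the Kubernetes scheduler framework (extension-point signatures vary slightly across Kubernetes versions). The plugin name, the scoring formula, and the lookupLatency helper are assumptions made for the example, not the author's code.

package networkaware

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// NetworkAwareScore is an illustrative Score plugin that prefers nodes with
// lower measured latency to the pod's declared microservice dependencies.
type NetworkAwareScore struct{}

var _ framework.ScorePlugin = &NetworkAwareScore{}

func (p *NetworkAwareScore) Name() string { return "NetworkAwareScore" }

// Score ranks a candidate node: lower latency to the pod's dependencies yields
// a higher score, clamped to the framework's [MinNodeScore, MaxNodeScore] range.
func (p *NetworkAwareScore) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	latencyMs := lookupLatency(pod, nodeName)
	score := framework.MaxNodeScore - latencyMs
	if score < framework.MinNodeScore {
		score = framework.MinNodeScore
	}
	return score, framework.NewStatus(framework.Success)
}

func (p *NetworkAwareScore) ScoreExtensions() framework.ScoreExtensions { return nil }

// lookupLatency is a hypothetical stand-in for a real latency source, e.g.
// node-to-node probe measurements or annotations; it is not a Kubernetes API.
func lookupLatency(pod *v1.Pod, nodeName string) int64 {
	return 0 // placeholder: a real plugin would query a metrics store here
}

In practice, such a plugin would be registered with the scheduler framework and combined with the default scoring plugins, so that network cost becomes one factor among several in the placement decision.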
This study shows that such improvements reduce latency, boost workload performance, and enable better resource allocation in cloud-native environments, particularly those with stringent performance requirements and high demand for specialized resources. The proposed solution, the network-aware scheduler (NAS), reduces average latency by 52.66% compared to the default Kubernetes scheduler and by 2.68% compared to Diktyo, while reducing maximum latency spikes by 85.61% and 7.23%, respectively. Furthermore, NAS distributes workloads effectively and provides co-location benefits by considering microservice dependencies and network costs during pod placement. In parallel, the dynamic GPU pod autoscaling strategy, driven by real-time model server metrics such as token throughput and Key-Value (KV) cache utilization, reduces the Time to First Token (TTFT) by approximately 70%, stabilizing performance even under high load (1,000 queries per second, QPS) and keeping GPU KV cache utilization below 45% after scaling.
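A metrics-driven GPU pod autoscaler of the kind described above can be pictured as a control loop that polls model server metrics and adjusts replica counts through the Kubernetes scale subresource. The Go sketch below uses client-go; the getKVCacheUtilization helper, the use of the 45% KV cache figure as a scale-out threshold, and the deployment name are illustrative assumptions rather than the thesis's implementation.

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// kvCacheTarget is an assumed scale-out threshold, loosely based on the
// abstract's goal of keeping KV cache utilization below ~45% after scaling.
const kvCacheTarget = 0.45

// getKVCacheUtilization is a hypothetical helper that would query the model
// server's metrics endpoint (e.g. a KV cache usage gauge); the real metric
// pipeline used in the thesis is not shown here.
func getKVCacheUtilization() float64 {
	return 0.0 // placeholder value
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ns, deploy := "default", "gpu-model-server" // hypothetical namespace and deployment

	// Poll the metric periodically and add a GPU pod whenever utilization
	// exceeds the target; a fuller design would also scale back down.
	for range time.Tick(30 * time.Second) {
		if getKVCacheUtilization() <= kvCacheTarget {
			continue
		}
		ctx := context.TODO()
		scale, err := client.AppsV1().Deployments(ns).GetScale(ctx, deploy, metav1.GetOptions{})
		if err != nil {
			continue
		}
		scale.Spec.Replicas++ // add one replica; real logic could scale proportionally to load
		_, _ = client.AppsV1().Deployments(ns).UpdateScale(ctx, deploy, scale, metav1.UpdateOptions{})
	}
}

The same loop structure applies to other triggers mentioned in the abstract, such as token throughput, by swapping the metric query and threshold.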
Degree Type
- Master of Science
Department
- Computer and Information Technology
Campus location
- West Lafayette