ENHANCING RESOURCE UTILIZATION IN CLOUD-NATIVE CLUSTERS THROUGH CUSTOM SCHEDULING
Optimizing resource utilization and performance in cloud-native environments has become increasingly critical as applications grow in complexity and demand specialized resources such as graphics processing units (GPUs). This study explores how enhancements to the Kubernetes control plane, specifically through advanced schedulers and autoscalers, can improve resource efficiency and application performance. Focusing on environments that are sensitive to network conditions and require GPU resources, the research investigates how these control plane components can be optimized to meet the growing demands of modern applications.
The research focuses on two key areas: implementing network-aware scheduling and developing GPU autoscaling mechanisms based on real-time demand. Network-aware scheduling strategies have been proposed and tested to reduce communication overhead between microservices by considering factors such as latency and bandwidth. A comparative analysis between Kubernetes' default scheduler and custom extension plugins reveals the trade-offs between simplicity and efficiency in resource allocation decisions. In addition, custom autoscaling mechanisms have been developed to dynamically manage GPU resources, ensuring that workloads maintain performance under varying load conditions. The findings demonstrate that extending the Kubernetes control plane with custom scheduling and autoscaling techniques can significantly enhance resource utilization and application performance.
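The thesis's actual scheduler implementation is not reproduced here; purely as an illustration, the following Go sketch shows how a network-aware scoring criterion could be expressed as a Score plugin in the Kubernetes scheduler framework (extension-point signatures vary slightly across Kubernetes versions). The plugin name, the scoring formula, and the lookupLatency helper are assumptions made for the example, not the author's code.

package networkaware

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// NetworkAwareScore is an illustrative Score plugin that prefers nodes with
// lower measured latency to the pod's declared microservice dependencies.
type NetworkAwareScore struct{}

var _ framework.ScorePlugin = &NetworkAwareScore{}

func (p *NetworkAwareScore) Name() string { return "NetworkAwareScore" }

// Score ranks a candidate node: lower latency to the pod's dependencies yields
// a higher score, clamped to the framework's [MinNodeScore, MaxNodeScore] range.
func (p *NetworkAwareScore) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	latencyMs := lookupLatency(pod, nodeName)
	score := framework.MaxNodeScore - latencyMs
	if score < framework.MinNodeScore {
		score = framework.MinNodeScore
	}
	return score, framework.NewStatus(framework.Success)
}

func (p *NetworkAwareScore) ScoreExtensions() framework.ScoreExtensions { return nil }

// lookupLatency is a hypothetical stand-in for a real latency source, e.g.
// node-to-node probe measurements or annotations; it is not a Kubernetes API.
func lookupLatency(pod *v1.Pod, nodeName string) int64 {
	return 0 // placeholder: a real plugin would query a metrics store here
}

In practice, such a plugin would be registered with the scheduler framework and combined with the default scoring plugins, so that network cost becomes one factor among several in the placement decision.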
This study shows that such improvements reduce latency, boost workload performance, and enable better resource allocation in cloud-native environments, particularly those with stringent performance requirements and high demand for specialized resources. The proposed solution, the network-aware scheduler (NAS), reduces average latency by 52.66% compared to the default Kubernetes scheduler and by 2.68% compared to Diktyo, while reducing maximum latency spikes by 85.61% and 7.23%, respectively. Furthermore, NAS distributes workloads effectively and provides co-location benefits by considering microservice dependencies and network costs during pod placement. In parallel, the dynamic GPU pod autoscaling strategy, driven by real-time model server metrics such as token throughput and Key-Value (KV) cache utilization, reduces the Time to First Token (TTFT) by approximately 70%, stabilizing performance even under high load (1,000 queries per second, QPS) and keeping GPU KV cache utilization below 45% after scaling.
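A metrics-driven GPU pod autoscaler of the kind described above can be pictured as a control loop that polls model server metrics and adjusts replica counts through the Kubernetes scale subresource. The Go sketch below uses client-go; the getKVCacheUtilization helper, the use of the 45% KV cache figure as a scale-out threshold, and the deployment name are illustrative assumptions rather than the thesis's implementation.

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// kvCacheTarget is an assumed scale-out threshold, loosely based on the
// abstract's goal of keeping KV cache utilization below ~45% after scaling.
const kvCacheTarget = 0.45

// getKVCacheUtilization is a hypothetical helper that would query the model
// server's metrics endpoint (e.g. a KV cache usage gauge); the real metric
// pipeline used in the thesis is not shown here.
func getKVCacheUtilization() float64 {
	return 0.0 // placeholder value
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ns, deploy := "default", "gpu-model-server" // hypothetical namespace and deployment

	// Poll the metric periodically and add a GPU pod whenever utilization
	// exceeds the target; a fuller design would also scale back down.
	for range time.Tick(30 * time.Second) {
		if getKVCacheUtilization() <= kvCacheTarget {
			continue
		}
		ctx := context.TODO()
		scale, err := client.AppsV1().Deployments(ns).GetScale(ctx, deploy, metav1.GetOptions{})
		if err != nil {
			continue
		}
		scale.Spec.Replicas++ // add one replica; real logic could scale proportionally to load
		_, _ = client.AppsV1().Deployments(ns).UpdateScale(ctx, deploy, scale, metav1.UpdateOptions{})
	}
}

The same loop structure applies to other triggers mentioned in the abstract, such as token throughput, by swapping the metric query and threshold.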
Degree Type
- Master of Science
Department
- Computer and Information Technology
Campus location
- West Lafayette